Knowledge Operations Guide
The Operanix Knowledge Operations pipeline transforms raw enterprise data into verified, agent-ready knowledge. This guide covers the complete 7-step pipeline, two-layer RAG architecture, compliance gates, and tenant intelligence system.
Pipeline Overview
The knowledge pipeline processes enterprise data through seven sequential stages, each with built-in quality gates and compliance checks. Data enters as raw sources and exits as verified, indexed knowledge ready for agent consumption.
| Step | Stage | Purpose | Output |
|---|---|---|---|
| 1 | Sources | Connect and configure data sources | Raw content streams |
| 2 | Review | Human review of extracted content | Approved documents |
| 3 | Agent Coverage | Map knowledge to agent domains | Coverage assignments |
| 4 | Training & Eval | Fine-tune retrieval and validate quality | Trained embeddings |
| 5 | Tenant Intelligence | 12-stage deep enrichment pipeline | Enriched knowledge graph |
| 6 | Publish | Deploy to production with compliance sign-off | Live knowledge base |
| 7 | Pipeline Runs | Monitor execution, logs, and health | Audit trail |
Step 1: Sources
The Sources tab is where you connect enterprise data to the knowledge pipeline. Operanix supports a wide range of source types, each with configurable crawl schedules and extraction settings.
Supported Source Types
- Web Crawl — Provide a root URL and Operanix recursively crawls the site, respecting robots.txt and rate limits. Configurable depth (1–5 levels), URL patterns for include/exclude, and automatic sitemap detection.
- Document Upload — Upload PDF, DOCX, XLSX, PPTX, TXT, CSV, and Markdown files directly. Files are parsed with layout-aware extraction that preserves tables, headings, and list structure.
- API Connector — Pull content from REST APIs on a schedule. Supports OAuth 2.0, API key, and bearer token authentication. JSON response mapping lets you specify which fields contain the content body, title, and metadata.
- Knowledge Base Sync — Connect to existing knowledge bases (Confluence, Notion, SharePoint, Google Drive) via native integrations. Incremental sync detects only changed pages.
- CRM & Ticketing — Import resolved support tickets, FAQ entries, and product documentation from Zendesk, Salesforce, HubSpot, and Intercom.
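The JSON response mapping described for the API Connector can be pictured as a set of field paths applied to each API record. The sketch below is an illustration only: the `mapping` shape, the dot-path syntax, and the `getPath` and `mapRecord` helpers are assumptions, not the Operanix configuration format.

```javascript
// Hypothetical mapping: which response fields hold the title, body, and metadata.
const mapping = {
  title: "attributes.name",
  body: "attributes.content.text",
  metadata: { updated: "attributes.updated_at", author: "author.display_name" },
};

// Resolve a dot-separated path such as "a.b.c" against a nested object.
// Returns undefined when any segment is missing.
function getPath(obj, path) {
  return path
    .split(".")
    .reduce((node, key) => (node != null ? node[key] : undefined), obj);
}

// Apply the mapping to one API record and return a normalized document.
function mapRecord(record, fieldMapping) {
  const metadata = {};
  for (const [name, path] of Object.entries(fieldMapping.metadata)) {
    metadata[name] = getPath(record, path);
  }
  return {
    title: getPath(record, fieldMapping.title),
    body: getPath(record, fieldMapping.body),
    metadata,
  };
}

// Example record with made-up values; the author field is deliberately absent.
const record = {
  attributes: {
    name: "Resetting a password",
    content: { text: "Open Settings and choose Reset password." },
    updated_at: "2024-01-01",
  },
};
const mapped = mapRecord(record, mapping);
```

Missing fields resolve to `undefined` rather than throwing, which lets a later review step decide whether an incomplete record should be rejected.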
Crawl Configuration
Scheduling
Each source can be configured with a crawl schedule: hourly, daily, weekly, or a custom cron expression. The pipeline tracks content hashes to skip unchanged documents, reducing compute and API costs. In the example below, the cron expression 0 2 * * * runs the crawl daily at 02:00.
```json
{
  "source": "web_crawl",
  "url": "https://docs.example.com",
  "schedule": "0 2 * * *",
  "depth": 3,
  "include_patterns": ["/docs/*", "/api/*"],
  "exclude_patterns": ["/blog/*", "/changelog/*"],
  "respect_robots": true,
  "max_pages": 500
}
```
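To make the include and exclude patterns concrete, here is a minimal sketch of how a crawler might decide whether a URL is in scope. It assumes that `*` is the only wildcard, that patterns are matched against the URL path, and that exclude patterns take precedence over include patterns. None of these rules is confirmed by the configuration above, and `globToRegExp` and `shouldCrawl` are hypothetical helper names.

```javascript
// Convert a simple glob such as "/docs/*" into an anchored RegExp.
// Only "*" is treated as a wildcard; every other character is matched literally.
function globToRegExp(glob) {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

// Decide whether a URL is in scope for the crawl.
// Assumption: an exclude match always wins over an include match.
function shouldCrawl(url, config) {
  const path = new URL(url).pathname;
  const matchesAny = (patterns) =>
    patterns.some((pattern) => globToRegExp(pattern).test(path));
  if (matchesAny(config.exclude_patterns)) return false;
  return config.include_patterns.length === 0 || matchesAny(config.include_patterns);
}

// Patterns copied from the configuration example.
const crawlConfig = {
  include_patterns: ["/docs/*", "/api/*"],
  exclude_patterns: ["/blog/*", "/changelog/*"],
};

const inScope = shouldCrawl("https://docs.example.com/docs/intro", crawlConfig);
const outOfScope = shouldCrawl("https://docs.example.com/blog/launch", crawlConfig);
```

Under these assumptions a path such as /pricing is skipped as well, because it matches no include pattern.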
Step 2: Review
Every piece of extracted content enters the Review queue before it can proceed through the pipeline. This human-in-the-loop stage ensures that only relevant, accurate, and appropriate content reaches your agents.
Review Workflow
- Auto-classification — Incoming content is automatically categorized by topic, sensitivity level, and relevance score. Low-relevance content is flagged for quick rejection.
- Content preview — Reviewers see the extracted text alongside the original source for comparison. Extraction errors (broken tables, missing sections) are highlighted.
- Bulk actions — Select multiple items to approve, reject, or flag for re-extraction in batch.
- PII detection — Automatic scanning for personally identifiable information (emails, phone numbers, SSNs, credit card numbers) with inline redaction tools.
- Compliance tagging — Reviewers can tag content with compliance labels (HIPAA, PCI, GDPR, internal-only) that control downstream access.
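The pattern-matching side of PII detection can be sketched with a few regular expressions. This is an illustration under stated assumptions, not the Operanix detector: the patterns are deliberately simple, cover only emails, US-style SSNs, and card-like digit runs, and the `redactPii` helper and the `[REDACTED:...]` placeholder format are hypothetical.

```javascript
// Illustrative patterns only; production PII detection needs broader coverage
// and validation (for example, checksum checks on card numbers).
const PII_PATTERNS = [
  { label: "EMAIL", regex: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "SSN", regex: /\b\d{3}-\d{2}-\d{4}\b/g },
  // 13 to 16 digits, optionally separated by single spaces or hyphens.
  { label: "CARD", regex: /\b\d(?:[ -]?\d){12,15}\b/g },
];

// Replace each match with a placeholder and record what was found.
function redactPii(text) {
  const findings = [];
  let redacted = text;
  for (const { label, regex } of PII_PATTERNS) {
    redacted = redacted.replace(regex, (match) => {
      findings.push({ label, match });
      return `[REDACTED:${label}]`;
    });
  }
  return { redacted, findings };
}

const sample = redactPii("Contact jane@example.com, SSN 123-45-6789.");
```

Returning the findings alongside the redacted text gives a reviewer something to confirm or override, which fits the inline redaction workflow described above.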
Step 3: Agent Coverage
After review, approved content must be mapped to one or more agents. The Agent Coverage tab provides a matrix view showing which knowledge domains are assigned to which agents.
Coverage Matrix
The coverage matrix displays agents on one axis and knowledge domains on the other. Each cell shows a coverage status:
- Full coverage — Agent has all relevant knowledge for this domain.
- Partial coverage — Some knowledge is assigned but gaps exist. The system identifies specific missing topics.
- No coverage — Domain is not assigned to this agent.
- Overlap warning — Multiple agents cover the same domain, which may cause conflicting answers. Flagged for resolution.
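One way to picture how the cell statuses could be derived is to compare each domain's topics with the topics assigned to each agent. The sketch below assumes hypothetical data shapes (a topic list per domain, and per-agent topic assignments keyed by domain), assumes every domain has at least one topic, and treats the overlap warning as a per-domain flag. The actual Operanix data model may differ.

```javascript
// Classify one matrix cell by comparing a domain's topics with an agent's assigned topics.
function cellStatus(domainTopics, assignedTopics) {
  const assigned = new Set(assignedTopics);
  const missing = domainTopics.filter((topic) => !assigned.has(topic));
  if (missing.length === 0) return { status: "full", missing };
  if (missing.length === domainTopics.length) return { status: "none", missing };
  return { status: "partial", missing };
}

// Build the matrix and flag domains covered (fully or partially) by more than one agent.
function buildCoverageMatrix(domains, agents) {
  const matrix = {};
  for (const [domain, topics] of Object.entries(domains)) {
    const cells = {};
    for (const [agent, assignments] of Object.entries(agents)) {
      cells[agent] = cellStatus(topics, assignments[domain] ?? []);
    }
    const coveringAgents = Object.values(cells).filter((cell) => cell.status !== "none");
    matrix[domain] = { cells, overlapWarning: coveringAgents.length > 1 };
  }
  return matrix;
}

// Made-up example data.
const domains = { billing: ["refunds", "invoices"], shipping: ["tracking"] };
const agents = {
  support: { billing: ["refunds", "invoices"] },
  sales: { billing: ["refunds"] },
};
const matrix = buildCoverageMatrix(domains, agents);
```

In this example the billing domain is fully covered by one agent and partially by another, so it carries an overlap warning, while shipping has no coverage at all. The `missing` list on each cell corresponds to the specific missing topics reported for partial coverage.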
Auto-Assignment
Enable auto-assignment to let Operanix automatically map new knowledge to agents based on their configured specialization domains. Auto-assigned content still appears in the coverage dashboard for manual review.
Step 4: Training & Evaluation
Once knowledge is mapped to agents, the Training & Eval stage validates that agents can actually retrieve and use the knowledge correctly.
Retrieval Training
- Embedding generation — Content is chunked using semantic boundary detection (not fixed-size splitting) and embedded using the configured model. Chunk sizes adapt to content type: shorter for FAQ pairs, longer for technical documentation.
- Index optimization — Vector indices are built with HNSW (Hierarchical Navigable Small World) graphs for fast approximate nearest-neighbor search. The system benchmarks retrieval accuracy against a golden test set before promoting new indices.
- Query simulation — Synthetic queries are generated from the knowledge content to test retrieval paths. Queries that fail to retrieve the correct chunks are flagged.
Evaluation Metrics
| Metric | Target | Description |
|---|---|---|
| Recall@5 | ≥ 0.90 | Correct chunk appears in top 5 results |
| MRR | ≥ 0.80 | Mean reciprocal rank of correct chunk |
| Latency P95 | ≤ 200ms | 95th percentile retrieval time |
| Chunk relevance | ≥ 0.85 | LLM-judged relevance of top chunk to query |
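Recall@5 and MRR follow directly from ranked retrieval results. The sketch below uses the standard definitions of both metrics and the targets from the table. The input shape (one expected chunk ID per test query plus a ranked list of retrieved IDs) and the function names are assumptions made for illustration.

```javascript
// Targets taken from the metrics table.
const TARGETS = { recallAt5: 0.9, mrr: 0.8 };

// Fraction of queries whose expected chunk appears in the top k results.
function recallAtK(results, k) {
  const hits = results.filter((r) => r.retrieved.slice(0, k).includes(r.expected));
  return hits.length / results.length;
}

// Mean of 1/rank of the expected chunk; a query contributes 0 when the chunk is not retrieved.
function meanReciprocalRank(results) {
  const total = results.reduce((sum, r) => {
    const rank = r.retrieved.indexOf(r.expected) + 1; // 0 when not found
    return sum + (rank > 0 ? 1 / rank : 0);
  }, 0);
  return total / results.length;
}

// Made-up evaluation results: retrieved IDs are ordered best-first.
const evalResults = [
  { expected: "a", retrieved: ["a", "b"] },
  { expected: "c", retrieved: ["x", "c"] },
  { expected: "z", retrieved: ["p", "q"] },
  { expected: "d", retrieved: ["1", "2", "3", "d"] },
];

const report = {
  recallAt5: recallAtK(evalResults, 5),
  mrr: meanReciprocalRank(evalResults),
};
const meetsTargets =
  report.recallAt5 >= TARGETS.recallAt5 && report.mrr >= TARGETS.mrr;
```

With this sample, three of four queries find their chunk in the top five, so the set would fall short of both targets and the new index would not be promoted.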
Step 5: Tenant Intelligence
The Tenant Intelligence pipeline is a 12-stage deep enrichment process that transforms raw knowledge into a richly connected knowledge graph. This is the most compute-intensive stage and runs asynchronously.
12-Stage Pipeline
Stages 1–4: Extraction
- Stage 1: Entity extraction — Identifies products, features, people, organizations, dates, and domain-specific entities using NER models tuned to your industry.
- Stage 2: Relationship mapping — Detects relationships between entities (e.g., "Product X integrates with Service Y") and builds an entity graph.
- Stage 3: Topic clustering — Groups related chunks into coherent topics using hierarchical clustering. Topics become navigable categories in the knowledge base.
- Stage 4: Sentiment & intent analysis — Tags content with sentiment polarity and detected user intent (informational, transactional, navigational).
Stages 5–8: Enrichment
- Stage 5: Gap detection — Identifies topics mentioned but not fully covered. Generates gap reports with suggested content to author.
- Stage 6: Contradiction detection — Cross-references facts across documents to find conflicting statements (e.g., different pricing on two pages).
- Stage 7: Freshness scoring — Assigns decay scores based on content age, update frequency, and domain volatility. Stale content is flagged for re-crawl or manual update.
- Stage 8: Cross-reference linking — Creates bidirectional links between related chunks, enabling agents to follow context chains when answering complex queries.
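Freshness scoring in Stage 7 can be illustrated with a simple exponential decay. This is a sketch under loud assumptions: the guide does not state the formula, so the half-life model, the per-domain half-life values, the 0.5 staleness threshold, and the function names are all hypothetical. Update frequency, which the stage also considers, is not modeled here.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical half-lives in days, standing in for "domain volatility":
// more volatile domains decay faster.
const HALF_LIFE_DAYS = { pricing: 30, documentation: 90, policy: 180 };

// Score is 1.0 for brand-new content and halves every half-life.
function freshnessScore(lastUpdated, domain, now = new Date()) {
  const halfLife = HALF_LIFE_DAYS[domain];
  if (halfLife === undefined) throw new Error(`Unknown domain: ${domain}`);
  const ageDays = Math.max(0, (now - lastUpdated) / MS_PER_DAY);
  return Math.pow(0.5, ageDays / halfLife);
}

// Content below the (assumed) threshold is flagged for re-crawl or manual update.
function isStale(score, threshold = 0.5) {
  return score < threshold;
}

const now = new Date("2024-03-01T00:00:00Z");
const pricingScore = freshnessScore(new Date("2024-01-31T00:00:00Z"), "pricing", now);
const policyScore = freshnessScore(new Date("2024-01-31T00:00:00Z"), "policy", now);
```

Both pages in the example are the same age, but the pricing page scores lower because its assumed half-life is shorter.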
Stages 9–12: Quality & Compliance
- Stage 9: Deduplication — Deterministic deduplication using content hashing (SHA-256) and semantic similarity. Near-duplicates (cosine similarity > 0.92) are merged, keeping the version with the higher quality score and more recent timestamp.
- Stage 10: Compliance classification — Automated classification against configured compliance frameworks (SOC 2, HIPAA, GDPR, PCI-DSS). Content that triggers compliance rules is routed to the compliance gate.
- Stage 11: Quality scoring — Each chunk receives a composite quality score based on completeness, clarity, accuracy confidence, and source authority.
- Stage 12: Index promotion — Final stage packages the enriched knowledge graph and promotes it to the production index with a versioned snapshot for rollback.
Two-Layer RAG Architecture
Operanix uses a two-layer retrieval-augmented generation (RAG) architecture that combines structured entity retrieval with document chunk retrieval to improve answer accuracy.
Layer 1: Structured Entities
The first retrieval layer queries the entity graph built during Tenant Intelligence. When an agent receives a question, it first identifies relevant entities (products, features, policies) and retrieves their structured attributes and relationships. This layer provides precise, factual answers for entity-centric queries.
Layer 2: Document Chunks
The second layer performs traditional vector similarity search against the chunk index. Results from both layers are merged, re-ranked using a cross-encoder model, and passed to the LLM with source attribution metadata.
Retrieval Flow
```
User Query
    |
    v
[Query Analysis] -- extract entities, intent, keywords
    |
    +--> [Layer 1: Entity Graph] -- structured lookup
    |         |
    +--> [Layer 2: Vector Search] -- semantic similarity
    |         |
    v         v
[Merge & Re-rank] -- cross-encoder scoring
    |
    v
[Compliance Filter] -- remove restricted content
    |
    v
[LLM Generation] -- grounded response with citations
```
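The merge, re-rank, and filter steps of this flow can be sketched as a single function. The entity lookup, vector search, cross-encoder, and compliance check are passed in as placeholder functions because their real interfaces are not described here. All names, the candidate shape, and the demo data are hypothetical.

```javascript
// Merge candidates from both layers, re-rank them, then apply the compliance filter.
// `entityLookup`, `vectorSearch`, `rerankScore`, and `isAllowed` are injected
// placeholders for the entity graph, vector index, cross-encoder, and policy check.
function retrieveContext(query, deps, topK = 5) {
  const { entityLookup, vectorSearch, rerankScore, isAllowed } = deps;
  const merged = new Map();
  for (const candidate of [...entityLookup(query), ...vectorSearch(query)]) {
    // Keep the first occurrence when both layers return the same item.
    if (!merged.has(candidate.id)) merged.set(candidate.id, candidate);
  }
  return [...merged.values()]
    .map((candidate) => ({ ...candidate, score: rerankScore(query, candidate) }))
    .sort((a, b) => b.score - a.score)
    .filter((candidate) => isAllowed(candidate))
    .slice(0, topK);
}

// Stub dependencies with fixed, made-up data, for illustration only.
const demoDeps = {
  entityLookup: () => [
    { id: "entity:pro-plan", text: "Pro plan (structured record)", restricted: false },
  ],
  vectorSearch: () => [
    { id: "chunk:17", text: "Overview of the Pro plan.", restricted: false },
    { id: "chunk:42", text: "Internal margin notes.", restricted: true },
    { id: "entity:pro-plan", text: "Duplicate of the entity hit.", restricted: false },
  ],
  rerankScore: (query, candidate) => (candidate.id.startsWith("entity:") ? 0.9 : 0.7),
  isAllowed: (candidate) => !candidate.restricted,
};

const context = retrieveContext("What does the Pro plan include?", demoDeps);
```

The filter runs after re-ranking to mirror the order in the diagram; the surviving items, with their IDs as source attribution, are what would be handed to the LLM.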
Compliance Gate
The compliance gate is a mandatory checkpoint that sits between the knowledge pipeline and production deployment. No knowledge reaches agents without passing through this gate.
Gate Checks
- PII scan — Final automated scan for any PII that survived the review stage. Uses pattern matching plus a fine-tuned NER model for high recall.
- Sensitivity classification — Content is classified as public, internal, confidential, or restricted. Agents can only access content at or below their clearance level.
- Regulatory tagging — Content touching regulated domains (healthcare, finance, legal) is tagged with applicable regulations and requires domain-expert approval.
- Source verification — The gate verifies that all content has a traceable source URL or document reference. Orphaned content without provenance is blocked.
- Freshness check — Content older than the configured TTL (default: 90 days) is flagged for re-validation before publication.
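Three of these checks are simple enough to sketch directly: clearance-level access, source verification, and the freshness TTL. The level ordering and the 90-day default come from the list above. The field names (`sourceUrl`, `documentRef`, `lastValidatedAt`), the finding format, and the choice to measure age from the last validation date are assumptions.

```javascript
// Sensitivity levels from least to most restricted, as listed above.
const SENSITIVITY_LEVELS = ["public", "internal", "confidential", "restricted"];

function levelRank(level) {
  const rank = SENSITIVITY_LEVELS.indexOf(level);
  if (rank === -1) throw new Error(`Unknown sensitivity level: ${level}`);
  return rank;
}

// An agent may read content at or below its clearance level.
function canAccess(agentClearance, contentLevel) {
  return levelRank(contentLevel) <= levelRank(agentClearance);
}

const GATE_MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns gate findings for one content item; an empty list means it passes these two checks.
function gateFindings(item, { ttlDays = 90, now = new Date() } = {}) {
  const findings = [];
  if (!item.sourceUrl && !item.documentRef) {
    findings.push({ check: "source_verification", outcome: "blocked" });
  }
  const ageDays = (now - item.lastValidatedAt) / GATE_MS_PER_DAY;
  if (ageDays > ttlDays) {
    findings.push({ check: "freshness", outcome: "flagged_for_revalidation" });
  }
  return findings;
}

const orphaned = gateFindings(
  { sourceUrl: null, documentRef: null, lastValidatedAt: new Date("2024-01-01T00:00:00Z") },
  { now: new Date("2024-06-01T00:00:00Z") }
);
```

Unknown sensitivity levels raise an error instead of defaulting to access, so a mislabeled item fails closed.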
Deterministic Deduplication
Operanix employs a two-phase deduplication strategy to prevent duplicate knowledge from reaching agents:
Phase 1: Hash-Based (Exact Match)
Every ingested document and chunk is assigned a SHA-256 content hash. Before insertion, the hash is checked against the existing index. Exact matches are skipped immediately, before any embedding comparison is run.
Phase 2: Semantic Similarity (Near-Duplicate)
For content that passes hash deduplication, a fast embedding comparison identifies near-duplicates. Content pairs with cosine similarity above 0.92 are flagged. The system preserves the version with the higher quality score and more recent timestamp, creating a merge record in the audit trail.
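```javascript
// Deduplication decision logic (sketch; recency tie-breaking is not shown).
const NEAR_DUPLICATE_THRESHOLD = 0.92;

// Cosine similarity of two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / Math.sqrt(normA * normB);
}

// Decide what to do with `incoming` given its closest `existing` match.
function dedupDecision(incoming, existing) {
  if (incoming.contentHash === existing.contentHash) {
    return { action: "skip", reason: "exact_duplicate" };
  }
  if (cosineSimilarity(incoming.embedding, existing.embedding) > NEAR_DUPLICATE_THRESHOLD) {
    return incoming.qualityScore > existing.qualityScore
      ? { action: "replace", audit: "near_duplicate_replaced", reason: "higher_quality" }
      : { action: "skip", reason: "near_duplicate_lower_quality" };
  }
  return { action: "insert" };
}
```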
Step 6: Publish
The Publish stage deploys reviewed, enriched, and compliance-cleared knowledge to the production environment.
- Versioned deployment — Each publish creates a versioned snapshot. You can roll back to any previous version instantly.
- Staged rollout — Optionally publish to a canary environment first, routing a percentage of agent queries to the new knowledge while monitoring quality metrics.
- Approval workflow — Configurable multi-stage approvals. Default requires at least one knowledge reviewer and one compliance officer sign-off.
- Notification — Stakeholders are notified via email and in-app notification when knowledge is published, with a summary of changes.
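The default approval rule (at least one knowledge reviewer and one compliance officer) can be expressed as a small check. The role identifiers, the approval record shape, and the `canPublish` helper are hypothetical names chosen for this sketch.

```javascript
// Default sign-offs required before publishing, per the approval workflow above.
const REQUIRED_ROLES = ["knowledge_reviewer", "compliance_officer"];

// Returns whether publishing is allowed and which required roles have not yet approved.
function canPublish(approvals, requiredRoles = REQUIRED_ROLES) {
  const approvedRoles = new Set(
    approvals.filter((approval) => approval.approved).map((approval) => approval.role)
  );
  const missing = requiredRoles.filter((role) => !approvedRoles.has(role));
  return { allowed: missing.length === 0, missing };
}

// Made-up approval records: the compliance officer has not signed off yet.
const pending = canPublish([
  { user: "reviewer-1", role: "knowledge_reviewer", approved: true },
  { user: "officer-1", role: "compliance_officer", approved: false },
]);
```

Passing a different `requiredRoles` list models the configurable multi-stage approvals without changing the check itself.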
Step 7: Pipeline Runs
The Pipeline Runs tab provides full observability into every execution of the knowledge pipeline.
Run Dashboard
- Run history — Complete list of pipeline runs with status (success, partial, failed), duration, documents processed, and error counts.
- Stage-level logs — Drill into any run to see per-stage execution logs, timing, and output statistics.
- Error investigation — Failed stages show detailed error messages with suggested remediation. Common errors (timeout, auth failure, rate limit) have one-click retry.
- Re-run controls — Re-run the entire pipeline or individual stages from the dashboard. Partial re-runs resume from the failed stage.
- Health metrics — Pipeline health score based on success rate, average duration, and error trends over the last 30 days.
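Two of the behaviors above, resuming a partial re-run from the failed stage and computing a success rate, are easy to sketch. The stage identifiers below are hypothetical names for Steps 1 to 6 of the overview table, and the run record shape is an assumption. How the health score combines success rate with duration and error trends is not specified in this guide, so it is not modeled.

```javascript
// Hypothetical stage identifiers, in pipeline order (Steps 1 to 6 of the overview table).
const PIPELINE_STAGES = [
  "sources",
  "review",
  "agent_coverage",
  "training_eval",
  "tenant_intelligence",
  "publish",
];

// For a partial re-run, return the first failed stage and every stage after it.
function stagesToRerun(stageStatus) {
  const firstFailed = PIPELINE_STAGES.findIndex(
    (stage) => stageStatus[stage] === "failed"
  );
  return firstFailed === -1 ? [] : PIPELINE_STAGES.slice(firstFailed);
}

// Share of runs that finished with status "success"; null when there are no runs.
function successRate(runs) {
  if (runs.length === 0) return null;
  return runs.filter((run) => run.status === "success").length / runs.length;
}

const rerun = stagesToRerun({
  sources: "success",
  review: "success",
  agent_coverage: "success",
  training_eval: "failed",
});
const rate = successRate([
  { status: "success" },
  { status: "failed" },
  { status: "success" },
  { status: "partial" },
]);
```

Here "partial" runs count as unsuccessful; a real health score might weight them differently.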
Best Practices
- Start with a single high-quality source and expand incrementally. Verify retrieval accuracy before adding more sources.
- Configure PII detection rules before your first crawl to catch sensitive content early.
- Use the coverage matrix weekly to identify knowledge gaps before they become agent blind spots.
- Set freshness TTLs appropriate to your domain: 30 days for pricing/product pages, 90 days for documentation, 180 days for policy documents.
- Monitor the Tenant Intelligence pipeline for contradiction alerts — they often indicate outdated content that needs updating at the source.
- Always use staged rollout for large knowledge updates (50+ documents) to catch quality regressions before they affect all users.