Knowledge Operations Guide

The Operanix Knowledge Operations pipeline transforms raw enterprise data into verified, agent-ready knowledge. This guide covers the complete 7-step pipeline, the two-layer RAG architecture, the compliance gate, deterministic deduplication, and the Tenant Intelligence system.

Knowledge Operations is the foundation of every Operanix agent. Agents can only answer questions grounded in knowledge that has passed through this pipeline, ensuring accuracy and compliance at every step.

Pipeline Overview

The knowledge pipeline processes enterprise data through seven sequential stages, each with built-in quality gates and compliance checks. Data enters as raw sources and exits as verified, indexed knowledge ready for agent consumption.

Step	Stage	Purpose	Output
1	Sources	Connect and configure data sources	Raw content streams
2	Review	Human review of extracted content	Approved documents
3	Agent Coverage	Map knowledge to agent domains	Coverage assignments
4	Training & Eval	Fine-tune retrieval and validate quality	Trained embeddings
5	Tenant Intelligence	12-stage deep enrichment pipeline	Enriched knowledge graph
6	Publish	Deploy to production with compliance sign-off	Live knowledge base
7	Pipeline Runs	Monitor execution, logs, and health	Audit trail

Step 1: Sources

The Sources tab is where you connect enterprise data to the knowledge pipeline. Operanix supports a wide range of source types, each with configurable crawl schedules and extraction settings.

Supported Source Types

Web Crawl — Provide a root URL and Operanix recursively crawls the site, respecting robots.txt and rate limits. Configurable depth (1–5 levels), URL patterns for include/exclude, and automatic sitemap detection.
Document Upload — Upload PDF, DOCX, XLSX, PPTX, TXT, CSV, and Markdown files directly. Files are parsed with layout-aware extraction that preserves tables, headings, and list structure.
API Connector — Pull content from REST APIs on a schedule. Supports OAuth 2.0, API key, and bearer token authentication. JSON response mapping lets you specify which fields contain the content body, title, and metadata.
Knowledge Base Sync — Connect to existing knowledge bases (Confluence, Notion, SharePoint, Google Drive) via native integrations. Incremental sync detects only changed pages.
CRM & Ticketing — Import resolved support tickets, FAQ entries, and product documentation from Zendesk, Salesforce, HubSpot, and Intercom.
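The JSON response mapping described for the API Connector can be pictured as a set of dotted paths resolved against each response item. The following is a minimal sketch under stated assumptions: the shape of `mapping`, the helpers `getPath` and `mapResponseItem`, and the sample field names are all hypothetical and are not the Operanix configuration format.

```javascript
// Hypothetical mapping: which response fields hold the body, title, and metadata.
const mapping = {
  body: "data.article.html",
  title: "data.article.title",
  metadata: { author: "data.author.name", updatedAt: "data.updated_at" },
};

// Resolve a dotted path such as "data.article.title".
// Returns undefined as soon as any segment is missing.
function getPath(obj, path) {
  return path.split(".").reduce(
    (value, key) => (value !== null && value !== undefined ? value[key] : undefined),
    obj
  );
}

// Build a normalized document from one API response item.
function mapResponseItem(item, fieldMapping) {
  const metadata = {};
  for (const [name, path] of Object.entries(fieldMapping.metadata)) {
    metadata[name] = getPath(item, path);
  }
  return {
    body: getPath(item, fieldMapping.body),
    title: getPath(item, fieldMapping.title),
    metadata,
  };
}

// Made-up response item for illustration.
const sample = {
  data: {
    article: { title: "Reset a password", html: "<p>Open Settings.</p>" },
    author: { name: "Docs Team" },
    updated_at: "2024-01-15",
  },
};
const doc = mapResponseItem(sample, mapping);
```

Resolving paths defensively means a response that lacks an optional field produces an undefined value rather than an exception, which leaves the decision about incomplete records to a later stage.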

Crawl Configuration

Scheduling

Each source can be configured with a crawl schedule: hourly, daily, weekly, or a custom cron expression. In the example below, the schedule "0 2 * * *" runs the crawl every day at 02:00. The pipeline tracks content hashes to skip unchanged documents, minimizing compute and API costs.

{
  "source": "web_crawl",
  "url": "https://docs.example.com",
  "schedule": "0 2 * * *",
  "depth": 3,
  "include_patterns": ["/docs/*", "/api/*"],
  "exclude_patterns": ["/blog/*", "/changelog/*"],
  "respect_robots": true,
  "max_pages": 500
}
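To illustrate how the depth limit and the include/exclude patterns in this configuration might interact, here is a small sketch. It rests on assumptions the guide does not state: that `*` matches any run of characters, that patterns are tested against the URL path only, and that an exclude match overrides an include match. The helpers `globToRegExp` and `shouldCrawl` are hypothetical names.

```javascript
// Crawl settings copied from the example configuration above.
const config = {
  depth: 3,
  include_patterns: ["/docs/*", "/api/*"],
  exclude_patterns: ["/blog/*", "/changelog/*"],
};

// Convert a simple glob ("*" = any characters) into an anchored RegExp.
function globToRegExp(glob) {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

// Assumed rule: crawl a URL when it is within the depth limit,
// matches at least one include pattern, and matches no exclude pattern.
function shouldCrawl(url, linkDepth, cfg) {
  if (linkDepth > cfg.depth) return false;
  const path = new URL(url).pathname;
  const matches = (patterns) => patterns.some((p) => globToRegExp(p).test(path));
  return matches(cfg.include_patterns) && !matches(cfg.exclude_patterns);
}
```

Under these assumptions, a page such as https://docs.example.com/docs/setup found at depth 2 would be crawled, while anything under /blog/ or deeper than three levels would be skipped.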

Step 2: Review

Every piece of extracted content enters the Review queue before it can proceed through the pipeline. This human-in-the-loop stage ensures that only relevant, accurate, and appropriate content reaches your agents.

Review Workflow

Auto-classification — Incoming content is automatically categorized by topic, sensitivity level, and relevance score. Low-relevance content is flagged for quick rejection.
Content preview — Reviewers see the extracted text alongside the original source for comparison. Extraction errors (broken tables, missing sections) are highlighted.
Bulk actions — Select multiple items to approve, reject, or flag for re-extraction in batch.
PII detection — Automatic scanning for personally identifiable information (emails, phone numbers, SSNs, credit card numbers) with inline redaction tools.
Compliance tagging — Reviewers can tag content with compliance labels (HIPAA, PCI, GDPR, internal-only) that control downstream access.

Content that contains detected PII is held in the review queue and cannot be published until the PII is redacted or an authorized compliance officer approves the exception.
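The PII hold rule above can be sketched as a detector plus a publish check. This is illustrative only: the two regular expressions are deliberately simple and would miss many real-world formats, and the names `detectPii`, `redactPii`, `canLeaveReview`, and the `complianceException` flag are assumptions rather than Operanix APIs.

```javascript
// Deliberately simple patterns for illustration; real detectors need far broader coverage.
const PII_PATTERNS = {
  email: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g,
  ssn: /\b\d{3}-\d{2}-\d{4}\b/g,
};

// Return every match together with its PII type.
function detectPii(text) {
  const findings = [];
  for (const [type, pattern] of Object.entries(PII_PATTERNS)) {
    for (const match of text.matchAll(pattern)) {
      findings.push({ type, value: match[0] });
    }
  }
  return findings;
}

// Replace each detected value with a typed placeholder.
function redactPii(text) {
  let redacted = text;
  for (const [type, pattern] of Object.entries(PII_PATTERNS)) {
    redacted = redacted.replace(pattern, `[REDACTED ${type.toUpperCase()}]`);
  }
  return redacted;
}

// Hold rule from the text: content with detected PII stays in the queue
// unless it has been redacted or a compliance officer approved an exception.
function canLeaveReview(item) {
  return detectPii(item.text).length === 0 || item.complianceException === true;
}
```

Running the check again on the redacted text, rather than trusting a flag set by the redaction tool, keeps the hold rule tied to what the content actually contains.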

Step 3: Agent Coverage

After review, approved content must be mapped to one or more agents. The Agent Coverage tab provides a matrix view showing which knowledge domains are assigned to which agents.

Coverage Matrix

The coverage matrix displays agents on one axis and knowledge domains on the other. Each cell shows a coverage status:

Full coverage — Agent has all relevant knowledge for this domain.
Partial coverage — Some knowledge is assigned but gaps exist. The system identifies specific missing topics.
No coverage — Domain is not assigned to this agent.
Overlap warning — Multiple agents cover the same domain, which may cause conflicting answers. Flagged for resolution.
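One way to picture how the four cell statuses could be derived is shown below. The data model is an assumption: each domain is taken to have a list of required topics, and each agent a list of assigned topics per domain. The function names and the sample `matrix` are hypothetical.

```javascript
// Compare the topics a domain requires with the topics assigned to one agent.
function coverageStatus(requiredTopics, assignedTopics) {
  const assigned = new Set(assignedTopics);
  const missing = requiredTopics.filter((topic) => !assigned.has(topic));
  if (assigned.size === 0) return { status: "none", missing };
  return { status: missing.length === 0 ? "full" : "partial", missing };
}

// Flag domains that more than one agent covers (the overlap warning).
function findOverlaps(coverage) {
  const agentsByDomain = new Map();
  for (const [agent, domains] of Object.entries(coverage)) {
    for (const [domain, topics] of Object.entries(domains)) {
      if (topics.length === 0) continue; // No coverage, so no overlap.
      if (!agentsByDomain.has(domain)) agentsByDomain.set(domain, []);
      agentsByDomain.get(domain).push(agent);
    }
  }
  return [...agentsByDomain].filter(([, agents]) => agents.length > 1);
}

// Made-up matrix: agent -> domain -> assigned topics.
const matrix = {
  billingAgent: { billing: ["invoices", "refunds"], shipping: [] },
  supportAgent: { billing: ["refunds"], shipping: ["tracking"] },
};
```

In this sample, both agents have topics in the billing domain, so `findOverlaps(matrix)` reports billing as an overlap, and the `missing` list on a partial cell corresponds to the specific missing topics the matrix surfaces.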

Auto-Assignment

Enable auto-assignment to let Operanix automatically map new knowledge to agents based on their configured specialization domains. Auto-assigned content still appears in the coverage dashboard for manual review.

Step 4: Training & Evaluation

Once knowledge is mapped to agents, the Training & Eval stage validates that agents can actually retrieve and use the knowledge correctly.

Retrieval Training

Embedding generation — Content is chunked using semantic boundary detection (not fixed-size splitting) and embedded using the configured model. Chunk sizes adapt to content type: shorter for FAQ pairs, longer for technical documentation.
Index optimization — Vector indices are built with HNSW for fast approximate nearest-neighbor search. The system benchmarks retrieval accuracy against a golden test set before promoting new indices.
Query simulation — Synthetic queries are generated from the knowledge content to test retrieval paths. Queries that fail to retrieve the correct chunks are flagged.

Evaluation Metrics

Metric	Target	Description
Recall@5	≥ 0.90	Correct chunk appears in top 5 results
MRR	≥ 0.80	Mean reciprocal rank of correct chunk
Latency P95	≤ 200ms	95th percentile retrieval time
Chunk relevance	≥ 0.85	LLM-judged relevance of top chunk to query
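The first three metrics in the table follow standard definitions, which the sketch below implements. The record shape (`ranked`, `expected`) and the function names are assumptions for illustration; chunk relevance is LLM-judged and is passed in as a plain number here.

```javascript
// Each record holds the ranked chunk IDs returned for a query
// and the ID of the chunk that should have been retrieved.
function recallAtK(records, k) {
  const hits = records.filter((r) => r.ranked.slice(0, k).includes(r.expected));
  return hits.length / records.length;
}

// Mean reciprocal rank: average of 1/rank of the expected chunk (0 when absent).
function meanReciprocalRank(records) {
  const total = records.reduce((sum, r) => {
    const index = r.ranked.indexOf(r.expected);
    return sum + (index === -1 ? 0 : 1 / (index + 1));
  }, 0);
  return total / records.length;
}

// Nearest-rank percentile, used here for P95 latency.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Targets taken from the table above.
function meetsTargets({ recall5, mrr, latencyP95Ms, chunkRelevance }) {
  return recall5 >= 0.9 && mrr >= 0.8 && latencyP95Ms <= 200 && chunkRelevance >= 0.85;
}
```

Both averaging functions assume a non-empty record list; a production evaluator would need to handle the empty case explicitly.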

Step 5: Tenant Intelligence

The Tenant Intelligence pipeline is a 12-stage deep enrichment process that transforms raw knowledge into a richly connected knowledge graph. This is the most compute-intensive stage and runs asynchronously.

12-Stage Pipeline

Stages 1–4: Extraction

Stage 1: Entity extraction — Identifies products, features, people, organizations, dates, and domain-specific entities using NER models tuned to your industry.
Stage 2: Relationship mapping — Detects relationships between entities (e.g., "Product X integrates with Service Y") and builds an entity graph.
Stage 3: Topic clustering — Groups related chunks into coherent topics using hierarchical clustering. Topics become navigable categories in the knowledge base.
Stage 4: Sentiment & intent analysis — Tags content with sentiment polarity and detected user intent (informational, transactional, navigational).

Stages 5–8: Enrichment

Stage 5: Gap detection — Identifies topics mentioned but not fully covered. Generates gap reports with suggested content to author.
Stage 6: Contradiction detection — Cross-references facts across documents to find conflicting statements (e.g., different pricing on two pages).
Stage 7: Freshness scoring — Assigns decay scores based on content age, update frequency, and domain volatility. Stale content is flagged for re-crawl or manual update.
Stage 8: Cross-reference linking — Creates bidirectional links between related chunks, enabling agents to follow context chains when answering complex queries.
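The bidirectional linking in Stage 8 can be sketched as an undirected graph over chunk IDs, with a breadth-first walk standing in for following a context chain. The `ChunkLinks` class, its method names, and the hop limit are hypothetical; the guide does not describe how the links are stored or traversed.

```javascript
// Minimal in-memory link store: chunk ID -> set of linked chunk IDs.
class ChunkLinks {
  constructor() {
    this.links = new Map();
  }

  // Record the link in both directions so it can be followed from either chunk.
  link(a, b) {
    if (!this.links.has(a)) this.links.set(a, new Set());
    if (!this.links.has(b)) this.links.set(b, new Set());
    this.links.get(a).add(b);
    this.links.get(b).add(a);
  }

  // Breadth-first walk collecting chunks reachable within maxHops links.
  contextChain(start, maxHops) {
    const seen = new Set([start]);
    let frontier = [start];
    for (let hop = 0; hop < maxHops && frontier.length > 0; hop++) {
      const next = [];
      for (const id of frontier) {
        for (const neighbor of this.links.get(id) ?? []) {
          if (!seen.has(neighbor)) {
            seen.add(neighbor);
            next.push(neighbor);
          }
        }
      }
      frontier = next;
    }
    return [...seen];
  }
}
```

Bounding the walk by hop count is a design choice made for this sketch: it keeps the amount of linked context predictable when a chunk sits in a densely connected part of the graph.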

Stages 9–12: Quality & Compliance

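Stage 9: Deduplication — Deterministic deduplication using content hashing (SHA-256) and semantic similarity. Near-duplicates (cosine similarity > 0.92) are merged, preserving the version with the higher quality score and the more recent timestamp, as described under Deterministic Deduplication.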
Stage 10: Compliance classification — Automated classification against configured compliance frameworks (SOC 2, HIPAA, GDPR, PCI-DSS). Content that triggers compliance rules is routed to the compliance gate.
Stage 11: Quality scoring — Each chunk receives a composite quality score based on completeness, clarity, accuracy confidence, and source authority.
Stage 12: Index promotion — Final stage packages the enriched knowledge graph and promotes it to the production index with a versioned snapshot for rollback.

Two-Layer RAG Architecture

Operanix uses a two-layer retrieval-augmented generation (RAG) architecture that combines structured entity retrieval with document chunk retrieval for maximum accuracy.

Layer 1: Structured Entities

The first retrieval layer queries the entity graph built during Tenant Intelligence. When an agent receives a question, it first identifies relevant entities (products, features, policies) and retrieves their structured attributes and relationships. This layer provides precise, factual answers for entity-centric queries.

Layer 2: Document Chunks

The second layer performs traditional vector similarity search against the chunk index. Results from both layers are merged, re-ranked using a cross-encoder model, and passed to the LLM with source attribution metadata.

The two-layer approach improves answer accuracy by 23% on entity-centric queries compared to chunk-only RAG, while maintaining equivalent performance on open-ended questions.

Retrieval Flow

User Query
  |
  v
[Query Analysis] -- extract entities, intent, keywords
  |
  +--> [Layer 1: Entity Graph] -- structured lookup
  |         |
  +--> [Layer 2: Vector Search] -- semantic similarity
  |         |
  v         v
[Merge & Re-rank] -- cross-encoder scoring
  |
  v
[Compliance Filter] -- remove restricted content
  |
  v
[LLM Generation] -- grounded response with citations
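The merge, re-rank, and compliance filter steps in the flow above can be sketched as three small functions. Everything here is a stand-in: the hit shape (`id`, `text`, `score`, `sensitivity`) is assumed, and `wordOverlap` is a toy scorer used in place of the cross-encoder model, which this sketch does not attempt to reproduce.

```javascript
// Merge candidates from both layers, keeping one entry per ID (highest score wins).
function mergeCandidates(entityHits, chunkHits) {
  const byId = new Map();
  for (const hit of [...entityHits, ...chunkHits]) {
    const current = byId.get(hit.id);
    if (!current || hit.score > current.score) byId.set(hit.id, hit);
  }
  return [...byId.values()];
}

// Re-rank with a caller-supplied scoring function (a placeholder for the cross-encoder).
function rerank(query, candidates, scoreFn) {
  return candidates
    .map((c) => ({ ...c, rerankScore: scoreFn(query, c.text) }))
    .sort((a, b) => b.rerankScore - a.rerankScore);
}

// Drop content outside the allowed sensitivity levels before it reaches the LLM.
function complianceFilter(candidates, allowedLevels) {
  return candidates.filter((c) => allowedLevels.includes(c.sensitivity));
}

// Toy scorer: fraction of query words that appear in the text.
function wordOverlap(query, text) {
  const words = query.toLowerCase().split(/\s+/);
  const haystack = text.toLowerCase();
  return words.filter((w) => haystack.includes(w)).length / words.length;
}

// Chain the three steps in the order shown in the retrieval flow.
function retrieve(query, entityHits, chunkHits, allowedLevels) {
  const merged = mergeCandidates(entityHits, chunkHits);
  return complianceFilter(rerank(query, merged, wordOverlap), allowedLevels);
}
```

Applying the compliance filter after re-ranking mirrors the order in the diagram; filtering earlier would also be valid and would avoid scoring content that is later discarded.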

Compliance Gate

The compliance gate is a mandatory checkpoint that sits between the knowledge pipeline and production deployment. No knowledge reaches agents without passing through this gate.

Gate Checks

PII scan — Final automated scan for any PII that survived the review stage. Uses pattern matching plus a fine-tuned NER model for high recall.
Sensitivity classification — Content is classified as public, internal, confidential, or restricted. Agents can only access content at or below their clearance level.
Regulatory tagging — Content touching regulated domains (healthcare, finance, legal) is tagged with applicable regulations and requires domain-expert approval.
Source verification — The gate verifies that all content has a traceable source URL or document reference. Orphaned content without provenance is blocked.
Freshness check — Content older than the configured TTL (default: 90 days) is flagged for re-validation before publication.
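Three of these checks lend themselves to a short sketch: the clearance comparison, source verification, and the freshness TTL. The field names (`sourceUrl`, `documentRef`, `sensitivity`, `lastValidated`) and the choice to block unclassified content are assumptions made for illustration, not documented Operanix behavior.

```javascript
// Sensitivity levels in ascending order of restriction.
const LEVELS = ["public", "internal", "confidential", "restricted"];

// An agent may read content at or below its clearance level.
// Unknown levels are rejected rather than treated as public.
function canAccess(agentClearance, contentLevel) {
  const agentRank = LEVELS.indexOf(agentClearance);
  const contentRank = LEVELS.indexOf(contentLevel);
  return agentRank !== -1 && contentRank !== -1 && contentRank <= agentRank;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns blocking failures and non-blocking flags for one content item.
// Timestamps are epoch milliseconds; ttlDays defaults to the 90 days stated above.
function runGateChecks(content, now, ttlDays = 90) {
  const blocked = [];
  const flagged = [];
  if (!content.sourceUrl && !content.documentRef) blocked.push("missing_provenance");
  if (!LEVELS.includes(content.sensitivity)) blocked.push("unclassified"); // assumed rule
  if ((now - content.lastValidated) / MS_PER_DAY > ttlDays) flagged.push("stale");
  return { blocked, flagged };
}
```

The split between `blocked` and `flagged` follows the wording of the list: orphaned content is blocked outright, while content past its TTL is only flagged for re-validation.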

The compliance gate cannot be bypassed, even by admin users. All gate decisions are logged to the immutable audit trail with the reviewer's identity and a timestamp.

Deterministic Deduplication

Operanix employs a two-phase deduplication strategy to prevent duplicate knowledge from reaching agents:

Phase 1: Hash-Based (Exact Match)

Every ingested document and chunk is assigned a SHA-256 content hash. Before insertion, the hash is checked against the existing index. Exact matches are skipped immediately, so no embedding comparison is spent on them.

Phase 2: Semantic Similarity (Near-Duplicate)

For content that passes hash deduplication, a fast embedding comparison identifies near-duplicates. Content pairs with cosine similarity above 0.92 are flagged. The system preserves the version with the higher quality score and the more recent timestamp, and creates a merge record in the audit trail. The decision logic below shows the quality-score comparison only.

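// Deduplication decision logic (illustrative pseudocode: skip, replace, insert,
// audit, and cosineSimilarity are assumed to be provided by the pipeline)
if (contentHash === existingHash) {
  // Phase 1: identical SHA-256 hash, so the content is an exact duplicate
  skip("exact_duplicate");
} else if (cosineSimilarity(embedding, existingEmbedding) > 0.92) {
  // Phase 2: near-duplicate, keep whichever version has the higher quality score
  if (newQualityScore > existingQualityScore) {
    replace(existing, newContent);
    audit("near_duplicate_replaced", { reason: "higher_quality" });
  } else {
    // Equal or lower quality: the existing version stays in the index
    skip("near_duplicate_lower_quality");
  }
} else {
  // Neither an exact nor a near duplicate: index as new content
  insert(newContent);
}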
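The decision logic relies on a cosine similarity function that is not shown. A minimal sketch of the standard formula follows; the helper name `isNearDuplicate` and the handling of zero vectors are assumptions, and the actual embedding comparison used by Operanix may differ.

```javascript
// Cosine similarity of two equal-length numeric vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // Zero vectors have no direction.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Threshold from the text above.
const NEAR_DUPLICATE_THRESHOLD = 0.92;

function isNearDuplicate(a, b) {
  return cosineSimilarity(a, b) > NEAR_DUPLICATE_THRESHOLD;
}
```

Because cosine similarity ignores vector length, two chunks whose embeddings point in nearly the same direction are flagged even if one embedding has a larger magnitude.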

Step 6: Publish

The Publish stage deploys reviewed, enriched, and compliance-cleared knowledge to the production environment.

Versioned deployment — Each publish creates a versioned snapshot. You can roll back to any previous version instantly.
Staged rollout — Optionally publish to a canary environment first, routing a percentage of agent queries to the new knowledge while monitoring quality metrics.
Approval workflow — Configurable multi-stage approvals. By default, at least one knowledge reviewer and one compliance officer must sign off.
Notification — Stakeholders are notified via email and in-app notification when knowledge is published, with a summary of changes.
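The default approval rule can be expressed as a small policy check. This is a sketch under assumptions: the role identifiers, the sign-off record shape, and the decision to count distinct users per role are hypothetical, not taken from the Operanix configuration.

```javascript
// Default policy from the text: at least one knowledge reviewer and one compliance officer.
const DEFAULT_POLICY = { knowledge_reviewer: 1, compliance_officer: 1 };

// signoffs: [{ user, role }]. Counts distinct users per role,
// so repeated sign-offs by one person do not satisfy a higher requirement.
function missingApprovals(signoffs, policy = DEFAULT_POLICY) {
  const missing = {};
  for (const [role, required] of Object.entries(policy)) {
    const users = new Set(signoffs.filter((s) => s.role === role).map((s) => s.user));
    if (users.size < required) missing[role] = required - users.size;
  }
  return missing;
}

// A publish may proceed only when no role is short of its required sign-offs.
function canPublish(signoffs, policy = DEFAULT_POLICY) {
  return Object.keys(missingApprovals(signoffs, policy)).length === 0;
}
```

Returning the shortfall per role, rather than a bare boolean, makes it easy to tell a requester exactly which sign-off is still outstanding.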

Step 7: Pipeline Runs

The Pipeline Runs tab provides full observability into every execution of the knowledge pipeline.

Run Dashboard

Run history — Complete list of pipeline runs with status (success, partial, failed), duration, documents processed, and error counts.
Stage-level logs — Drill into any run to see per-stage execution logs, timing, and output statistics.
Error investigation — Failed stages show detailed error messages with suggested remediation. Common errors (timeout, auth failure, rate limit) have one-click retry.
Re-run controls — Re-run the entire pipeline or individual stages from the dashboard. Partial re-runs resume from the failed stage.
Health metrics — Pipeline health score based on success rate, average duration, and error trends over the last 30 days.
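Two of the inputs to the health score, success rate and average duration over a trailing window, can be computed as below. The guide does not say how these inputs are combined into a single score, so this sketch stops at the inputs; the run record shape and function names are assumptions.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Keep only runs that finished within the trailing window (30 days in the text above).
// runs: [{ status: "success" | "partial" | "failed", finishedAt: epoch ms, durationMs }]
function recentRuns(runs, now, windowDays = 30) {
  return runs.filter((r) => now - r.finishedAt <= windowDays * DAY_MS);
}

// Share of runs in the window that ended with status "success".
function successRate(runs, now, windowDays = 30) {
  const recent = recentRuns(runs, now, windowDays);
  if (recent.length === 0) return null; // No runs in the window, so no rate to report.
  return recent.filter((r) => r.status === "success").length / recent.length;
}

// Mean run duration within the window, in milliseconds.
function averageDurationMs(runs, now, windowDays = 30) {
  const recent = recentRuns(runs, now, windowDays);
  if (recent.length === 0) return null;
  return recent.reduce((sum, r) => sum + r.durationMs, 0) / recent.length;
}
```

Returning null for an empty window, instead of 0 or 1, avoids reporting a misleading rate for a pipeline that simply has not run recently.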

Pipeline runs are integrated with the Operanix audit trail. Every run, stage execution, approval, and publish event is recorded with full traceability for compliance audits.

Best Practices

Start with a single high-quality source and expand incrementally. Verify retrieval accuracy before adding more sources.
Configure PII detection rules before your first crawl to catch sensitive content early.
Review the coverage matrix weekly to identify knowledge gaps before they become agent blind spots.
Set freshness TTLs appropriate to your domain: 30 days for pricing/product pages, 90 days for documentation, 180 days for policy documents.
Monitor the Tenant Intelligence pipeline for contradiction alerts, which often indicate outdated content that needs updating at the source.
Always use staged rollout for large knowledge updates (50+ documents) to catch quality regressions before they affect all users.