Agent Evaluation & Readiness Guide

Before any Operanix agent reaches production, it must pass a rigorous evaluation framework. This guide covers 7-dimension quality scoring, golden set testing, regression detection, hallucination analysis, and the GO/NO-GO production gate.

Agent Readiness is the quality gate between development and production. No agent can be deployed without achieving minimum scores across all seven evaluation dimensions.

Evaluation Tabs Overview

The Agent Readiness module contains 10 tabs, each focused on a specific aspect of agent quality assurance:

#TabPurpose
1Readiness DashboardAggregate readiness score and dimension breakdown for each agent
2Evaluation RunsHistory of all evaluation executions with scores and trends
3Golden SetsManage curated question-answer pairs used as ground truth
4Dimension ScoresDetailed per-dimension analysis with examples and failure breakdowns
5Regression TestingAutomated comparison against previous evaluation baselines
6Hallucination ReportDedicated analysis of groundedness failures and unsupported claims
7Production GateGO/NO-GO decision with automated and manual checkpoints
8Feedback LoopProduction feedback integration for continuous improvement
9A/B ComparisonsSide-by-side comparison of agent versions or configurations
10Export & ReportsGenerate evaluation reports for stakeholders and compliance

7-Dimension Quality Scoring

Every agent response in an evaluation run is scored across seven dimensions. Each dimension uses a 0–1 scale with LLM-as-judge evaluation, calibrated against human-labeled reference sets.

1. Correctness (weight: 20%)

Measures whether the agent's answer is factually correct. Compared against the golden set's expected answer using both exact-match heuristics and semantic equivalence scoring. Partial credit is awarded for answers that are correct but incomplete.

2. Relevance (weight: 15%)

Evaluates whether the response directly addresses the user's question without unnecessary information. Penalizes tangential content, excessive caveats, and off-topic elaboration. A concise, on-point answer scores higher than a verbose but technically correct one.

3. Tone (weight: 10%)

Assesses whether the response matches the agent's configured persona and communication style. Checks for consistency in formality level, brand voice adherence, empathy markers, and professional language. Customizable per agent.

4. Safety (weight: 15%)

Detects harmful, inappropriate, or risky content in the response. Checks for toxic language, unauthorized advice (medical, legal, financial without disclaimers), PII exposure, and prompt injection leakage. Any critical safety failure results in an automatic score of 0.

5. Groundedness (weight: 20%)

Measures whether every claim in the response is supported by the retrieved knowledge chunks. Each statement is extracted and traced back to a source. Statements without source support are flagged as ungrounded. This is the primary anti-hallucination metric.

6. Faithfulness (weight: 10%)

Evaluates whether the response accurately represents the source material without distortion. Differs from groundedness: a response can be grounded (claims exist in sources) but unfaithful (misrepresents what the source says). Checks for cherry-picking, misquoting, and context manipulation.

7. Context Utilization (weight: 10%)

Measures how effectively the agent uses the retrieved context. Penalizes agents that ignore relevant context chunks or fail to synthesize information from multiple sources. Rewards comprehensive answers that draw on all available context.

Composite Score Calculation

Composite Score = (
    0.20 x Correctness +
    0.15 x Relevance +
    0.10 x Tone +
    0.15 x Safety +
    0.20 x Groundedness +
    0.10 x Faithfulness +
    0.10 x Context Utilization
)

// Minimum thresholds (any dimension below threshold = NO-GO)
Safety     >= 0.90
Groundedness >= 0.80
Correctness  >= 0.75
All others   >= 0.60

Golden Set Management

Golden sets are curated collections of question-answer pairs that serve as ground truth for evaluation. They are the foundation of repeatable, objective agent testing.

Golden Set Structure

{
  "id": "gs-support-v3",
  "agent": "customer-support",
  "version": 3,
  "cases": [
    {
      "id": "case-001",
      "question": "How do I reset my password?",
      "expected_answer": "Go to Settings > Security > Reset Password...",
      "context_chunks": ["kb-auth-001", "kb-auth-002"],
      "tags": ["authentication", "self-service"],
      "difficulty": "easy",
      "created_by": "jane@company.com",
      "created_at": "2026-03-15T10:00:00Z"
    }
  ]
}

Golden Set Best Practices

Golden sets should be created by domain experts, not by the people who built the agent. This separation prevents confirmation bias in evaluation.

Regression Testing

Regression testing automatically compares new evaluation results against a baseline to detect quality degradation before it reaches production.

How Regression Detection Works

Regression Report

MetricBaselineCurrentDeltaStatus
Composite0.870.85-0.02Pass
Correctness0.890.84-0.05Warning
Safety0.960.97+0.01Pass
Groundedness0.880.82-0.06Warning
Regressed casescase-017, case-034, case-089Investigate

Hallucination Detection

The Hallucination Report provides deep analysis of every instance where an agent made claims not supported by its knowledge base.

Hallucination Categories

Hallucination Trace

Each detected hallucination includes a trace showing:

Hallucination rate is tracked as a key metric over time. The target is less than 3% of evaluated responses containing any hallucinated claim. Agents exceeding 5% are automatically blocked from production deployment.

GO/NO-GO Production Gate

The Production Gate is the final checkpoint before an agent can be deployed to production. It combines automated scoring with manual review checkpoints.

Automated Checks

CheckThresholdWeight
Composite score≥ 0.80Required
Safety score≥ 0.90Required
Groundedness score≥ 0.80Required
Correctness score≥ 0.75Required
Hallucination rate≤ 5%Required
Regression testNo critical regressionsRequired
Knowledge coverage≥ 90% domains coveredAdvisory
Response latency P95≤ 3 secondsAdvisory

Manual Review Checkpoints

Gate Decision

A NO-GO decision requires a new evaluation run after fixes are applied. You cannot override a NO-GO; the agent must actually pass the automated thresholds.

Feedback Loop

The Feedback Loop tab connects production performance data back into the evaluation framework, creating a continuous improvement cycle.

Feedback Sources

Feedback-to-Evaluation Pipeline

The feedback loop is what makes Operanix evaluations improve over time. Agents that have been in production for 3+ months typically see a 15-20% improvement in their evaluation scores compared to their initial deployment, driven by feedback-informed golden set expansion.

A/B Comparisons

Compare two agent versions side-by-side on the same golden set to make data-driven decisions about configuration changes.

Best Practices