Agent Evaluation & Readiness Guide

Before any Operanix agent reaches production, it must pass a rigorous evaluation. This guide covers 7-dimension quality scoring, golden set testing, regression detection, hallucination analysis, and the GO/NO-GO production gate.

Agent Readiness is the quality gate between development and production. No agent can be deployed without achieving minimum scores across all seven evaluation dimensions.

Evaluation Tabs Overview

The Agent Readiness module contains 10 tabs, each focused on a specific aspect of agent quality assurance:

#	Tab	Purpose
1	Readiness Dashboard	Aggregate readiness score and dimension breakdown for each agent
2	Evaluation Runs	History of all evaluation executions with scores and trends
3	Golden Sets	Manage curated question-answer pairs used as ground truth
4	Dimension Scores	Detailed per-dimension analysis with examples and failure breakdowns
5	Regression Testing	Automated comparison against previous evaluation baselines
6	Hallucination Report	Dedicated analysis of groundedness failures and unsupported claims
7	Production Gate	GO/NO-GO decision with automated and manual checkpoints
8	Feedback Loop	Production feedback integration for continuous improvement
9	A/B Comparisons	Side-by-side comparison of agent versions or configurations
10	Export & Reports	Generate evaluation reports for stakeholders and compliance

7-Dimension Quality Scoring

Every agent response in an evaluation run is scored across seven dimensions. Each dimension uses a 0–1 scale with LLM-as-judge evaluation, calibrated against human-labeled reference sets.

1. Correctness (weight: 20%)

Measures whether the agent's answer is factually correct. The answer is compared against the golden set's expected answer using both exact-match heuristics and semantic equivalence scoring. Partial credit is awarded for answers that are correct but incomplete.

2. Relevance (weight: 15%)

Evaluates whether the response directly addresses the user's question without unnecessary information. Penalizes tangential content, excessive caveats, and off-topic elaboration. A concise, on-point answer scores higher than a verbose but technically correct one.

3. Tone (weight: 10%)

Assesses whether the response matches the agent's configured persona and communication style. Checks for consistency in formality level, brand voice adherence, empathy markers, and professional language. The criteria are customizable per agent.

4. Safety (weight: 15%)

Detects harmful, inappropriate, or risky content in the response. Checks for toxic language, unauthorized advice (medical, legal, financial without disclaimers), PII exposure, and prompt injection leakage. Any critical safety failure results in an automatic score of 0.

5. Groundedness (weight: 20%)

Measures whether every claim in the response is supported by the retrieved knowledge chunks. Each statement is extracted and traced back to a source. Statements without source support are flagged as ungrounded. This is the primary anti-hallucination metric.

6. Faithfulness (weight: 10%)

Evaluates whether the response accurately represents the source material without distortion. Differs from groundedness: a response can be grounded (claims exist in sources) but unfaithful (misrepresents what the source says). Checks for cherry-picking, misquoting, and context manipulation.

7. Context Utilization (weight: 10%)

Measures how effectively the agent uses the retrieved context. Penalizes agents that ignore relevant context chunks or fail to synthesize information from multiple sources. Rewards comprehensive answers that draw on all available context.

Composite Score Calculation

Composite Score = (
    0.20 x Correctness +
    0.15 x Relevance +
    0.10 x Tone +
    0.15 x Safety +
    0.20 x Groundedness +
    0.10 x Faithfulness +
    0.10 x Context Utilization
)

// Minimum thresholds (any dimension below threshold = NO-GO)
Safety       >= 0.90
Groundedness >= 0.80
Correctness  >= 0.75
All others   >= 0.60
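
The weighted sum and the per-dimension minimums above can be sketched in a few lines of Python. This is an illustrative sketch, not Operanix code: the dictionary keys and function names are assumptions chosen for the example, and only the weights and thresholds come from this guide.

```python
# Weights and minimums are taken from the guide; all names are hypothetical.
WEIGHTS = {
    "correctness": 0.20,
    "relevance": 0.15,
    "tone": 0.10,
    "safety": 0.15,
    "groundedness": 0.20,
    "faithfulness": 0.10,
    "context_utilization": 0.10,
}

# Dimensions not listed here fall back to DEFAULT_MIN ("All others").
MIN_THRESHOLDS = {"safety": 0.90, "groundedness": 0.80, "correctness": 0.75}
DEFAULT_MIN = 0.60


def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of the seven dimension scores (each on a 0-1 scale)."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


def failing_dimensions(scores: dict[str, float]) -> list[str]:
    """Dimensions below their minimum; any entry here means NO-GO."""
    return [
        dim for dim in WEIGHTS
        if scores[dim] < MIN_THRESHOLDS.get(dim, DEFAULT_MIN)
    ]


scores = {
    "correctness": 0.84, "relevance": 0.90, "tone": 0.80, "safety": 0.97,
    "groundedness": 0.82, "faithfulness": 0.85, "context_utilization": 0.70,
}
print(round(composite_score(scores), 4))  # 0.8475
print(failing_dimensions(scores))         # []
```

Because the weights sum to 1.0, the composite stays on the same 0-1 scale as the individual dimensions.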

Golden Set Management

Golden sets are curated collections of question-answer pairs that serve as ground truth for evaluation. They are the foundation of repeatable, objective agent testing.

Golden Set Structure

{
  "id": "gs-support-v3",
  "agent": "customer-support",
  "version": 3,
  "cases": [
    {
      "id": "case-001",
      "question": "How do I reset my password?",
      "expected_answer": "Go to Settings > Security > Reset Password...",
      "context_chunks": ["kb-auth-001", "kb-auth-002"],
      "tags": ["authentication", "self-service"],
      "difficulty": "easy",
      "created_by": "jane@company.com",
      "created_at": "2026-03-15T10:00:00Z"
    }
  ]
}
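
A golden set in this shape can be loaded into typed records before an evaluation run. The sketch below assumes the JSON layout shown above; the class names, the `parse_golden_set` helper, and the uniqueness check on case ids are hypothetical additions for illustration, not part of the Operanix product.

```python
import json
from dataclasses import dataclass


@dataclass
class GoldenCase:
    id: str
    question: str
    expected_answer: str
    context_chunks: list[str]
    tags: list[str]
    difficulty: str
    created_by: str
    created_at: str


@dataclass
class GoldenSet:
    id: str
    agent: str
    version: int
    cases: list[GoldenCase]


# Difficulty levels named in this guide.
ALLOWED_DIFFICULTIES = {"easy", "medium", "hard"}


def parse_golden_set(raw: str) -> GoldenSet:
    """Parse a golden set; GoldenCase(**c) raises TypeError on missing or unknown fields."""
    data = json.loads(raw)
    cases = [GoldenCase(**c) for c in data["cases"]]
    ids = [c.id for c in cases]
    if len(ids) != len(set(ids)):  # assumption: case ids must be unique
        raise ValueError("duplicate case ids in golden set")
    for case in cases:
        if case.difficulty not in ALLOWED_DIFFICULTIES:
            raise ValueError(f"{case.id}: unknown difficulty {case.difficulty!r}")
    return GoldenSet(data["id"], data["agent"], data["version"], cases)


EXAMPLE = """
{
  "id": "gs-support-v3", "agent": "customer-support", "version": 3,
  "cases": [{
    "id": "case-001",
    "question": "How do I reset my password?",
    "expected_answer": "Go to Settings > Security > Reset Password...",
    "context_chunks": ["kb-auth-001", "kb-auth-002"],
    "tags": ["authentication", "self-service"],
    "difficulty": "easy",
    "created_by": "jane@company.com",
    "created_at": "2026-03-15T10:00:00Z"
  }]
}
"""
golden_set = parse_golden_set(EXAMPLE)
print(golden_set.id, len(golden_set.cases))
```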

Golden Set Best Practices

Minimum 50 cases per agent — Include a mix of easy (40%), medium (35%), and hard (25%) questions to test the full capability range.
Cover all knowledge domains — Ensure every domain assigned to the agent has at least 5 golden set cases.
Include edge cases — Add questions that test boundary conditions: ambiguous queries, multi-part questions, questions requiring cross-domain synthesis, and out-of-scope questions (agent should gracefully decline).
Version control — Golden sets are versioned. When knowledge changes, update the golden set to reflect new expected answers. Keep previous versions for regression comparison.
Regular review — Schedule monthly golden set reviews to retire stale cases and add cases based on real production failures.
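
The size, difficulty-mix, and domain-coverage guidelines above lend themselves to an automated audit. The following is a minimal sketch under stated assumptions: it treats each case's `tags` as the domain labels, and the 10-point tolerance on the difficulty mix is a hypothetical value, since this guide gives target shares but no tolerance.

```python
from collections import Counter

MIN_CASES = 50
TARGET_MIX = {"easy": 0.40, "medium": 0.35, "hard": 0.25}
MIN_CASES_PER_DOMAIN = 5


def audit_golden_set(cases, domains, mix_tolerance=0.10):
    """Return a list of findings; an empty list means the guidelines are met.

    cases: list of dicts with "difficulty" and "tags" keys.
    domains: domain names assigned to the agent (assumed to appear as tags).
    mix_tolerance: hypothetical allowed deviation from the target shares.
    """
    findings = []
    if len(cases) < MIN_CASES:
        findings.append(f"only {len(cases)} cases; minimum is {MIN_CASES}")

    difficulty_counts = Counter(c["difficulty"] for c in cases)
    total = len(cases) or 1  # avoid division by zero on an empty set
    for level, target in TARGET_MIX.items():
        share = difficulty_counts[level] / total
        if abs(share - target) > mix_tolerance:
            findings.append(f"{level} share is {share:.0%}; target is {target:.0%}")

    tag_counts = Counter(tag for c in cases for tag in c["tags"])
    for domain in domains:
        if tag_counts[domain] < MIN_CASES_PER_DOMAIN:
            findings.append(f"domain {domain!r} has {tag_counts[domain]} cases")
    return findings


cases = (
    [{"difficulty": "easy", "tags": ["authentication"]}] * 20
    + [{"difficulty": "medium", "tags": ["authentication"]}] * 18
    + [{"difficulty": "hard", "tags": ["authentication"]}] * 12
)
print(audit_golden_set(cases, ["authentication", "billing"]))
```

Running a check like this as part of the monthly review makes coverage gaps visible before they distort evaluation results.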

Golden sets should be created by domain experts, not by the people who built the agent. This separation prevents confirmation bias in evaluation.

Regression Testing

Regression testing automatically compares new evaluation results against a baseline to detect quality degradation before it reaches production.

How Regression Detection Works

Baseline selection — The last successful production deployment evaluation serves as the baseline. You can also manually pin a specific evaluation run as the baseline.
Per-case comparison — Each golden set case is compared between baseline and current run. Cases with score drops of more than 0.15 are flagged as regressions.
Dimension-level regression — Even if the composite score holds, a drop in any single dimension of more than 0.10 triggers a regression warning.
Statistical significance — Small score fluctuations are expected. The system uses paired t-tests to determine whether observed differences are statistically significant (p < 0.05).
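
The detection steps above can be approximated with the standard library. This sketch is an assumption-laden illustration: the function names and input shapes are hypothetical, and it computes only the paired t statistic. Turning that statistic into a p-value requires the t distribution, which a library call such as `scipy.stats.ttest_rel` can supply.

```python
import math
from statistics import mean, stdev

CASE_DROP = 0.15       # per-case drop that flags a regression
DIMENSION_DROP = 0.10  # per-dimension drop that triggers a warning


def regressed_cases(baseline: dict[str, float], current: dict[str, float]) -> list[str]:
    """Case ids whose score dropped by more than CASE_DROP."""
    return sorted(
        cid for cid in baseline
        if cid in current and baseline[cid] - current[cid] > CASE_DROP
    )


def dimension_warnings(baseline: dict[str, float], current: dict[str, float]) -> list[str]:
    """Dimensions whose score dropped by more than DIMENSION_DROP."""
    return sorted(d for d in baseline if baseline[d] - current[d] > DIMENSION_DROP)


def paired_t_statistic(baseline: dict[str, float], current: dict[str, float]) -> float:
    """t statistic for per-case differences (current minus baseline)."""
    diffs = [current[cid] - baseline[cid] for cid in baseline if cid in current]
    if len(diffs) < 2:
        raise ValueError("need at least two paired cases")
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


baseline = {"case-001": 0.90, "case-002": 0.85, "case-003": 0.80,
            "case-004": 0.95, "case-005": 0.88}
current = {"case-001": 0.70, "case-002": 0.84, "case-003": 0.82,
           "case-004": 0.93, "case-005": 0.86}

print(regressed_cases(baseline, current))  # ['case-001']
t = paired_t_statistic(baseline, current)
# Two-sided critical value for p < 0.05 with 4 degrees of freedom is 2.776.
print(abs(t) > 2.776)                      # False
```

In this example one case regresses sharply, yet the overall shift across five cases is not statistically significant, which is why the per-case and significance checks are reported separately.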

Regression Report

Metric	Baseline	Current	Delta	Status
Composite	0.87	0.85	-0.02	Pass
Correctness	0.89	0.84	-0.05	Warning
Safety	0.96	0.97	+0.01	Pass
Groundedness	0.88	0.82	-0.06	Warning
Regressed cases	case-017, case-034, case-089			Investigate

Hallucination Detection

The Hallucination Report provides deep analysis of every instance where an agent made claims not supported by its knowledge base.

Hallucination Categories

Fabricated facts — Agent invents information that does not exist in any source. Example: citing a non-existent policy or making up a product feature.
Exaggerated claims — Agent overstates what the source material says. Example: "guaranteed 99.99% uptime" when the source says "targeting 99.9% uptime."
Conflated information — Agent merges details from different sources incorrectly. Example: attributing features of Product A to Product B.
Outdated references — Agent cites information from superseded content that was not properly retired from the knowledge base.
Implicit assumptions — Agent makes logical leaps not explicitly stated in the source. Example: inferring a feature exists because related features exist.

Hallucination Trace

Each detected hallucination includes a trace showing:

The specific claim in the agent's response
The retrieved context chunks that were available
The groundedness score for that claim
The nearest source passage (if any partial match exists)
A suggested correction based on available knowledge

Hallucination rate is tracked as a key metric over time. The target is less than 3% of evaluated responses containing any hallucinated claim. Agents exceeding 5% are automatically blocked from production deployment.
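
As a rough illustration of how the rate and its two thresholds interact, the sketch below counts responses with at least one hallucinated claim. The function names and status labels are hypothetical; only the 3% target and 5% blocking threshold come from this guide.

```python
TARGET_RATE = 0.03  # target: fewer than 3% of responses
BLOCK_RATE = 0.05   # above 5%: blocked from production deployment


def hallucination_rate(flagged_claims_per_response: list[int]) -> float:
    """Share of evaluated responses containing any hallucinated claim."""
    if not flagged_claims_per_response:
        raise ValueError("no evaluated responses")
    affected = sum(1 for count in flagged_claims_per_response if count > 0)
    return affected / len(flagged_claims_per_response)


def rate_status(rate: float) -> str:
    """Hypothetical status labels for the two thresholds."""
    if rate > BLOCK_RATE:
        return "blocked"
    if rate < TARGET_RATE:
        return "on target"
    return "above target"


# 200 evaluated responses, 8 of which contain at least one hallucinated claim.
flags = [2, 1, 1, 3, 1, 1, 2, 1] + [0] * 192
rate = hallucination_rate(flags)
print(f"{rate:.1%}", rate_status(rate))  # 4.0% above target
```

Note that a response with three hallucinated claims counts once, because the metric is defined per response rather than per claim.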

GO/NO-GO Production Gate

The Production Gate is the final checkpoint before an agent can be deployed to production. It combines automated scoring with manual review checkpoints.

Automated Checks

Check	Threshold	Enforcement
Composite score	≥ 0.80	Required
Safety score	≥ 0.90	Required
Groundedness score	≥ 0.80	Required
Correctness score	≥ 0.75	Required
Hallucination rate	≤ 5%	Required
Regression test	No critical regressions	Required
Knowledge coverage	≥ 90% domains covered	Advisory
Response latency P95	≤ 3 seconds	Advisory

Manual Review Checkpoints

Sample conversation review — Reviewer reads 10 randomly selected evaluation conversations and confirms quality.
Edge case verification — Reviewer tests the agent with adversarial or unusual queries not in the golden set.
Persona consistency check — Reviewer confirms the agent's tone and behavior match the intended persona across different query types.
Escalation path test — Reviewer verifies the agent correctly escalates queries it cannot handle to human operators.

Gate Decision

GO — All automated checks pass and all manual checkpoints are approved. Agent is cleared for production deployment via the governance approval chain.
CONDITIONAL GO — All required automated checks pass, but one or more advisory thresholds are not met. Agent can deploy with monitoring flags and a mandatory re-evaluation within 7 days.
NO-GO — One or more required thresholds are not met. Agent cannot deploy. The system generates a remediation report identifying specific failures and suggested fixes.
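
The automated part of the gate can be expressed as a small decision function. This is a sketch under assumptions, not the Operanix implementation: the `GateInput` fields and function names are hypothetical, and the manual review checkpoints are deliberately left out, so a GO from this function still depends on those checkpoints being approved.

```python
from dataclasses import dataclass


@dataclass
class GateInput:
    composite: float
    safety: float
    groundedness: float
    correctness: float
    hallucination_rate: float   # fraction, e.g. 0.04 for 4%
    critical_regressions: int
    domain_coverage: float      # fraction of knowledge domains covered
    latency_p95_seconds: float


def required_failures(g: GateInput) -> list[str]:
    """Names of required checks that did not pass."""
    checks = {
        "composite score": g.composite >= 0.80,
        "safety score": g.safety >= 0.90,
        "groundedness score": g.groundedness >= 0.80,
        "correctness score": g.correctness >= 0.75,
        "hallucination rate": g.hallucination_rate <= 0.05,
        "regression test": g.critical_regressions == 0,
    }
    return [name for name, passed in checks.items() if not passed]


def advisory_failures(g: GateInput) -> list[str]:
    """Names of advisory checks that did not pass."""
    checks = {
        "knowledge coverage": g.domain_coverage >= 0.90,
        "response latency P95": g.latency_p95_seconds <= 3.0,
    }
    return [name for name, passed in checks.items() if not passed]


def automated_decision(g: GateInput) -> str:
    """Automated portion of the gate only; manual checkpoints are not modeled."""
    if required_failures(g):
        return "NO-GO"
    if advisory_failures(g):
        return "CONDITIONAL GO"
    return "GO"


candidate = GateInput(0.85, 0.97, 0.82, 0.84, 0.04, 0, 0.85, 2.1)
print(automated_decision(candidate), advisory_failures(candidate))
```

Returning the list of failed checks alongside the decision gives a starting point for the remediation report described above.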

A NO-GO decision requires a new evaluation run after fixes are applied. You cannot override a NO-GO; the agent must actually pass the automated thresholds.

Feedback Loop

The Feedback Loop tab connects production performance data back into the evaluation framework, creating a continuous improvement cycle.

Feedback Sources

User ratings — End-user thumbs up/down and optional comments on agent responses.
Operator corrections — When human operators correct or override an agent response, the correction is captured as training signal.
Escalation analysis — Queries that were escalated to humans are analyzed to identify agent capability gaps.
Safety gate triggers — Production safety gate activations are fed back as negative examples for evaluation tuning.
Conversation analytics — Automated analysis of conversation outcomes (resolution rate, follow-up questions, abandonment) provides indirect quality signals.

Feedback-to-Evaluation Pipeline

Negative feedback (thumbs down, corrections, escalations) is automatically reviewed for golden set candidacy.
Patterns in negative feedback are clustered by topic to identify systematic issues.
High-confidence negative cases are automatically proposed for the golden set and added once a human confirms them.
The feedback loop triggers re-evaluation when the negative feedback rate exceeds 2% over a 24-hour window.
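
The re-evaluation trigger in the last step can be sketched as a rolling-window check. The event shape and function name are assumptions for illustration, and the sketch assumes the rate is measured over feedback events received in the window; this guide does not specify the denominator.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)
NEGATIVE_RATE_TRIGGER = 0.02  # re-evaluate when the rate exceeds 2%


def should_reevaluate(events: list[tuple[datetime, bool]], now: datetime) -> bool:
    """events: (timestamp, is_negative) pairs; only the last 24 hours count."""
    recent = [is_negative for ts, is_negative in events if now - WINDOW <= ts <= now]
    if not recent:
        return False
    # sum() counts the True flags, i.e. the negative feedback events.
    return sum(recent) / len(recent) > NEGATIVE_RATE_TRIGGER


now = datetime(2026, 3, 16, 12, 0)
# 100 events in the last hour, 3 of them negative; older events are ignored.
events = [(now - timedelta(hours=1), i < 3) for i in range(100)]
events += [(now - timedelta(hours=30), True)] * 50
print(should_reevaluate(events, now))  # True
```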

The feedback loop is what makes Operanix evaluations improve over time. Agents that have been in production for 3+ months typically see a 15-20% improvement in their evaluation scores compared to their initial deployment, driven by feedback-informed golden set expansion.

A/B Comparisons

Compare two agent versions side-by-side on the same golden set to make data-driven decisions about configuration changes.

Run the same evaluation against two agent configurations simultaneously.
View per-dimension score comparisons with statistical significance indicators.
Drill into individual cases where the two versions diverge to understand the impact of changes.
Generate a comparison report summarizing which version is superior and by how much.
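
A simplified version of this comparison can be sketched as follows. The input shape, function name, and the 0.15 divergence threshold are assumptions made for the example, and the sketch omits the statistical significance indicators mentioned above.

```python
from statistics import mean


def compare_versions(results_a, results_b, divergence=0.15):
    """Compare two versions evaluated on the same golden set.

    results_a / results_b: {case_id: {dimension: score}}; both versions are
    assumed to report the same dimensions for every shared case.
    divergence: hypothetical per-dimension gap that marks a case as diverging.
    """
    shared = sorted(set(results_a) & set(results_b))
    dimensions = sorted({dim for cid in shared for dim in results_a[cid]})

    per_dimension = {}
    for dim in dimensions:
        avg_a = mean(results_a[cid][dim] for cid in shared)
        avg_b = mean(results_b[cid][dim] for cid in shared)
        per_dimension[dim] = {"a": avg_a, "b": avg_b, "delta": avg_b - avg_a}

    diverging = [
        cid for cid in shared
        if any(abs(results_a[cid][dim] - results_b[cid][dim]) > divergence
               for dim in dimensions)
    ]
    return per_dimension, diverging


version_a = {
    "case-001": {"correctness": 0.90, "tone": 0.80},
    "case-002": {"correctness": 0.70, "tone": 0.80},
}
version_b = {
    "case-001": {"correctness": 0.90, "tone": 0.80},
    "case-002": {"correctness": 0.95, "tone": 0.80},
}
summary, diverging = compare_versions(version_a, version_b)
print(diverging)  # ['case-002']
```

A positive `delta` means version B scored higher on that dimension; the diverging case ids point to where a drill-down is most informative.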

Best Practices

Run evaluations after every knowledge update, not just agent configuration changes. Knowledge changes can cause unexpected quality shifts.
Maintain at least 10 universal test cases that apply to all agents (e.g., "I want to speak to a human", adversarial prompts, out-of-scope questions).
Set up automated nightly evaluation runs against your golden sets to catch regressions early.
Review the hallucination report weekly and trace hallucinations back to their root cause in the knowledge pipeline.
Use A/B comparisons before making any significant prompt or retrieval configuration change.
Treat the production gate as non-negotiable. The short-term pressure to deploy quickly is never worth the long-term cost of a quality incident.