Agent Evaluation & Readiness Guide
Before any Operanix agent reaches production, it must pass a rigorous evaluation. This guide covers 7-dimension quality scoring, golden set testing, regression detection, hallucination analysis, and the GO/NO-GO production gate.
Evaluation Tabs Overview
The Agent Readiness module contains 10 tabs, each focused on a specific aspect of agent quality assurance:
| # | Tab | Purpose |
|---|---|---|
| 1 | Readiness Dashboard | Aggregate readiness score and dimension breakdown for each agent |
| 2 | Evaluation Runs | History of all evaluation executions with scores and trends |
| 3 | Golden Sets | Manage curated question-answer pairs used as ground truth |
| 4 | Dimension Scores | Detailed per-dimension analysis with examples and failure breakdowns |
| 5 | Regression Testing | Automated comparison against previous evaluation baselines |
| 6 | Hallucination Report | Dedicated analysis of groundedness failures and unsupported claims |
| 7 | Production Gate | GO/NO-GO decision with automated and manual checkpoints |
| 8 | Feedback Loop | Production feedback integration for continuous improvement |
| 9 | A/B Comparisons | Side-by-side comparison of agent versions or configurations |
| 10 | Export & Reports | Generate evaluation reports for stakeholders and compliance |
7-Dimension Quality Scoring
Every agent response in an evaluation run is scored across seven dimensions. Each dimension uses a 0–1 scale with LLM-as-judge evaluation, calibrated against human-labeled reference sets.
1. Correctness (weight: 20%)
Measures whether the agent's answer is factually correct. The answer is compared against the golden set's expected answer using both exact-match heuristics and semantic equivalence scoring. Partial credit is awarded for answers that are correct but incomplete.
2. Relevance (weight: 15%)
Evaluates whether the response directly addresses the user's question without unnecessary information. Penalizes tangential content, excessive caveats, and off-topic elaboration. A concise, on-point answer scores higher than a verbose but technically correct one.
3. Tone (weight: 10%)
Assesses whether the response matches the agent's configured persona and communication style. Checks for consistency in formality level, brand voice adherence, empathy markers, and professional language. The tone criteria are customizable per agent.
4. Safety (weight: 15%)
Detects harmful, inappropriate, or risky content in the response. Checks for toxic language, unauthorized advice (medical, legal, financial without disclaimers), PII exposure, and prompt injection leakage. Any critical safety failure results in an automatic score of 0.
5. Groundedness (weight: 20%)
Measures whether every claim in the response is supported by the retrieved knowledge chunks. Each statement is extracted and traced back to a source. Statements without source support are flagged as ungrounded. This is the primary anti-hallucination metric.
6. Faithfulness (weight: 10%)
Evaluates whether the response accurately represents the source material without distortion. This differs from groundedness: a response can be grounded (its claims exist in the sources) but unfaithful (it misrepresents what the source says). Checks for cherry-picking, misquoting, and context manipulation.
7. Context Utilization (weight: 10%)
Measures how effectively the agent uses the retrieved context. Penalizes agents that ignore relevant context chunks or fail to synthesize information from multiple sources. Rewards comprehensive answers that draw on all available context.
Composite Score Calculation
```text
Composite Score = (
    0.20 x Correctness +
    0.15 x Relevance +
    0.10 x Tone +
    0.15 x Safety +
    0.20 x Groundedness +
    0.10 x Faithfulness +
    0.10 x Context Utilization
)

// Minimum thresholds (any dimension below threshold = NO-GO)
Safety       >= 0.90
Groundedness >= 0.80
Correctness  >= 0.75
All others   >= 0.60
```
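The weighted sum and the per-dimension minimums can be sketched in Python as follows. The weights and thresholds are taken from the formula above; the function names, the dictionary keys, and the example scores are illustrative assumptions, not part of the Operanix API.

```python
# Weights and minimum thresholds copied from the composite score formula.
# All names below are hypothetical and chosen for illustration only.
WEIGHTS = {
    "correctness": 0.20,
    "relevance": 0.15,
    "tone": 0.10,
    "safety": 0.15,
    "groundedness": 0.20,
    "faithfulness": 0.10,
    "context_utilization": 0.10,
}
MIN_THRESHOLDS = {"safety": 0.90, "groundedness": 0.80, "correctness": 0.75}
DEFAULT_MIN = 0.60  # applies to "all others"


def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of the seven dimension scores (each on a 0-1 scale)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


def threshold_failures(scores: dict[str, float]) -> list[str]:
    """Dimensions whose score falls below its minimum threshold."""
    return [
        dim for dim in WEIGHTS
        if scores[dim] < MIN_THRESHOLDS.get(dim, DEFAULT_MIN)
    ]


# Made-up example scores for a single evaluation run.
example = {
    "correctness": 0.84, "relevance": 0.90, "tone": 0.80, "safety": 0.97,
    "groundedness": 0.82, "faithfulness": 0.85, "context_utilization": 0.70,
}
print(round(composite_score(example), 4))
print(threshold_failures(example))
```

A non-empty list from `threshold_failures` corresponds to the NO-GO condition in the threshold comment, regardless of how high the composite score is.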
Golden Set Management
Golden sets are curated collections of question-answer pairs that serve as ground truth for evaluation. They are the foundation of repeatable, objective agent testing.
Golden Set Structure
```json
{
  "id": "gs-support-v3",
  "agent": "customer-support",
  "version": 3,
  "cases": [
    {
      "id": "case-001",
      "question": "How do I reset my password?",
      "expected_answer": "Go to Settings > Security > Reset Password...",
      "context_chunks": ["kb-auth-001", "kb-auth-002"],
      "tags": ["authentication", "self-service"],
      "difficulty": "easy",
      "created_by": "jane@company.com",
      "created_at": "2026-03-15T10:00:00Z"
    }
  ]
}
```
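As a rough illustration of how this structure might be consumed, the sketch below parses a golden set document into typed records using only the Python standard library. The `GoldenCase` class and `load_cases` function are hypothetical names, and the set of allowed difficulty values is assumed from the easy/medium/hard levels this guide uses.

```python
import json
from dataclasses import dataclass

ALLOWED_DIFFICULTIES = {"easy", "medium", "hard"}  # assumed from this guide


@dataclass(frozen=True)
class GoldenCase:
    """One question-answer pair from a golden set (hypothetical type)."""
    id: str
    question: str
    expected_answer: str
    context_chunks: tuple[str, ...]
    tags: tuple[str, ...]
    difficulty: str


def load_cases(raw: str) -> list[GoldenCase]:
    """Parse a golden set JSON document and return its cases."""
    doc = json.loads(raw)
    cases = []
    for item in doc["cases"]:
        if item["difficulty"] not in ALLOWED_DIFFICULTIES:
            raise ValueError(f"{item['id']}: unknown difficulty {item['difficulty']!r}")
        cases.append(GoldenCase(
            id=item["id"],
            question=item["question"],
            expected_answer=item["expected_answer"],
            context_chunks=tuple(item["context_chunks"]),
            tags=tuple(item["tags"]),
            difficulty=item["difficulty"],
        ))
    return cases


sample = """
{"id": "gs-support-v3", "agent": "customer-support", "version": 3,
 "cases": [{"id": "case-001",
            "question": "How do I reset my password?",
            "expected_answer": "Go to Settings > Security > Reset Password...",
            "context_chunks": ["kb-auth-001", "kb-auth-002"],
            "tags": ["authentication", "self-service"],
            "difficulty": "easy",
            "created_by": "jane@company.com",
            "created_at": "2026-03-15T10:00:00Z"}]}
"""
cases = load_cases(sample)
print(cases[0].id, cases[0].difficulty)
```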
Golden Set Best Practices
- Minimum 50 cases per agent — Include a mix of easy (40%), medium (35%), and hard (25%) questions to test the full capability range.
- Cover all knowledge domains — Ensure every domain assigned to the agent has at least 5 golden set cases.
- Include edge cases — Add questions that test boundary conditions: ambiguous queries, multi-part questions, questions requiring cross-domain synthesis, and out-of-scope questions (agent should gracefully decline).
- Version control — Golden sets are versioned. When knowledge changes, update the golden set to reflect new expected answers. Keep previous versions for regression comparison.
- Regular review — Schedule monthly golden set reviews to retire stale cases and add cases based on real production failures.
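Several of the practices above are mechanical enough to check automatically. The sketch below is one possible checker, not an Operanix feature: it assumes each case's `tags` list includes the name of its knowledge domain, and the 10-point tolerance on the difficulty mix is an arbitrary choice made for this example.

```python
from collections import Counter

MIN_CASES = 50
MIN_CASES_PER_DOMAIN = 5
TARGET_MIX = {"easy": 0.40, "medium": 0.35, "hard": 0.25}
MIX_TOLERANCE = 0.10  # assumption: allowed deviation from each target share


def check_golden_set(cases: list[dict], domains: list[str]) -> list[str]:
    """Return human-readable issues; an empty list means all checks passed."""
    issues = []
    if len(cases) < MIN_CASES:
        issues.append(f"only {len(cases)} cases; need at least {MIN_CASES}")

    # Assumption: a case covers a domain when the domain name is among its tags.
    tag_counts = Counter(tag for case in cases for tag in set(case["tags"]))
    for domain in domains:
        if tag_counts[domain] < MIN_CASES_PER_DOMAIN:
            issues.append(
                f"domain {domain!r} has {tag_counts[domain]} cases; "
                f"need at least {MIN_CASES_PER_DOMAIN}"
            )

    if cases:
        difficulty_counts = Counter(case["difficulty"] for case in cases)
        for level, target in TARGET_MIX.items():
            share = difficulty_counts[level] / len(cases)
            if abs(share - target) > MIX_TOLERANCE:
                issues.append(f"{level} share is {share:.0%}; target is {target:.0%}")
    return issues


# Synthetic golden set: 24 easy, 21 medium, 15 hard, all tagged "authentication".
golden = (
    [{"tags": ["authentication"], "difficulty": "easy"}] * 24
    + [{"tags": ["authentication"], "difficulty": "medium"}] * 21
    + [{"tags": ["authentication"], "difficulty": "hard"}] * 15
)
print(check_golden_set(golden, ["authentication", "billing"]))
```

In this example only the hypothetical `billing` domain is reported, because no case carries that tag.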
Regression Testing
Regression testing automatically compares new evaluation results against a baseline to detect quality degradation before it reaches production.
How Regression Detection Works
- Baseline selection — The last successful production deployment evaluation serves as the baseline. You can also manually pin a specific evaluation run as the baseline.
- Per-case comparison — Each golden set case is compared between baseline and current run. Cases with score drops of more than 0.15 are flagged as regressions.
- Dimension-level regression — Even if the composite score holds, a drop in any single dimension of more than 0.10 triggers a regression warning.
- Statistical significance — Small score fluctuations are expected. The system uses paired t-tests to determine whether observed differences are statistically significant (p < 0.05).
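The comparison rules above could be expressed roughly as follows. This is a standard-library sketch with hypothetical function names, not the platform's implementation. It computes the paired t statistic but stops short of a p-value, which needs the t-distribution CDF (available, for example, through `scipy.stats.ttest_rel`).

```python
import math
from statistics import mean, stdev

CASE_DROP = 0.15       # per-case regression threshold from this guide
DIMENSION_DROP = 0.10  # per-dimension warning threshold from this guide


def regressed_cases(baseline: dict[str, float], current: dict[str, float]) -> list[str]:
    """Case ids whose score dropped by more than CASE_DROP."""
    shared = baseline.keys() & current.keys()
    return sorted(c for c in shared if baseline[c] - current[c] > CASE_DROP)


def dimension_warnings(baseline: dict[str, float], current: dict[str, float]) -> list[str]:
    """Dimensions whose aggregate score dropped by more than DIMENSION_DROP."""
    shared = baseline.keys() & current.keys()
    return sorted(d for d in shared if baseline[d] - current[d] > DIMENSION_DROP)


def paired_t_statistic(baseline: dict[str, float], current: dict[str, float]) -> float:
    """t statistic of the per-case differences (current minus baseline)."""
    shared = sorted(baseline.keys() & current.keys())
    diffs = [current[c] - baseline[c] for c in shared]
    if len(diffs) < 2:
        raise ValueError("need at least two paired cases")
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


# Made-up per-case composite scores for two runs.
base_cases = {"case-017": 0.90, "case-018": 0.80, "case-019": 0.85}
curr_cases = {"case-017": 0.70, "case-018": 0.78, "case-019": 0.86}
print(regressed_cases(base_cases, curr_cases))
print(paired_t_statistic(base_cases, curr_cases) < 0)  # negative t: scores fell on average
```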
Regression Report
| Metric | Baseline | Current | Delta | Status |
|---|---|---|---|---|
| Composite | 0.87 | 0.85 | -0.02 | Pass |
| Correctness | 0.89 | 0.84 | -0.05 | Warning |
| Safety | 0.96 | 0.97 | +0.01 | Pass |
| Groundedness | 0.88 | 0.82 | -0.06 | Warning |
| Regressed cases | | case-017, case-034, case-089 | | Investigate |
Hallucination Detection
The Hallucination Report provides deep analysis of every instance where an agent made claims not supported by its knowledge base.
Hallucination Categories
- Fabricated facts — Agent invents information that does not exist in any source. Example: citing a non-existent policy or making up a product feature.
- Exaggerated claims — Agent overstates what the source material says. Example: "guaranteed 99.99% uptime" when the source says "targeting 99.9% uptime."
- Conflated information — Agent merges details from different sources incorrectly. Example: attributing features of Product A to Product B.
- Outdated references — Agent cites information from superseded content that was not properly retired from the knowledge base.
- Implicit assumptions — Agent makes logical leaps not explicitly stated in the source. Example: inferring a feature exists because related features exist.
Hallucination Trace
Each detected hallucination includes a trace showing:
- The specific claim in the agent's response
- The retrieved context chunks that were available
- The groundedness score for that claim
- The nearest source passage (if any partial match exists)
- A suggested correction based on available knowledge
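One way to picture a trace is as a small record with one field per item in the list above, plus the category it falls under. The class names, field names, chunk id, and score in this sketch are hypothetical; the actual export format of the Hallucination Report is not described here.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HallucinationCategory(Enum):
    FABRICATED_FACT = "fabricated_fact"
    EXAGGERATED_CLAIM = "exaggerated_claim"
    CONFLATED_INFORMATION = "conflated_information"
    OUTDATED_REFERENCE = "outdated_reference"
    IMPLICIT_ASSUMPTION = "implicit_assumption"


@dataclass(frozen=True)
class HallucinationTrace:
    """One detected hallucination (hypothetical record layout)."""
    claim: str                              # the specific claim in the response
    context_chunk_ids: tuple[str, ...]      # retrieved chunks that were available
    groundedness_score: float               # score for this claim, 0-1
    nearest_source_passage: Optional[str]   # None when no partial match exists
    suggested_correction: str
    category: HallucinationCategory


def count_by_category(traces: list[HallucinationTrace]) -> Counter:
    """Tally traces per category, e.g. for a weekly review."""
    return Counter(trace.category for trace in traces)


# Illustrative trace built from the exaggerated-claim example in this guide.
trace = HallucinationTrace(
    claim="guaranteed 99.99% uptime",
    context_chunk_ids=("kb-sla-001",),  # made-up chunk id
    groundedness_score=0.35,            # made-up score
    nearest_source_passage="targeting 99.9% uptime",
    suggested_correction="We are targeting 99.9% uptime.",
    category=HallucinationCategory.EXAGGERATED_CLAIM,
)
print(count_by_category([trace]))
```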
GO/NO-GO Production Gate
The Production Gate is the final checkpoint before an agent can be deployed to production. It combines automated scoring with manual review checkpoints.
Automated Checks
| Check | Threshold | Enforcement |
|---|---|---|
| Composite score | ≥ 0.80 | Required |
| Safety score | ≥ 0.90 | Required |
| Groundedness score | ≥ 0.80 | Required |
| Correctness score | ≥ 0.75 | Required |
| Hallucination rate | ≤ 5% | Required |
| Regression test | No critical regressions | Required |
| Knowledge coverage | ≥ 90% domains covered | Advisory |
| Response latency P95 | ≤ 3 seconds | Advisory |
Manual Review Checkpoints
- Sample conversation review — Reviewer reads 10 randomly selected evaluation conversations and confirms quality.
- Edge case verification — Reviewer tests the agent with adversarial or unusual queries not in the golden set.
- Persona consistency check — Reviewer confirms the agent's tone and behavior match the intended persona across different query types.
- Escalation path test — Reviewer verifies the agent correctly escalates queries it cannot handle to human operators.
Gate Decision
- GO: All automated checks pass and all manual checkpoints are approved. The agent is cleared for production deployment via the governance approval chain.
- CONDITIONAL GO: All required automated checks pass, but one or more advisory thresholds are not met. The agent can deploy with monitoring flags and a mandatory re-evaluation within 7 days.
- NO-GO: One or more required thresholds are not met. The agent cannot deploy. The system generates a remediation report identifying specific failures and suggested fixes.
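The decision logic can be summarized in a short function. This is a sketch under stated assumptions: the thresholds come from the Automated Checks table, while the `GateInputs` type and its field names are invented for illustration. The guide does not say what happens when a manual checkpoint is not approved, so the sketch conservatively treats that case as NO-GO.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GateInputs:
    """Inputs to the gate (hypothetical field names)."""
    composite: float
    safety: float
    groundedness: float
    correctness: float
    hallucination_rate: float      # fraction, so 5% is 0.05
    critical_regressions: int
    domain_coverage: float         # fraction of domains covered
    latency_p95_seconds: float
    manual_checkpoints_approved: bool


def gate_decision(g: GateInputs) -> str:
    required_ok = (
        g.composite >= 0.80
        and g.safety >= 0.90
        and g.groundedness >= 0.80
        and g.correctness >= 0.75
        and g.hallucination_rate <= 0.05
        and g.critical_regressions == 0
    )
    # Assumption: unapproved manual checkpoints block deployment.
    if not required_ok or not g.manual_checkpoints_approved:
        return "NO-GO"
    advisory_ok = g.domain_coverage >= 0.90 and g.latency_p95_seconds <= 3.0
    return "GO" if advisory_ok else "CONDITIONAL GO"


healthy = GateInputs(
    composite=0.87, safety=0.96, groundedness=0.88, correctness=0.89,
    hallucination_rate=0.02, critical_regressions=0,
    domain_coverage=0.95, latency_p95_seconds=2.1,
    manual_checkpoints_approved=True,
)
print(gate_decision(healthy))
print(gate_decision(replace(healthy, latency_p95_seconds=4.0)))
print(gate_decision(replace(healthy, safety=0.85)))
```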
Feedback Loop
The Feedback Loop tab connects production performance data back into the evaluation framework, creating a continuous improvement cycle.
Feedback Sources
- User ratings — End-user thumbs up/down and optional comments on agent responses.
- Operator corrections — When human operators correct or override an agent response, the correction is captured as training signal.
- Escalation analysis — Queries that were escalated to humans are analyzed to identify agent capability gaps.
- Safety gate triggers — Production safety gate activations are fed back as negative examples for evaluation tuning.
- Conversation analytics — Automated analysis of conversation outcomes (resolution rate, follow-up questions, abandonment) provides indirect quality signals.
Feedback-to-Evaluation Pipeline
- Negative feedback (thumbs down, corrections, escalations) is automatically reviewed for golden set candidacy.
- Patterns in negative feedback are clustered by topic to identify systematic issues.
- High-confidence negative cases are automatically proposed for the golden set and added once a human confirms them.
- The feedback loop triggers re-evaluation when the negative feedback rate exceeds 2% over a 24-hour window.
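The re-evaluation trigger in the last step could look roughly like this. The guide does not define the denominator of the negative feedback rate; this sketch assumes it is the share of negative events among all feedback events received in the trailing 24 hours. The function name and event shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone

NEGATIVE_RATE_THRESHOLD = 0.02   # 2%, from this guide
WINDOW = timedelta(hours=24)     # 24-hour window, from this guide


def should_reevaluate(events: list[tuple[datetime, bool]], now: datetime) -> bool:
    """events holds (timestamp, is_negative) pairs; True means trigger a run."""
    window_start = now - WINDOW
    recent = [is_negative for ts, is_negative in events if window_start < ts <= now]
    if not recent:
        return False  # no feedback in the window, nothing to act on
    negative_rate = sum(recent) / len(recent)
    return negative_rate > NEGATIVE_RATE_THRESHOLD


now = datetime(2026, 3, 16, 12, 0, tzinfo=timezone.utc)
one_hour_ago = now - timedelta(hours=1)
two_days_ago = now - timedelta(days=2)

# 3 negative out of 100 recent events; older negatives fall outside the window.
events = (
    [(one_hour_ago, True)] * 3
    + [(one_hour_ago, False)] * 97
    + [(two_days_ago, True)] * 50
)
print(should_reevaluate(events, now))
```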
A/B Comparisons
Compare two agent versions side-by-side on the same golden set to make data-driven decisions about configuration changes.
- Run the same evaluation against two agent configurations simultaneously.
- View per-dimension score comparisons with statistical significance indicators.
- Drill into individual cases where the two versions diverge to understand the impact of changes.
- Generate a comparison report summarizing which version is superior and by how much.
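A minimal sketch of the per-dimension comparison and the case-level drill-down is shown below. The data layout (`{case_id: {dimension: score}}`), the function names, and the 0.2 divergence gap are assumptions made for this example, and the significance indicators mentioned above are left out.

```python
from statistics import mean

Run = dict[str, dict[str, float]]  # {case_id: {dimension: score}}


def dimension_deltas(run_a: Run, run_b: Run) -> dict[str, float]:
    """Mean score of B minus mean score of A, per dimension, over shared cases."""
    shared = sorted(run_a.keys() & run_b.keys())
    if not shared:
        raise ValueError("the two runs share no cases")
    dimensions = run_a[shared[0]].keys()
    return {
        dim: mean(run_b[c][dim] for c in shared) - mean(run_a[c][dim] for c in shared)
        for dim in dimensions
    }


def diverging_cases(run_a: Run, run_b: Run, dimension: str, min_gap: float = 0.2) -> list[str]:
    """Cases where the two versions differ by at least min_gap on one dimension."""
    shared = run_a.keys() & run_b.keys()
    return sorted(
        c for c in shared
        if abs(run_b[c][dimension] - run_a[c][dimension]) >= min_gap
    )


# Made-up scores for two agent versions on the same two golden set cases.
version_a = {
    "case-001": {"correctness": 0.8, "tone": 0.9},
    "case-002": {"correctness": 0.6, "tone": 0.9},
}
version_b = {
    "case-001": {"correctness": 0.9, "tone": 0.9},
    "case-002": {"correctness": 0.9, "tone": 0.8},
}
print(dimension_deltas(version_a, version_b))
print(diverging_cases(version_a, version_b, "correctness"))
```

A positive delta means version B scored higher on that dimension across the shared cases.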
Best Practices
- Run evaluations after every knowledge update, not just agent configuration changes. Knowledge changes can cause unexpected quality shifts.
- Maintain at least 10 universal test cases that apply to all agents (e.g., "I want to speak to a human", adversarial prompts, out-of-scope questions).
- Set up automated nightly evaluation runs against your golden sets to catch regressions early.
- Review the hallucination report weekly and trace hallucinations back to their root cause in the knowledge pipeline.
- Use A/B comparisons before making any significant prompt or retrieval configuration change.
- Treat the production gate as non-negotiable. The short-term pressure to deploy quickly is never worth the long-term cost of a quality incident.