# AI Agent Evaluation: Practitioner Reference (LLM-Optimized)

> **Purpose**: Dense reference document for use as context in LLM sessions. Feed this URL/file to Claude, ChatGPT, or any model to give it current practitioner knowledge about production AI agent evaluation. Designed to be consumed as context, not read linearly by humans.
>
> **Last updated**: 2026-04-07
> **Maintained by**: Sumit Gupta (sumitngupta@gmail.com)

---

## CORE PRINCIPLES (Practitioner Consensus)

These principles are agreed upon by multiple independent practitioners who have shipped production eval systems. They are not theoretical — they come from post-mortems and operational experience.

### 1. Binary Pass/Fail Scoring
- **Rule**: Every evaluation is a binary pass/fail decision. Not 1-5, not 0-1 float.
- **Why**: Likert scales create ambiguity. Different reviewers interpret 3/5 differently. Binary forces a decision about what matters.
- **When you fail something**: Write a one-line critique explaining why. The critique is more valuable than any score.
- **Sources**: Hamel Husain (strongest advocate), Eugene Yan, Braintrust
- **Citation**: Hamel — "A binary decision forces everyone to consider what truly matters."

### 2. Domain Expert as Calibration Source
- **Rule**: One person's judgment calibrates the system. Not a committee.
- **Why**: Eval criteria drift as you see more data. Day 1 "good" ≠ day 30 "good". That's expected. But it must be one person's evolving judgment.
- **Flow**: Expert reviews traces → makes pass/fail calls → writes critiques → critiques become few-shot examples for LLM judges → judges are calibrated to expert's judgment.
- **Source**: Hamel Husain

### 3. Data Review = 60-80% of Effort
- **Rule**: Most eval effort is reviewing actual agent outputs, not building infrastructure.
- **Anti-pattern**: Building a fancy scoring pipeline but never reviewing traces.
- **Citation**: Hamel — "The real value of this process is looking at your data and doing careful analysis."
- **Sources**: All sources unanimously

### 4. Custom Scorers Over Generic Ones
- **Rule**: Build evaluation criteria from observed failures in YOUR system, not off-the-shelf metrics.
- **Why**: "Generic metrics embed someone else's requirements, not yours." Helpfulness/coherence/hallucination measure what someone else decided matters.
- **Approach**: Start generic for initial signal. Accumulate critiques. Build per-agent binary judges from observed failure modes.
- **Sources**: Hamel, Eugene Yan, Braintrust

### 5. Guardrails ≠ Evals
- **Guardrails**: Synchronous, in request path, fast, cheap, block bad output. Prevent harm.
- **Evals**: Asynchronous, post-response, expensive, subjective, feed improvement loops. Drive improvement.
- **Anti-pattern**: Running expensive LLM judges synchronously (slow responses) or only running cheap checks (missed insights).
- **Sources**: Eugene Yan, applied-llms.org

### 6. The Flywheel
- **Pattern**: Production trace → human judgment → becomes eval case → hardens suite → automated judge learns from critiques → monitors production → surfaces new traces → repeat.
- **Why**: Without this loop, your eval suite is frozen in time. It tests what you imagined users would do, not what they actually do.
- **Citation**: Braintrust — "Every production failure is a candidate for your eval suite."

### 7. Aspirational Evals
- **Rule**: Write evals that currently score ~10% but become viable when better models arrive.
- **Why**: When a new model drops, aspirational evals already answer "is this model ready?" without having to think about what to test.
- **Source**: Braintrust

---

## LLM JUDGE BIASES (Quantified Research)

From Eugene Yan's survey of LLM evaluator research literature.

| Bias | Effect | Magnitude | Mitigation |
|------|--------|-----------|------------|
| Position bias | Prefers first response in pairwise comparison | Claude-v1: ~70% | Swap order, run twice. N/A for single-output binary. |
| Verbosity bias | Prefers longer responses | GPT/Claude: >90% preference for verbose | Penalize unnecessary verbosity in judge prompt. |
| Self-enhancement | Rates own outputs higher | GPT-4: 10% boost, Claude-v1: 25% | Use different model family for judging vs generation. |
| Criteria drift | "Good" changes over time | Universal | Expected. Document evolution. Re-calibrate periodically. |
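The swap-order mitigation for position bias can be sketched as below; `judge` is a hypothetical pairwise callable (an LLM call in practice) returning which position won:

```python
# Run the pairwise judge twice with swapped order; only trust a consistent verdict.
# `judge(a, b)` is a hypothetical callable returning "A" if the first-position
# response wins, "B" otherwise.

def pairwise_judge_debiased(judge, response_1, response_2):
    first = judge(response_1, response_2)   # response_1 shown in position A
    second = judge(response_2, response_1)  # response_2 shown in position A
    # Map the second run's verdict back to the original labeling.
    second_mapped = "B" if second == "A" else "A"
    if first == second_mapped:
        return first          # consistent across both orderings
    return "tie"              # verdict flipped with order: position bias, call it a tie

# Toy judge with pure position bias (always prefers the first position):
biased = lambda a, b: "A"
assert pairwise_judge_debiased(biased, "x", "y") == "tie"
```

A judge with a genuine preference (e.g. one that always prefers the longer response) survives the swap and returns a definite verdict.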

### Panel of LLM Judges (PoLL)
- **Finding**: Ensemble of 3 smaller LLMs with majority voting outperformed GPT-4 alone as judge.
- **Also**: Explicitly instructing GPT-4 "don't overthink" improved judging performance.
- **Practical use**: If single judge isn't reliable, try 2-3 cheap models with majority voting before upgrading.
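A minimal majority-voting sketch; the toy lambdas stand in for real LLM judge calls:

```python
# Panel of LLM Judges (PoLL) sketch: strict majority vote over several cheap
# judges. `judges` is a list of callables returning True (pass) or False (fail).

def poll_verdict(judges, output):
    votes = [j(output) for j in judges]
    return sum(votes) > len(votes) / 2  # strict majority passes

# Toy panel of three judges (real ones would be small-model LLM calls):
panel = [
    lambda o: len(o) > 0,          # non-empty check
    lambda o: "error" not in o,    # no error marker
    lambda o: o == o.strip(),      # no stray whitespace
]
assert poll_verdict(panel, "fine answer") is True
assert poll_verdict(panel, " error ") is False
```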

### Hallucination Baseline
- **Fact**: Hallucination rates baseline at 5-10% on straightforward tasks and are difficult to suppress below 2%.
- **Source**: applied-llms.org
- **Implication**: Don't target 0%. Target knowing your rate and monitoring for increases.
- **Detection accuracy**: GPT-3.5-turbo: 84.88% on CNN factual consistency, 75.73% on XSUM. High precision (>95%) for consistent summaries, low recall (30-60%) for inconsistent ones.
- **Implicit hallucinations** (factually true but contradicts given context): 58.5% accuracy.

### Finetuned Judges: Cautionary Tale
- JudgeLM, PandaLM, Prometheus behave as task-specific classifiers, not general evaluators.
- High in-domain performance but poor generalization.
- Finetuned evaluators correlate more strongly with each other than with GPT-4 (they learned to agree with each other, not with humans).
- **Recommendation**: Don't fine-tune judge models. Use prompt-based judges with few-shot examples.

---

## HAMEL'S 6 EVAL SKILLS (Operational Playbook)

From github.com/hamelsmu/evals-skills. Executable steps, not theory.

### Workflow Order
```
eval-audit → error-analysis → write-judge-prompt → validate-evaluator
                ↑                                         ↓
                └──── generate-synthetic-data ←── build-review-interface
```

### Skill 1: eval-audit
- **When**: Starting a new project, inheriting an eval system, unsure if evals are trustworthy.
- **Checks 6 areas**: Error analysis quality, evaluator design, judge validation, human review process, labeled data sufficiency (~100 traces for analysis, ~50 pass/fail pairs for validation), pipeline hygiene.
- **Key insight**: "Evaluators built without error analysis measure generic qualities instead of actual failure modes."

### Skill 2: error-analysis (Highest ROI)
- **When**: After audit, after changes, when metrics degrade.
- **Process**: Gather ~100 traces → domain expert: pass or fail? → for failures, note root cause (not symptoms) → after 30-50 traces, group into 5-10 categories → let categories emerge organically → apply binary labels → compute failure rates.
- **Then**: Fix obvious gaps first (missing prompt instructions, absent tools, bugs). Then evaluate if remaining failures warrant dedicated evaluators.

### Skill 3: write-judge-prompt
- **When**: After error analysis identifies persistent failure modes that code checks can't catch.
- **4 components**: (1) Task & criterion — single failure mode, not "is this good?" (2) Binary definitions — concrete pass/fail from error analysis (3) Few-shot examples — 2-4 from training split, borderline cases most valuable (4) Structured output — JSON with critique preceding verdict.
- **Anti-patterns**: Vague criteria, evaluating whole traces instead of specific dimensions, missing few-shots, Likert scales, skipping validation, building judge for failure that should be fixed in the prompt.
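The four components can be assembled mechanically. In the sketch below, the criterion text, definitions, and few-shot examples are illustrative placeholders, not Hamel's wording:

```python
import json

def build_judge_prompt(criterion, pass_def, fail_def, few_shots, output):
    """Assemble a single-criterion binary judge prompt from the 4 components."""
    examples = "\n\n".join(
        f"Output: {ex['output']}\n"
        f"Verdict: {json.dumps({'critique': ex['critique'], 'verdict': ex['verdict']})}"
        for ex in few_shots
    )
    return (
        f"You evaluate ONE failure mode: {criterion}\n"        # 1. task & criterion
        f"PASS means: {pass_def}\nFAIL means: {fail_def}\n\n"  # 2. binary definitions
        f"Examples:\n{examples}\n\n"                           # 3. few-shot examples
        f"Output to evaluate: {output}\n"                      # 4. structured output:
        'Respond as JSON: {"critique": "...", "verdict": "pass" or "fail"}\n'
        "Write the critique BEFORE the verdict."
    )
```

The critique-before-verdict ordering matters: generating reasoning first tends to improve the final binary call.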

### Skill 4: validate-evaluator
- **Data split**: Training 10-20% (few-shots in prompt), Dev 40-45% (iterate), Test 40-45% (measure once).
- **Metrics**: TPR and TNR, NOT raw accuracy: class imbalance (80% pass / 20% fail) makes 80% accuracy trivial to achieve by always predicting pass.
- **Target**: Both TPR and TNR > 90% (minimum 80%).
- **Production**: Apply Rogan-Gladen bias correction for true success rates on unlabeled data.
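The split metrics and the Rogan-Gladen correction can be sketched in plain Python (no library assumptions):

```python
def tpr_tnr(human, judge):
    """human/judge: parallel lists of booleans (True = pass)."""
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum(not h and not j for h, j in zip(human, judge))
    tpr = tp / sum(human)                 # sensitivity on human-pass cases
    tnr = tn / (len(human) - sum(human))  # specificity on human-fail cases
    return tpr, tnr

def rogan_gladen(observed_pass_rate, tpr, tnr):
    """Bias-corrected true pass rate from the judge's observed rate; clamped to [0, 1]."""
    corrected = (observed_pass_rate + tnr - 1) / (tpr + tnr - 1)
    return min(1.0, max(0.0, corrected))

# With TPR=0.9 and TNR=0.8, an observed 75% judge pass rate implies ~78.6% true rate.
assert abs(rogan_gladen(0.75, 0.9, 0.8) - 0.55 / 0.7) < 1e-9
```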

### Skill 5: generate-synthetic-data
- **When**: Real traces sparse (new feature, new agent, pre-launch).
- **Process**: Define 3+ variation dimensions → draft 20 combinations → LLM generates 10+ more → convert to natural language in separate prompt (two-step = better variety) → filter for quality → run through pipeline → target ~100 diverse traces.
- **When NOT**: 100+ real traces available, complex domain-specific content, low-resource languages.
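The dimension-enumeration step can be sketched with `itertools.product`; dimension names and values here are illustrative. Each drafted tuple would then go to a separate LLM prompt (not shown) for natural-language conversion:

```python
import itertools
import random

# Three illustrative variation dimensions for a support-agent scenario.
dimensions = {
    "intent": ["refund", "order status", "product question"],
    "tone": ["polite", "frustrated", "terse"],
    "detail": ["vague", "specific", "over-detailed"],
}

combos = list(itertools.product(*dimensions.values()))  # full cross product
random.seed(0)
draft = random.sample(combos, 20)  # draft ~20 combinations to start

assert len(combos) == 27 and len(draft) == 20
```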

### Skill 6: build-review-interface
- **Core**: "Emails should look like emails. Code should have syntax highlighting."
- **Essential UX**: Binary Pass/Fail buttons as primary actions, free-text critique field, Defer option, auto-save, progress indicator.
- **Keyboard**: Arrow keys for navigation, 1/2 for Pass/Fail, D for Defer, U for Undo.

---

## ALIGNEVAL: DATA-FIRST WORKFLOW (Eugene Yan)

Key insight: "We should resist the urge to prematurely define criteria. We need to first immerse ourselves in the data."

### 4-Step Flow
1. **Upload**: CSV with id, input, output, label (binary 0/1)
2. **Label**: Binary pass/fail. Label BEFORE defining criteria; let judgment emerge from the data. 20 labels unlock evaluation; 50-100 are recommended.
3. **Evaluate**: Write task-specific criteria (two sentences defining pass/fail for ONE dimension). Select judge model. System computes recall, precision, F1, Cohen's κ, confusion matrix.
4. **Optimize**: With 50+ labels, run optimization trials. Example result: F1 improved from 0.571 → 0.727 across 5 trials.
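The metrics reported in step 3 can be computed from paired labels; a dependency-free sketch (1 = pass):

```python
def judge_metrics(human, judge):
    """Precision, recall, F1, and Cohen's kappa between human labels and judge verdicts."""
    tp = sum(1 for h, j in zip(human, judge) if h == 1 and j == 1)
    fp = sum(1 for h, j in zip(human, judge) if h == 0 and j == 1)
    fn = sum(1 for h, j in zip(human, judge) if h == 1 and j == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    n = len(human)
    po = sum(1 for h, j in zip(human, judge) if h == j) / n
    pe = (sum(human) / n) * (sum(judge) / n) + \
         ((n - sum(human)) / n) * ((n - sum(judge)) / n)
    kappa = (po - pe) / (1 - pe)
    return precision, recall, f1, kappa

p, r, f1, k = judge_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
assert abs(f1 - 2 / 3) < 1e-9
```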

### Critical Design Decision
**Label first, define criteria second.** The criteria emerge from your interaction with the data, not from a brainstorming session.

---

## FOUR-BOX CALIBRATION FRAMEWORK (Hamel)

For calibrating automated judges against human judgments:

|  | Human Pass | Human Fail |
|---|---|---|
| **Judge Pass** | True Positive ✓ | **False Positive** — judge too lenient. MOST DANGEROUS: bad outputs slip through. |
| **Judge Fail** | **False Negative** — judge too strict. Annoying but safe. | True Negative ✓ |

- After every 10-20 new human judgments: recompute the four-box.
- If false positives > 10%: tighten the judge.
- If false negatives > 20%: loosen it.
- Track precision/recall separately, not raw agreement (class imbalance makes agreement misleading).
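The recompute-and-threshold step can be sketched as follows; record shapes are illustrative:

```python
def four_box(pairs):
    """pairs: list of (human_pass, judge_pass) booleans from recent judgments."""
    fp = sum(1 for h, j in pairs if j and not h)   # judge too lenient (most dangerous)
    fn = sum(1 for h, j in pairs if h and not j)   # judge too strict (annoying but safe)
    n = len(pairs)
    action = None
    if fp / n > 0.10:
        action = "tighten"
    elif fn / n > 0.20:
        action = "loosen"
    return fp, fn, action

# 10 recent judgments: 2 false positives (20%) -> tighten the judge.
pairs = [(True, True)] * 7 + [(True, False)] * 1 + [(False, True)] * 2
assert four_box(pairs) == (2, 1, "tighten")
```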

---

## CI FOR AGENT CHANGES (3 Levels)

### Level 1: Assertion Tests (Every Change)
Fast, deterministic. Does agent respond? Forbidden patterns absent? Response within length bounds? Tool calls exist in registry? Valid structured output?
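A sketch of these checks; `FORBIDDEN` and `TOOL_REGISTRY` are illustrative placeholders for your own patterns and registry:

```python
FORBIDDEN = ["as an AI language model", "I cannot browse"]
TOOL_REGISTRY = {"search_orders", "send_email", "lookup_customer"}

def level1_checks(response_text, tool_calls, max_len=4000):
    """Fast, deterministic assertions run on every change. No LLM judge involved."""
    assert response_text.strip(), "agent did not respond"
    for pattern in FORBIDDEN:
        assert pattern.lower() not in response_text.lower(), f"forbidden: {pattern}"
    assert len(response_text) <= max_len, "response exceeds length bound"
    for call in tool_calls:
        assert call in TOOL_REGISTRY, f"unknown tool: {call}"
    return True

assert level1_checks("Order 123 has shipped.", ["search_orders"]) is True
```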

### Level 2: Golden Dataset Smoke Tests (Prompt/Model Changes)
10-20 critical scenarios per agent with expected tool calls and outputs. Run agent, check tool calls match expectations. 2-5 minutes. Blocks deployment on failure.
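The matching step can be sketched as below; the agent invocation itself is omitted, and in practice `actual_calls` would come from running the agent on each scenario's task:

```python
def check_scenario(expected_calls, actual_calls):
    """Strict, order-sensitive tool-call match; loosen to a set comparison
    if your agent legitimately varies call order."""
    return expected_calls == actual_calls

golden = [
    {"task": "where is order 42?", "expected": ["lookup_customer", "search_orders"]},
]
# actual_calls per scenario would be captured from the agent run:
results = [check_scenario(s["expected"], ["lookup_customer", "search_orders"])
           for s in golden]
assert all(results)  # any mismatch blocks deployment
```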

### Level 3: Full Eval Suite (Model Swaps, Weekly)
Complete golden dataset (100+ tasks). 30-60 minutes. Per-agent scores and comparison reports. Catches drift.

### Trigger Matrix
| Change Type | Level |
|-------------|-------|
| Any code change | Level 1 |
| Soul file or prompt change | Level 2 |
| Model swap | Level 3 |
| Weekly (catch drift) | Level 3 |

---

## SMART TRACE SAMPLING

Don't review random traces. Score by interestingness:

| Signal | Score | Why |
|--------|-------|-----|
| New model (recently switched) | +100 | Highest risk of regression |
| Tool errors | +80 | Direct failure signal |
| High latency (>2x rolling mean) | +60 | Model may be confused/looping |
| High token usage (>90th percentile) | +50 | Rambling, repeated attempts |
| Low automated score (<0.5) | +40 | Automated signal worth investigating |
| Unreviewed | +20 | Coverage gap |
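A sketch of the scoring function from the table above; trace field names are assumptions, so adapt them to your tracing schema:

```python
def interestingness(trace, latency_mean, token_p90):
    """Score a trace for review priority. Higher = review sooner."""
    score = 0
    if trace.get("model_recently_switched"):
        score += 100  # highest regression risk
    if trace.get("tool_errors", 0) > 0:
        score += 80   # direct failure signal
    if trace.get("latency_s", 0) > 2 * latency_mean:
        score += 60   # model may be confused/looping
    if trace.get("tokens", 0) > token_p90:
        score += 50   # rambling, repeated attempts
    if trace.get("auto_score", 1.0) < 0.5:
        score += 40   # automated signal worth investigating
    if not trace.get("reviewed", False):
        score += 20   # coverage gap
    return score

trace = {"tool_errors": 1, "auto_score": 0.3, "reviewed": False}
assert interestingness(trace, latency_mean=3.0, token_p90=5000) == 140
```

Sort the review queue by this score descending instead of sampling uniformly.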

---

## FAILURE CATEGORY TAXONOMY (Starter)

Let categories emerge from data. These are starting points:

| Category | Description | Typical Signal |
|----------|-------------|----------------|
| `wrong_tool` | Called wrong tool for task | tool_selection score low |
| `hallucinated_tool` | Invented nonexistent tool | Local/smaller model specific |
| `missing_action` | Understood request, didn't act | No tool calls in trace |
| `wrong_args` | Right tool, wrong arguments | args_accuracy score low |
| `poor_synthesis` | Raw data without analysis | Research agent specific |
| `tone_mismatch` | Personality doesn't match agent | response_tone score low |
| `incomplete_chain` | Started multi-step, didn't finish | Multi-tool agents |
| `wrong_delegation` | Delegated to wrong sub-agent | Supervisor agents |
| `unnecessary_delegation` | Delegated when could handle directly | Supervisor agents |

---

## SCORING APPROACHES (When to Use Which)

| Approach | Best For | Weakness |
|----------|----------|----------|
| **Direct scoring** | Objective tasks (faithfulness, policy) | Less reliable for subjective eval |
| **Pairwise comparison** | Subjective tasks (tone, coherence) | Can't evaluate single output in isolation |
| **Reference-based** | Tasks with gold-standard outputs | Requires annotated refs, can become fuzzy matching |

For single-agent binary pass/fail: direct scoring is correct.

---

## LANGFUSE SCORE TYPES

| Type | Format | Use Case |
|------|--------|----------|
| BOOLEAN | 0 or 1 | Pass/fail judgments |
| NUMERIC | float | Continuous metrics (latency, cost) |
| CATEGORICAL | string | Failure categories, model names |
| comment field | string | Critique text, reasoning |

All types support a `comment` field for context.

---

## GOODHART'S LAW WARNING

"When a measure becomes a target, it ceases to be a good measure."

**Example from applied-llms.org**: Needle-in-a-Haystack eval achieved near-perfect scores, but real-world extraction (medication from transcripts: ~80%, pizza ingredients: ~30%) was far worse.

**Rule**: Don't chase pass rates. If you're passing 100% of evals, the bar is too low. A 70% pass rate with meaningful tests beats 100% with easy ones. The eval suite should make you uncomfortable.

---

## LANGFUSE ANNOTATION QUEUES: BUILD VS BUY

### What Langfuse Provides
Queue creation with predefined score configs, bulk/single population, "complete + next" progression, score attachment with comments, API access, user assignment.

### What's Missing (Justifying Custom UI)
No smart sampling (manual queue population only), no keyboard shortcuts documented, no failure category tagging, no progress gamification, no agent-specific filtering within queue, limited trace view customization.

### Recommendation
Use Langfuse as data store. Build custom review UI for power-user workflow. Store all judgments via Langfuse API. Get querying/visualization/experiment integration for free while having fast, keyboard-driven review.

For team contexts: start with Langfuse built-in queues, build custom once workflow friction is understood.

---

## KEY PRACTITIONER SOURCES

| Source | Key Contribution | URL |
|--------|-----------------|-----|
| Braintrust: "Evals are the new PRD" | Flywheel, maturity stages | braintrust.dev/blog/evals-are-the-new-prd |
| Braintrust: "Five hard-learned lessons" | Aspirational evals, whole-loop optimization | braintrust.dev/blog/five-lessons-evals |
| Braintrust: "I ran an eval. Now what?" | Four-box framework, iteration cycle | braintrust.dev/blog/after-evals |
| Hamel: "Your AI Product Needs Evals" | Three-level hierarchy, trace-viewing tools | hamel.dev/blog/posts/evals/ |
| Hamel: "LLM-as-Judge Complete Guide" | Critique Shadowing, binary pass/fail, calibration | hamel.dev/blog/posts/llm-judge/ |
| Hamel: "Evals FAQ" | 60-80% error analysis, start simple | hamel.dev/blog/posts/evals-faq/ |
| Hamel: "Selecting Eval Tools" | Custom UI on data stores, transparency > magic | hamel.dev/blog/posts/eval-tools/ |
| Hamel: "Field Guide" | Custom data viewers, experiment-based roadmaps | hamel.dev/blog/posts/field-guide/ |
| Hamel: evals-skills | 6 Claude Code skills for eval workflows | github.com/hamelsmu/evals-skills |
| Eugene Yan: "Product Evals in 3 Steps" | Label → align → run harness, position bias | eugeneyan.com/writing/product-evals/ |
| Eugene Yan: "LLM-as-Judge Won't Save" | Process > tools, scientific method | eugeneyan.com/writing/eval-process/ |
| Eugene Yan: "Evaluating LLM-Evaluators" | Bias survey, pairwise > direct, panel approach | eugeneyan.com/writing/llm-evaluators/ |
| Eugene Yan: "AlignEval" | Data-first design, label before criteria | eugeneyan.com/writing/aligneval/ |
| Eugene Yan: "Task-Specific Evals" | NLI for consistency, binary annotation | eugeneyan.com/writing/evals/ |
| Eugene Yan: "LLM Patterns" | Eval-driven development, metric types | eugeneyan.com/writing/llm-patterns/ |
| applied-llms.org | Intern test, Goodhart's Law, 5-10% hallucination baseline | applied-llms.org |
| Langfuse docs | Score API, annotation queues, datasets | langfuse.com/docs/scores/custom |

---

## HOW TO USE THIS DOCUMENT

**For Claude Code / ChatGPT sessions**: Paste this URL or file content at the start of a session when working on AI evaluation, agent quality, or LLM judge design. It provides current practitioner consensus that may not be in the model's training data.

**For keeping current**: This document will be periodically updated by a research agent that monitors the source blogs and adds new findings. Check the "Last updated" date at the top.

**For applying to your system**: The principles are universal. The specific architecture (Mastra, Langfuse, local inference) is one implementation. Adapt the patterns to your stack.
