Proving Agent Quality With Data

The Thesis

Can a system of specialized AI agents running on local models (Gemma 26B, free) match or beat a monolithic cloud agent (Claude Sonnet, paid) for personal task management?

We’re testing this with a series of experiments, each building evidence for the next. Every claim is backed by eval scores, not opinions.

The Scorecard

| Domain | Rusty | Specialist (Sonnet) | Best Local | Model | Verdict |
| --- | --- | --- | --- | --- | --- |
| Media | 2.52 | 3.53 (+40%) | 3.57 | Gemma 26B | Local viable |
| Productivity | 2.77 | 2.89 (+4%) | 2.62 | Qwen 3.5 | Needs Sonnet |
| Cross-domain | — | — | — | — | Planned |
| Supervisor | — | — | — | — | Planned |

The Experiments

EXP-001: Does Splitting a Monolithic Agent Into Specialists Improve Eval Scores? Specialization improves media evals by 40%, and Gemma 26B matches Sonnet on the focused domain. We also found that eval infrastructure quality (harness mocks, scorer bugs) matters more than expected.

EXP-002: Do Mock Evals Predict Real-World Agent Quality? Real APIs score 5.6% higher than mocks, and 93% of evals are stable across 3 runs. Mock evals are a trustworthy, conservative lower bound; the only variance source is model non-determinism, not API instability.
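The stability figure above can be computed with a simple check: a case is stable if it gets the same pass/fail verdict in every run. A minimal sketch (the function and data names are illustrative, not the project's actual harness):

```python
def stability(runs: list[dict[str, bool]]) -> float:
    """Fraction of eval cases whose pass/fail verdict is identical in every run.

    Each run maps case id -> pass/fail verdict for the same set of cases.
    """
    case_ids = runs[0].keys()
    stable = sum(1 for cid in case_ids if len({run[cid] for run in runs}) == 1)
    return stable / len(case_ids)

# Three runs over three cases; only "c3" flips between runs.
runs = [
    {"c1": True, "c2": False, "c3": True},
    {"c1": True, "c2": False, "c3": False},
    {"c1": True, "c2": False, "c3": True},
]
print(stability(runs))  # 2 of 3 cases stable across all runs
```

Running the suite N times and reporting this fraction is what lets you attribute residual variance to model non-determinism rather than API flakiness.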

EXP-003: Does Agent Specialization Replicate for Productivity Tasks? Specialization replicates, but with smaller gains (+4% vs. +40%). We tested Gemma 26B, Gemma with an optimized prompt (+8.9%), and Qwen 3.5 (+11.5%). Qwen is the best local model for productivity but still trails Sonnet by 9%. The best model is domain-dependent.

EXP-004: Cross-Domain Routing What happens when a request spans two agents? (“Find that movie and add a watch party to the calendar.”) Tests routing architectures for multi-agent coordination.
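One routing architecture EXP-004 could test is a supervisor that decomposes a cross-domain request into per-domain subtasks and dispatches each to the matching specialist. A hypothetical sketch (the agent registry and decomposition are illustrative, since this experiment is still planned):

```python
from typing import Callable

# Hypothetical registry mapping domains to specialist agents.
# Real specialists would be model-backed; lambdas stand in here.
AGENTS: dict[str, Callable[[str], str]] = {
    "media": lambda task: f"[media agent] {task}",
    "productivity": lambda task: f"[productivity agent] {task}",
}

def route(subtasks: list[tuple[str, str]]) -> list[str]:
    """Dispatch (domain, task) pairs to the registered specialists."""
    return [AGENTS[domain](task) for domain, task in subtasks]

# "Find that movie and add a watch party to the calendar" decomposes into:
results = route([
    ("media", "find the movie"),
    ("productivity", "add a watch party to the calendar"),
])
print(results)
```

The interesting eval questions are upstream of this dispatch loop: who does the decomposition (supervisor vs. a router model), and how errors in one subtask propagate to the other.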

EXP-005: Failure Modes & Recovery How do agents handle errors, ambiguity, and wrong information? Production quality means graceful degradation.

EXP-007: Rusty on Local Models (Capstone) Can the supervisor agent run on a local model once specialized agents handle the domains? The final test of a fully local stack.

What We’ve Learned So Far

  1. Specialization works, but the gains are domain-dependent. +40% for media (7 tools, short chains), +4% for productivity (20 tools, long chains). The benefit scales with how much you narrow the scope.

  2. Local models are viable for some domains, not all. Gemma 26B matches Sonnet on media (short chains, 7 tools) but drops 19% on productivity (long chains, 20 tools). Qwen 3.5 is better for productivity but worse for media. The best local model is domain-specific — use the eval data to decide, not assumptions.

  3. Your eval infrastructure matters as much as your agents. Harness quality, scorer bugs, and mock fidelity each independently affected our measurements. We found a 40% improvement hiding behind a scorer bug.

  4. Mock evals are a trustworthy lower bound. Validated against real APIs — real-world scores are 5.6% higher than mocks. Safe to develop against mocks, validate against production periodically.

Methodology

All experiments follow practitioner best practices from Hamel Husain, Eugene Yan, Braintrust, and applied-llms.org:

  • Binary pass/fail scoring on specific dimensions (not vague 1-5 “quality”)
  • Domain expert calibration (one person’s judgment drives the system)
  • Data-first: look at outputs before defining criteria
  • Production flywheel: traces → human judgment → eval cases → automation
  • Different model families for generation vs judging (avoids self-enhancement bias)
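The first of these practices, binary pass/fail scoring on named dimensions, aggregates naturally into per-dimension pass rates. A minimal sketch, with hypothetical case and dimension names (the project's real harness and scorers are not shown here):

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    """One trace plus the binary dimensions it was judged on."""
    trace_id: str
    output: str
    dimensions: dict[str, bool]  # dimension name -> pass/fail verdict

def score(cases: list[EvalCase]) -> dict[str, float]:
    """Aggregate per-dimension pass rates across all eval cases."""
    totals: dict[str, list[bool]] = {}
    for case in cases:
        for dim, passed in case.dimensions.items():
            totals.setdefault(dim, []).append(passed)
    return {dim: sum(verdicts) / len(verdicts) for dim, verdicts in totals.items()}

cases = [
    EvalCase("t1", "...", {"correct_tool": True, "grounded": True}),
    EvalCase("t2", "...", {"correct_tool": True, "grounded": False}),
]
print(score(cases))  # {'correct_tool': 1.0, 'grounded': 0.5}
```

Per-dimension pass rates make failures actionable ("grounding regressed") in a way that an averaged 1-5 quality score cannot.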

Full reference: Production Eval Reference