# Proving Agent Quality With Data
## The Thesis
Can a system of specialized AI agents running on local models (Gemma 26B, free) match or beat a monolithic cloud agent (Claude Sonnet, paid) for personal task management?
We’re testing this with a series of experiments, each building evidence for the next. Every claim is backed by eval scores, not opinions.
## The Scorecard
| Domain | Rusty | Specialist (Sonnet) | Best Local | Model | Verdict |
|---|---|---|---|---|---|
| Media | 2.52 | 3.53 (+40%) | 3.57 | Gemma 26B | Local viable |
| Productivity | 2.77 | 2.89 (+4%) | 2.62 | Qwen 3.5 | Needs Sonnet |
| Cross-domain | — | — | — | — | Planned |
| Supervisor | — | — | — | — | Planned |
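The percentage columns are relative gains over Rusty's baseline score. A minimal sketch of that arithmetic (the function name is ours, not the harness's):

```python
# Illustrative helper: relative improvement of a specialist's eval score
# over the monolithic baseline, as reported in the scorecard.

def relative_gain(baseline: float, score: float) -> float:
    """Relative change from baseline, e.g. 0.40 for a +40% improvement."""
    return (score - baseline) / baseline

# Media: Sonnet specialist vs Rusty
print(round(relative_gain(2.52, 3.53) * 100))  # 40

# Productivity: Sonnet specialist vs Rusty
print(round(relative_gain(2.77, 2.89) * 100))  # 4
```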
## The Experiments
**EXP-001: Does Splitting a Monolithic Agent Into Specialists Improve Eval Scores?** Specialization improves media evals by 40%, and Gemma 26B matches Sonnet on the focused domain. We also discovered that eval infrastructure quality (harness mocks, scorer bugs) matters more than expected.
**EXP-002: Do Mock Evals Predict Real-World Agent Quality?** Real APIs score 5.6% higher than mocks, and 93% of evals are stable across 3 runs. Mock evals are a trustworthy, conservative lower bound. The only variance source is model non-determinism, not API instability.
**EXP-003: Does Agent Specialization Replicate for Productivity Tasks?** Specialization replicates, but with smaller gains (+4% vs +40%). We tested Gemma 26B, Gemma with an optimized prompt (+8.9%), and Qwen 3.5 (+11.5%). Qwen is the best local model for productivity but still 9% below Sonnet. The best model is domain-dependent.
**EXP-004: Cross-Domain Routing.** What happens when a request spans two agents? (“Find that movie and add a watch party to the calendar.”) This experiment tests routing architectures for multi-agent coordination.
**EXP-005: Failure Modes & Recovery.** How do agents handle errors, ambiguity, and wrong information? Production quality means graceful degradation.
**EXP-007: Rusty on Local Models (Capstone).** Can the supervisor agent run on a local model once specialized agents handle the domains? This is the final test of a fully local stack.
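EXP-002's stability figure (93% of evals stable across 3 runs) reduces to a per-case agreement check across repeated runs. A minimal sketch, assuming pass/fail verdicts per case; the helper below is illustrative, not the actual harness:

```python
# Illustrative stability check: a case counts as "stable" when every run
# produced the same pass/fail verdict for it.

def stable_fraction(runs: list[list[bool]]) -> float:
    """Fraction of eval cases with identical verdicts across all runs.

    `runs` is a list of runs; each run is a list of per-case booleans
    in the same case order.
    """
    per_case = zip(*runs)  # regroup verdicts by case instead of by run
    verdicts = [len(set(case)) == 1 for case in per_case]
    return sum(verdicts) / len(verdicts)

# Three runs over three cases: case 0 and case 2 are stable, case 1 flips.
runs = [
    [True, True, False],
    [True, False, False],
    [True, True, False],
]
print(stable_fraction(runs))  # 2 of 3 cases stable
```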
## What We’ve Learned So Far
- Specialization works, but the gains are domain-dependent. +40% for media (7 tools, short chains), +4% for productivity (20 tools, long chains). The benefit scales with how much you narrow the scope.
- Local models are viable for some domains, not all. Gemma 26B matches Sonnet on media (short chains, 7 tools) but drops 19% on productivity (long chains, 20 tools). Qwen 3.5 is better for productivity but worse for media. The best local model is domain-specific: use the eval data to decide, not assumptions.
- Your eval infrastructure matters as much as your agents. Harness quality, scorer bugs, and mock fidelity each independently affected our measurements. We found a 40% improvement hiding behind a scorer bug.
- Mock evals are a trustworthy lower bound. Validated against real APIs: real-world scores are 5.6% higher than mocks. It is safe to develop against mocks and validate against production periodically.
## Methodology
All experiments follow practitioner best practices from Hamel Husain, Eugene Yan, Braintrust, and applied-llms.org:
- Binary pass/fail scoring on specific dimensions (not vague 1-5 “quality”)
- Domain expert calibration (one person’s judgment drives the system)
- Data-first: look at outputs before defining criteria
- Production flywheel: traces → human judgment → eval cases → automation
- Different model families for generation vs judging (avoids self-enhancement bias)
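To make the first bullet concrete, here is a minimal sketch of binary pass/fail scoring on specific dimensions. The `EvalCase` shape, the dimension names, and the helpers are our assumptions for illustration, not the project's actual scorer:

```python
# Sketch: each scoring dimension is a named predicate that returns a hard
# True (pass) or False (fail), instead of a vague 1-5 "quality" rating.

from dataclasses import dataclass


@dataclass
class EvalCase:
    output: str         # the agent's transcript or tool-call trace
    expected_tool: str  # the tool a correct run should have called


# Hypothetical dimensions; a real suite would use dozens, each specific.
DIMENSIONS = {
    "called_expected_tool": lambda case: case.expected_tool in case.output,
    "no_hallucinated_tool": lambda case: "unknown_tool" not in case.output,
}


def score_case(case: EvalCase) -> dict[str, bool]:
    """Run every binary dimension against one eval case."""
    return {name: check(case) for name, check in DIMENSIONS.items()}


def pass_rate(results: list[dict[str, bool]]) -> float:
    """Fraction of all dimension checks that passed, across all cases."""
    checks = [ok for result in results for ok in result.values()]
    return sum(checks) / len(checks)
```

The payoff of binary dimensions is that every failure is attributable: a case fails *because* a named check failed, which makes calibration against one expert's judgment tractable.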
Full reference: Production Eval Reference