Does Splitting a Monolithic Agent Into Specialists Improve Eval Scores?

I wanted to test a simple hypothesis: does splitting a monolithic AI agent into specialized agents improve quality? The answer was yes — by 40%. But I almost reported it as 3.6%, because my eval infrastructure had three separate bugs that each masked the real signal.

The Setup

I run a home server (Baymax) with an AI agent system built on Mastra. The primary agent, Rusty, handles everything — DevOps, email, calendar, web research, media requests — with 46 tools. I wanted to test whether extracting media-specific tasks into a dedicated agent (“Henchman 21”) with just 7 tools would improve eval scores.

The experiment had a second dimension: could the specialized agent run on Gemma 26B (local, free) instead of Claude Sonnet (cloud, paid) without quality loss?

I wrote 48 eval tasks — 34 general and 14 media-specific. The media evals included deliberately vague requests:

  • “I want that movie that had Sandra Bullock in it when she did that internet thing”
  • “What’s that movie where the guy is stuck in a time loop on groundhog day? Not the Bill Murray one”
  • “There was this show on HBO about like a tech company and they were in a garage?”

The Score Timeline

Each step shows what changed and the measured impact. Watch the score climb — then crash — then climb again.

2.79/5

Phase 1: Naive Baseline (broken harness)

48 evals on Rusty (Sonnet, 46 tools). Tool stubs returned "[dry-run] called with...": the agent picked the right tool but got nothing back, so multi-step chains died at step one. Scores looked bad, but the harness was the problem, not the agent.

3.05/5

Phase 2: Realistic Mocks (+9.3%)

Replaced generic stubs with realistic mock data — actual docker ps output, Tautulli JSON, Seerr search results. Same evals, same model. Multi-step reasoning chains now complete. Harness quality alone improved scores by 9.3%.
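The shape of that change can be sketched as follows. This is illustrative only: the tool name, signatures, and container details are made up, and only the stub-versus-mock pattern reflects the actual harness.

```typescript
// Phase 1 stub: every tool echoed a dry-run marker, so the agent had
// nothing to reason over and multi-step chains died at step one.
const dockerPsStub = (args: Record<string, unknown>): string =>
  `[dry-run] called with ${JSON.stringify(args)}`;

// Phase 2 mock: plausible docker ps output the agent can actually parse
// and feed into the next step of a chain. (Container rows are invented.)
const dockerPsMock = (): string =>
  [
    "CONTAINER ID   IMAGE                STATUS        NAMES",
    "a1b2c3d4e5f6   plexinc/pms:latest   Up 3 days     plex",
    "f6e5d4c3b2a1   nginx:1.25           Up 12 hours   reverse-proxy",
  ].join("\n");
```

The mock doesn't need to be real infrastructure state; it only needs enough structure that a correct second step (e.g. "restart the plex container") becomes possible.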

2.52/5

Phase 2b: Media Baseline (Rusty)

Just the 14 media evals from the realistic baseline. This is the number to beat — monolithic Rusty with 46 tools handling media requests on Sonnet.

2.61/5

Phase 3: The Split (bugged scorer)

Built Henchman 21: 7 media tools, a focused prompt. Measured +3.6% over Rusty. But tool selection scored 1.29/5, which made no sense: the agent was calling the correct tools, and I could see it in the logs.

2.55/5

Phase 4: Gemma 26B (bugged scorer)

Ran the same evals on Gemma 26B with an optimized prompt. Scores came out nearly identical to Sonnet's, but both were wrong: the scorer bug was depressing both equally.

3.53/5

Phase 5: Scorer Fix + H21 on Sonnet (+40.1% vs baseline)

Found the bug: a hardcoded list of 31 "valid" tool names. Henchman 21's tools weren't in it, so every correct call scored zero. One-line fix: derive the list from the harness tool definitions. The real specialization improvement was +40%, not +3.6%.

3.57/5

Phase 6: Gemma 26B with Fixed Scorer (+41.7% vs baseline)

Gemma 26B with 7 focused tools slightly outperforms Sonnet (3.57 vs 3.53). A free, local model matches a cloud API when the scope is narrow enough. The optimized prompt and small tool set give Gemma an edge.
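The reported deltas follow directly from the timeline scores, with the Phase 2b media number as the baseline:

```typescript
// Sanity-check the improvements against the Phase 2b media baseline.
const baseline = 2.52; // monolithic Rusty on the 14 media evals

const delta = (score: number): string =>
  ((score / baseline - 1) * 100).toFixed(1) + "%";

console.log(delta(2.61)); // Phase 3, bugged scorer   -> "3.6%"
console.log(delta(3.53)); // Phase 5, H21 on Sonnet   -> "40.1%"
console.log(delta(3.57)); // Phase 6, H21 on Gemma    -> "41.7%"

// The Phase 1 -> Phase 2 harness-only gain uses the full 48-eval scores:
console.log(((3.05 / 2.79 - 1) * 100).toFixed(1) + "%"); // -> "9.3%"
```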

The Scorer Bug

This is what a silent measurement error looks like:

tool_selection = 0 (All tool calls hallucinated: download_queue.
                     Expected: download_queue.)

The agent called download_queue. The expected tool was download_queue. The scorer scored it zero.

The root cause: a hardcoded array of 31 “valid” tool names, written when the system only had one harness. New tools added to new harnesses were never added to this list. The scorer treated any unknown tool as “hallucinated” — a model making up tool names that don’t exist. Except these tools did exist. The scorer just didn’t know about them.

The fix was trivial: derive the list dynamically from the harness tool definitions. The cost of not finding it sooner was a measurement error that shrank a 40% improvement into an apparent 3.6% and made specialization look pointless.
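A minimal sketch of the shape of the bug and the fix. All identifiers here are hypothetical, not the actual scorer's; only the pattern (hardcoded allowlist versus a list derived from harness definitions) matches the post.

```typescript
interface Harness {
  tools: Record<string, unknown>; // tool name -> definition
}

// Before: an allowlist hardcoded when only one harness existed. Tools added
// in later harnesses (like Henchman 21's) were never added here, so every
// correct call to them was scored as "hallucinated".
const VALID_TOOLS = new Set(["docker_ps", "check_email" /* ...29 more... */]);

const isHallucinatedBuggy = (tool: string): boolean => !VALID_TOOLS.has(tool);

// After: derive the allowlist from the harness definitions themselves, so
// tools added to any harness are covered automatically.
const isHallucinated = (tool: string, harnesses: Harness[]): boolean =>
  !harnesses.some((h) => tool in h.tools);
```

With the buggy check, a call to download_queue is flagged as hallucinated even though the tool exists in the media harness; with the derived list, the same call passes.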

What I Learned

1. Infrastructure errors compound. Three issues — dry-run stubs, mock quality, scorer bug — each independently depressed scores. Combined, they turned a 40% improvement into an apparent 3.6%.

2. Plausible-looking scores are the most dangerous. A 2.6 average on hard media disambiguation tasks looks reasonable. You only catch the bug by reading scorer output line by line.

3. Specialization works for the boring reason. 7 tools is a simpler decision space than 46. Less attention on irrelevant capabilities, more on the actual task.

4. Local models are viable for narrow domains. Gemma 26B reached parity with Sonnet on 7 media tools, so specialized agents can run on local inference at zero marginal cost.

5. The debugging arc is the story. I set out to test agent specialization and ended up learning more about eval infrastructure. Each fix revealed a larger effect than the last.


The eval methodology follows principles from our production eval reference — binary scoring, data-first labeling, and LLM-as-judge with cross-family validation.