EXP-003: Does Agent Specialization Replicate for Productivity Tasks?

The Question

In EXP-001, splitting media tools into a focused agent improved eval scores by 40%, and Gemma 26B matched Sonnet on that narrow domain. Does this pattern hold for productivity — email, calendar, contacts, family management?

The Evals

18 tasks themed around real suburban family life in Downers Grove, IL:

  • “There should be an email from the school about picture day. Can you add it to our calendar?” (email → read → calendar)
  • “I got an email from the basketball league with a link to the season schedule. Can you grab the dates?” (email → read → browser → calendar)
  • “Can you find that email about the thing at school next Friday?” (deliberately vague)
  • “Email the team parents that Saturday’s soccer game is cancelled” (compose + send)
  • “Can the kids do swim lessons on Wednesdays at 4? Check if we have anything” (conflict check)

These are harder than media evals — multi-step chains requiring email search → read → extract details → sometimes follow links → create calendar events. Following Hamel’s eval methodology, each is scored on specific dimensions (tool selection, completeness, actionability) rather than vague quality ratings.
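The dimension scoring described above can be sketched as a small data structure. This is a minimal illustration, not the actual harness: the class name, field names, and the unweighted mean are all assumptions.

```python
# Sketch of per-dimension eval scoring. EvalResult and its fields are
# hypothetical; the real harness's dimensions and weighting may differ.
from dataclasses import dataclass, field

@dataclass
class EvalResult:
    task: str
    # Each dimension is scored on a 1-5 scale, rather than one vague
    # overall quality rating.
    dimensions: dict = field(default_factory=dict)

    def overall(self) -> float:
        # Unweighted mean across dimensions (an assumption).
        return sum(self.dimensions.values()) / len(self.dimensions)

result = EvalResult(
    task="Add picture day from the school email to our calendar",
    dimensions={"tool_selection": 4, "completeness": 2, "actionability": 3},
)
print(round(result.overall(), 2))  # 3.0
```

Scoring named dimensions separately is what makes the later diagnosis possible: "tool selection is fine, completeness suffers" is invisible in a single quality number.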

Score Timeline

2.77/5 · Rusty Baseline (46 tools, Sonnet)

18 productivity evals on the monolithic agent. Similar to the media baseline (2.52). Tool selection is decent (3-5) but completeness suffers — the agent selects the right tools but doesn't always complete the full chain.

2.89/5 · Sheila Split (20 tools, Sonnet), +4.3%

Focused productivity agent with family-aware prompt. Modest improvement — much smaller than media's +40%. The tool set only shrank from 46→20 (vs 46→7 for media). The decision space is still relatively large.

2.35/5 · Sheila (20 tools, Gemma 26B), -18.7% vs Sonnet

Gemma does NOT reach parity here. Unlike media (where it matched Sonnet), Gemma drops significantly on productivity tasks. Multi-step email→read→calendar chains break down.

2.56/5 · Sheila (Gemma 26B, optimized prompt), +8.9% vs unoptimized

Added few-shot examples of multi-step chains (search→read→calendar), repeated constraints, explicit "do not skip the read step" rules. Closes some of the gap but doesn't reach Sonnet.
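The prompt changes described above can be sketched as a system-prompt fragment. The wording and tool names below are illustrative reconstructions, not the actual prompt used in the experiment.

```python
# Illustrative reconstruction of the optimized Gemma prompt: explicit chain
# rules plus a few-shot example of a full search->read->calendar trace.
# Tool names (search_email, read_email, create_calendar_event) are assumptions.
CHAIN_RULES = """\
You MUST complete every step of a chain before answering:
1. search_email -> 2. read_email -> 3. create_calendar_event
Do not skip the read step: never create an event from a search snippet alone.
"""

FEW_SHOT = """\
User: There should be an email from the school about picture day. Add it to our calendar.
Assistant: [tool] search_email(query="picture day school")
Assistant: [tool] read_email(id="msg_123")
Assistant: [tool] create_calendar_event(title="Picture Day", date="2025-10-14")
"""

SYSTEM_PROMPT = CHAIN_RULES + "\nExample:\n" + FEW_SHOT
```

The design choice is repetition on purpose: the constraint appears both as a rule and as a worked trace, since smaller models tend to follow demonstrated chains more reliably than stated ones.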

2.62/5 · Sheila (Qwen 3.5), +11.5% vs Gemma unoptimized

Qwen 3.5 handles multi-step chains better than Gemma. Best local model for productivity, but still 9.3% below Sonnet. The gap is in completeness and synthesis, not tool selection.

Full Model Comparison

| Config | Productivity (18 evals) | Media (14 evals) |
| --- | --- | --- |
| Rusty baseline (Sonnet, 46 tools) | 2.77 | 2.52 |
| Specialist (Sonnet) | 2.89 | 3.53 |
| Specialist (Gemma 26B) | 2.35 | 3.57 |
| Specialist (Gemma, optimized prompt) | 2.56 | n/a |
| Specialist (Qwen 3.5) | 2.62 | 3.17 |

Key pattern: Gemma wins on media (short chains, 7 tools). Qwen wins on productivity (long chains, 20 tools). Neither matches Sonnet on productivity. Model choice should be domain-specific.
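The "model choice should be domain-specific" conclusion amounts to a routing table. A minimal sketch, assuming hypothetical model identifiers and a conservative default:

```python
# Sketch of domain-specific model routing implied by the comparison above.
# Model names and the table contents are illustrative, not a production config.
BEST_MODEL = {
    "media": "gemma-26b",      # short chains, 7 tools: local model reaches parity
    "productivity": "sonnet",  # long chains, 20 tools: needs the stronger model
}

def pick_model(domain: str) -> str:
    # Default to the most capable model for domains without eval coverage yet.
    return BEST_MODEL.get(domain, "sonnet")

print(pick_model("media"))         # gemma-26b
print(pick_model("productivity"))  # sonnet
```

The table is exactly what the per-domain evals produce: re-run the evals when a new local model ships, and update the routing entry only when the scores justify it.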

Why Productivity Is Different

1. Longer chains. Media is search→identify→request (2-3 steps). Productivity is search email→read→extract details→follow link→create event (4-5 steps). This aligns with what we found researching Gemma’s characteristics: it “degrades significantly on 3+ sequential tool calls.” The data confirms it.

2. Fewer tools help less. Going from 46→7 tools (media) is an 85% reduction in decision space; going from 46→20 tools (productivity) is only 57%. Productivity inherently needs more tools: you can't email without Gmail tools, can't schedule without Calendar tools, can't research without browser tools. Following Braintrust's guidance on using eval data to drive architecture decisions, the data says productivity needs a more capable model, not just a smaller tool set.
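The reduction figures quoted above, computed directly:

```python
# Tool-count reductions from the monolithic agent (46 tools) to each specialist.
def reduction(before: int, after: int) -> float:
    """Percentage shrink in the tool set (a rough proxy for decision space)."""
    return (before - after) / before * 100

print(round(reduction(46, 7)))   # media specialist: 85
print(round(reduction(46, 20)))  # productivity specialist: 57
```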

3. Completeness requires synthesis. Media tasks have clear outputs (“here’s what I found, want me to request it?”). Productivity tasks require synthesis — “here’s the school newsletter summary with 3 action items and 2 important dates.” Gemma produces less complete summaries.

Devil’s advocate: The args_accuracy dimension scored 0 across nearly every eval in all three phases. This is likely a scorer calibration issue (expected argument names don’t match actual tool schemas), not a real quality signal. If args_accuracy is consistently wrong, it’s deflating all scores equally — the relative comparisons are still valid, but the absolute numbers are lower than they should be.
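One way to confirm the suspected calibration issue is to diff the scorer's expected argument names against the actual tool schemas. A sketch, with hypothetical schema shapes and argument names:

```python
# Sanity check for the args_accuracy scorer: flag expected argument names
# that don't exist in the tool's schema. A non-empty result means the scorer
# is penalizing the agent for arguments the tool never accepts.
# The schema layout and names here are assumptions for illustration.
def misaligned_args(expected_args: set, tool_schema: dict) -> set:
    return expected_args - set(tool_schema["parameters"])

schema = {"name": "create_calendar_event", "parameters": ["title", "date", "time"]}
bad = misaligned_args({"event_title", "date"}, schema)
print(bad)  # {'event_title'}: the scorer expects a name the tool doesn't define
```

If this check fires across the eval set, the args_accuracy zeros are a scorer bug, which supports the argument that the relative comparisons survive even though the absolute scores are deflated.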

What This Means

Not every domain can go local. EXP-001 made local models look like a silver bullet. EXP-003 shows the boundary: domains with short chains and small tool sets (media: 7 tools, 2-3 steps) work great on Gemma. Domains with longer chains and more tools (productivity: 20 tools, 4-5 steps) don’t — yet.

Specialization benefits are domain-dependent. +40% for media, +4% for productivity. The benefit correlates with how much the tool set shrinks and how focused the domain is. This is useful data for deciding which domains to specialize next.

Sheila should stay on Sonnet until local models improve on multi-step reasoning. The economics still work — keep the expensive model for the hard tasks, use cheap local models for the easy ones. Per Eugene Yan’s eval-driven development approach, let the scores tell you when a cheaper model is ready.
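"Let the scores tell you" can be made mechanical with a promotion threshold. The 5% tolerance below is an illustrative choice, not a number from the experiment:

```python
# Sketch of an eval-driven promotion rule: move a domain to a local model only
# when its score lands within a tolerance of the Sonnet score on the same evals.
def ready_to_go_local(local_score: float, sonnet_score: float,
                      tolerance: float = 0.05) -> bool:
    return local_score >= sonnet_score * (1 - tolerance)

print(ready_to_go_local(3.57, 3.53))  # media / Gemma 26B: True
print(ready_to_go_local(2.62, 2.89))  # productivity / Qwen 3.5: False
```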


Part of the agent quality experiment series. Eval methodology follows our production eval reference.