- Does Splitting a Monolithic Agent Into Specialists Improve Eval Scores?
We tested whether extracting media tools from a 46-tool agent into a 7-tool specialist would improve quality, and whether Gemma 26B could replace Sonnet on the focused domain.
- EXP-002: Do Mock Evals Predict Real-World Agent Quality?
We ran Henchman 21 against real media APIs 42 times to test whether our mock-based eval scores hold up under production conditions.
- Building a Production Eval System for AI Agents
What we learned building a quality measurement system for a multi-agent AI, drawing on practitioner wisdom from Hamel Husain, Eugene Yan, Braintrust, and applied-llms.org.