Sumit Gupta
  • Proving Agent Quality With Data Apr 8, 2026

    A series of experiments testing whether specialized AI agents on local models can match cloud API quality for personal task management.

  • Does Splitting a Monolithic Agent Into Specialists Improve Eval Scores? Apr 8, 2026

    We tested whether extracting media tools from a 46-tool agent into a 7-tool specialist would improve quality, and whether Gemma 26B could replace Sonnet on the focused domain.

  • EXP-002: Do Mock Evals Predict Real-World Agent Quality? Apr 8, 2026

    We ran Henchman 21 against real media APIs 42 times to test whether our mock-based eval scores hold up in production conditions.

  • EXP-003: Does Agent Specialization Replicate for Productivity Tasks? Apr 8, 2026

    The 40% improvement from media specialization only partially replicates for email/calendar. And Gemma 26B hits a wall on multi-step productivity chains.

  • Building a Production Eval System for AI Agents Apr 7, 2026

    What we learned building a quality measurement system for a multi-agent AI, drawing on practitioner wisdom from Hamel Husain, Eugene Yan, Braintrust, and applied-llms.org.

© 2026 Sumit Gupta