Bharat Bhavnasi, San Francisco, CA, USA

#testing

2 posts tagged with "testing".

Agent Evals in 2026: Beyond LLM-as-Judge

Apr 24, 2026 • 10 min read

Vibes-based scoring is finally dying. Trajectory eval, rubric eval, golden replay, and the test pyramid that production agent teams actually run.

Evaluating Agents: From Unit Tests to LLM-as-Judge Pipelines

Mar 30, 2026 • 8 min read

You can't ship agents you can't measure. The 2026 eval stack — task-level scoring, trajectory grading, LLM-as-judge with calibration, and the regression gates that catch silent quality drops.