#testing
2 posts tagged with "testing".
-
Agent Evals in 2026: Beyond LLM-as-Judge
• 10 min read
Vibes-based scoring is finally dying. Trajectory eval, rubric eval, golden replay, and the test pyramid that production agent teams actually run.
-
Evaluating Agents: From Unit Tests to LLM-as-Judge Pipelines
• 8 min read
You can't ship agents you can't measure. The 2026 eval stack: task-level scoring, trajectory grading, LLM-as-judge with calibration, and the regression gates that catch silent quality drops.