Bharat Bhavnasi San Francisco, CA, USA

#agents

54 posts tagged with "agents".

What Survives Compaction Is the Real Context Window

Jul 1, 2026 • 6 min read

June's research reframes context management: the discard step is now where both agent quality and safety quietly leak.
The Cheapest Agent Upgrade Is a Stop Condition

Jun 30, 2026 • 5 min read

Mid-2026 data keeps pointing the same way: bounding an agent's loop beats unleashing it. Turn limits and budgets buy more than a bigger model.
Ten Agents, Three Merges: June's Tooling Fixed Fan-Out, Not Review

Jun 29, 2026 • 5 min read

This month's agent tools made spawning parallel coding agents trivial. The constraint moved to the merge decision, and that doesn't parallelize.
Agent Security Moved to the Action Layer

Jun 28, 2026 • 6 min read

Runtime authorization — intercepting tool calls before they execute — is becoming the real security boundary for agents, and a standard is forming fast.
Computer-Use Agents Crossed Human Parity. They Still Click Too Much.

Jun 27, 2026 • 6 min read

Frontier models now beat the human baseline on OSWorld-Verified — but the benchmark just got rebuilt, and the architecture quietly shifted off pixels.
Your Agent Catches Everyone's Mistakes But Its Own

Jun 26, 2026 • 5 min read

New research says self-correction fails because of the role label on the claim, not the claim's content. The fix is structural, and cheaper than you think.
Your Agent's Benchmark Score Is an Experiment, Not a Fact

Jun 25, 2026 • 6 min read

Recent work shows a single agent leaderboard number is wrong three independent ways: it's noisy, it's overfit, and the judge measuring it is unreliable.
Agents Are Learning the Memory Policy You Used to Hand-Code

Jun 21, 2026 • 5 min read

A June 2026 wave moves the store/evict/retrieve decision from heuristics to a trained policy, and pushes consolidation into an offline sleep phase.
Injection Stopped Being a Single-Turn Problem

Jun 20, 2026 • 5 min read

Once agents got long-term memory, a one-time prompt injection could survive across sessions. Mid-2026 research shows both the attack and the defense moving up the stack.
The Agent Stopped Waiting to Be Asked

Jun 19, 2026 • 6 min read

June 2026 mainstreamed always-on agents that listen to event streams instead of prompts — and that one change breaks the trigger, trust, and latency models all at once.
The Context Window Grew a Memory Manager

Jun 18, 2026 • 5 min read

A June 2026 wave of papers shows pruning beats full context on accuracy and cost — and that eviction is becoming a deterministic, cache-aware system, not a summarize call.
The Skill Supply Chain Got Poisoned Before It Got Secured

Jun 17, 2026 • 5 min read

Agent skills are an executable supply chain that runs with your agent's full privileges — and the first wave of benchmarks shows our defenses see only half the attacks.
Optimization Is Moving From Weights to English

Jun 16, 2026 • 5 min read

Recent work turns skills, harnesses, and context into objects you can search over and benchmark — optimizing the English around a frozen model instead of the model.
Code Is the Action Space Now

Jun 15, 2026 • 6 min read

Frameworks are quietly replacing JSON tool calls with generated code. That collapses turns and tokens — and pushes isolation down to the single call.
The Harness Got a Name

Jun 15, 2026 • 5 min read

A new survey and Microsoft's BUILD 2026 release both landed on the same idea: agent capability is leaving the model and moving into the harness.
Skills Are the New SDK

Jun 12, 2026 • 5 min read

OpenAI killed its visual Agent Builder the same week Google shipped first-party skills. The agent capability layer just consolidated.
The Agent Doesn't Know When It's Failing

Jun 11, 2026 • 6 min read

New benchmarks measure calibrated refusal and premature self-stops, and the data says agent confidence signals are broken. Here's how to engineer around it.
The Agent Got Its Own Account

Jun 10, 2026 • 6 min read

In ten days of June 2026, agents got their own budget, their own permission manifest, and their own credentials. The agent is now a principal, not a feature.
The Environment Became the Curriculum: Agent RL's Synthesis Turn

Jun 9, 2026 • 6 min read

Agent RL's bottleneck moved from data to reward to the environment itself. The newest research tries to take humans out of environment-building entirely.
Data Agents: The Hard Part Was Never the SQL

Jun 8, 2026 • 5 min read

Anthropic and OpenAI independently shipped internal data agents and reached the same conclusion: discovery beats generation, and structure beats access.
The Trajectory Became the Cost Center

Jun 6, 2026 • 5 min read

A wave of mid-2026 research stopped trying to make the model cheaper and started compressing the agent's own trajectory — at the observation, action, and skill level.
Compiling the Agent Loop Away: Late May's Anti-Orchestration Turn

Jun 5, 2026 • 6 min read

Three late-May 2026 papers attack the agent loop itself — compiling it into weights, speculating through idle time, and letting agents rewrite their own source.
The Agent Writes the Orchestrator Now: Parallelism's Late-May Turn

Jun 4, 2026 • 5 min read

Late May 2026 made parallel fan-out the agent's main scaling axis — orchestration moved into code, tests became the gate, and the meter started running.
State, Shells, and Shortcuts: The Agent Stack Spent Late May Fixing Its Foundations

Jun 3, 2026 • 6 min read

MCP went stateless, a wave of coding-agent RCEs landed, and a new benchmark measured reward hacking — the three properties that make an agent useful all became liabilities.
Agents Are Writing Their Own Skills — and Retrieval Is the New Bottleneck

Jun 2, 2026 • 5 min read

May 2026's skill-library research shows agents can now accumulate reusable capabilities, but retrieving and adopting them is harder than generating them.
Where the Agent Loop Runs: The Control-Plane Split of May 2026

Jun 1, 2026 • 5 min read

The week of May 19 separated the agent loop from tool execution. Whoever hosts the loop now owns your latency, reliability, and lock-in.
Capability Went Up. Reliability Didn't. That's the Agent Problem Now.

May 31, 2026 • 5 min read

New work argues agents are measured wrong: accuracy keeps climbing while consistency, robustness, and predictability barely move. The fix is architectural.
Verification Is Becoming the Agent's Substrate

May 30, 2026 • 5 min read

The agents scaling fastest in mid-2026 share one trait: their output lands in a column a machine can check. The verifier, not the model, is the moat.
Where the Reward Goes: Agent RL's Reward-Design Split

May 29, 2026 • 5 min read

Recent papers disagree on whether to reward agents per-turn or only at the end — and the answer reveals where RL for agents is actually headed.
The Agent Benchmark Reckoning of May 2026

May 28, 2026 • 6 min read

STATE-Bench, DeepSWE, Agent Island, SWE-bench Live: a wave of new evals exposes how much the old leaderboards were inflating.
The Agent Is a Workload, Not a Script

May 27, 2026 • 6 min read

Mid-May 2026 quietly shipped the operations layer for agents — versioned environments, runtime drain, behavior-based evals, portable skills.
The Agent Trust Stack Just Got Built: Three Weeks in May 2026

May 26, 2026 • 6 min read

Skill cards, self-hosted sandboxes, MCP tunnels, computer-use verifiers, and a Five Eyes warning all landed in twenty-one days. The boring perimeter around capable agents finally has shape.
The Browse-Click-Compare Web Is Ending. Here's What Replaces It.

May 26, 2026 • 10 min read

Twenty minutes of tabs vs. five minutes of prompt. The traditional web wasn't designed for humans — it was designed for mice. The agent-native web is quietly dismantling the parts that never made sense.
Long-Horizon Agents: When Tasks Take Hours

May 21, 2026 • 11 min read

Six-hour agent runs are now real. The harness — checkpoints, durable state, recovery — matters more than the model. A field guide to the long-running pattern.
Skills, Connectors, Subagents: Anthropic's 3-Layer Agent Template

May 11, 2026 • 10 min read

Anthropic just shipped 10 financial services agent templates. The interesting part isn't the templates — it's the three-layer architecture quietly becoming the standard for enterprise agents.
Code with Claude 2026: Five Things That Actually Matter

May 7, 2026 • 9 min read

Anthropic shipped a lot on May 6 — Managed Agents updates, Dreaming, Outcomes, Multi-agent Orchestration, and a SpaceX partnership. The signal-to-noise filtered down to five things that change how you build.
Agent Observability in 2026: Tracing, Replay, and Why OTel Won

May 1, 2026 • 9 min read

Langfuse got acquired by ClickHouse. Helicone hit maintenance mode. OpenTelemetry standardized LLM tracing. The observability stack for agents reshuffled in three months. Here's what it looks like now.
Agent Evals in 2026: Beyond LLM-as-Judge

Apr 24, 2026 • 10 min read

Vibes-based scoring is finally dying. Trajectory eval, rubric eval, golden replay, and the test pyramid that production agent teams actually run.
Cascaded vs Fused Voice Agents: A Builder's Perspective on Architecture Choices

Apr 17, 2026 • 16 min read

Deep dive into voice agent architectures. Why cascaded models give you control and fused models trade complexity for naturalness. What we're learning from shipping production agents at scale.
Sandbox Execution: Code Interpreters Grew Up

Apr 10, 2026 • 11 min read

Firecracker microVMs, gVisor containers, persistent workspaces, and the $24M Series A nobody quite expected. The sandbox layer beneath every serious agent — and how to pick the right one.
How to Make Voice Agents Sound Human: A Practical Guide to Realistic Speech Prompting

Apr 3, 2026 • 9 min read

Why your cascaded voice agent sounds robotic — and how to fix it with concrete examples, SSML pause patterns, emotion tags, and personality-as-behavior prompting techniques.
Cost-Optimized Agent Architectures: Cutting Spend 10x Without Losing Quality

Mar 26, 2026 • 9 min read

Caching, routing, distillation, and per-task model selection. The four moves that take a $0.40/task agent to $0.04/task without anyone noticing the difference.
Web Research Agents: The State of the Art, March 2026

Mar 19, 2026 • 10 min read

Operator died, Browser Use became the default substrate, Manus shipped at scale, and the gap between demo and reliable production narrowed considerably. A field report.
Deep Agents: Planner / Executor / Critic Becomes the Default

Mar 12, 2026 • 10 min read

The three-role pattern that powered Manus, then LangChain Deep Agents, then half the production agents shipping in early 2026. Why it works, when it doesn't, and how to actually build one.
Context Engineering: The Discipline That Makes AI Agents Actually Work

Feb 25, 2026 (updated) • 16 min read

A deep dive into context engineering — the techniques that separate toy demos from production AI agents. Covers compaction, offloading, isolation, caching, and prioritization with real examples from Manus, Claude Code, and Devin.
Training a Virtual Company: A Deep Dive into Multi-Agent Reinforcement Learning with OpenEnv & Unsloth

Mar 7, 2026 • 29 min read

How exploring LLM fine-tuning led to building a Gymnasium-compatible RL environment where 7 LLM-powered agents run a company — trained with GRPO + LoRA on Qwen 2.5 14B — and what we learned about reward design, emergent collaboration, and the future of agentic AI.
MCP Has a Tools Problem — And Code Mode Might Fix It

Feb 24, 2026 • 7 min read

AI agents are drowning in tools. The more APIs you connect via MCP, the worse your agent performs. Here's why, and what Code Mode changes.
The AI App Paradox: Why We're Drowning in Tools but Starving for Experience

Feb 20, 2026 • 2 min read

We've been so obsessed with what AI can do that we forgot about how it feels to use it. The AI experience layer is the next frontier — not the model, not the capabilities.
Tool Selection at Scale: When Your Agent Has 200 Tools

Feb 12, 2026 • 9 min read

Past ~30 tools, agent reliability falls off a cliff. Past ~100, it's chaos. Here's the actual engineering — RAG-over-tools, semantic routing, dynamic loading, and namespacing — that production teams ship to stay sane.
Sub-Agents Are the New Microservices

Feb 5, 2026 • 9 min read

The orchestrator-worker pattern that took over agent design in late 2025 is the same pattern that took over backend design in 2014. The wins are real. So are the failure modes.
I Tested Every Major Open-Source AI Agent SDK So You Don't Have To

Jan 29, 2026 • 2 min read

A comprehensive hands-on comparison of seven open-source AI agent frameworks — which one should you actually use?
Choosing an Agent Framework in 2026: A Decision Tree

Jan 22, 2026 • 9 min read

Six serious frameworks, four orchestration styles, and one tired question I keep getting asked. Here's the decision tree I actually use.
MCP Just Crossed the Inflection Point

Jan 15, 2026 • 7 min read

Fourteen months in, the Model Context Protocol stopped being a curiosity and started being plumbing. Here's what changed over the holidays — registries, governance, and the first scaling pains.
JARVIS: Building an Agentic AI System for IoT Control

Jan 10, 2026 • 2 min read

Open-sourcing my childhood dream — an AI agent that understands context, makes decisions, and controls connected devices just like JARVIS.

© 2025 Bharat Bhavnasi