Hierarchical Agent Systems: Supervisors, Workers, and Routing

The most common multi-agent topology in 2026 is hierarchical: one supervisor agent that decides what needs to happen, and a pool of worker agents specialized in particular kinds of work. It’s not the only pattern, but it’s the one that scales without falling over — and it matches how human teams organize the same kinds of work.

This post is the supervisor/worker pattern in practice: the shape, the patterns that hold up, the failure modes, and the variants worth knowing.

The basic shape

[Diagram: a Supervisor (routes · aggregates · retries; step budget · plan tracker) dispatches to four workers: Worker A, research (web · arxiv · docs); Worker B, code-gen (LSP · interpreters); Worker C, summarize (structured output); Worker D, verify (reviewer / critic). All of them sit above a shared state / artifacts store (versioned keys · trace_id · plan).]

The supervisor reads input, picks workers, dispatches work, reads results, decides what’s next. Workers do their narrow thing and return. Shared state lives in a persistent store the supervisor and (sometimes) workers read from.

This isn’t novel — it’s the org chart pattern applied to agents. The shape works for the same reasons org charts work: specialization scales, coordination is centralized, and accountability is clear.

Implementing the supervisor

The supervisor is itself an LLM agent. Its tools are the worker invocations:

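import operator
from typing import Annotated, TypedDict

from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages

# Assumed to be defined elsewhere: supervisor_llm, build_supervisor_prompt,
# parse_decision, WORKERS (name -> worker agent), and MAX_WORKER_CALLS.

def merge_dicts(left: dict | None, right: dict | None) -> dict:
    """Reducer: merge a node's artifact update into the existing artifacts."""
    return {**(left or {}), **(right or {})}

class State(TypedDict):
    messages: Annotated[list, add_messages]
    artifacts: Annotated[dict, merge_dicts]
    # operator.add appends each update; without a reducer the list would be
    # overwritten on every call and the step budget below would never trigger.
    completed: Annotated[list[str], operator.add]
    next_worker: str | None

async def supervisor(state: State):
    """Decide which worker should act next, or finish."""
    prompt = build_supervisor_prompt(state)
    response = await supervisor_llm.ainvoke(prompt)
    decision = parse_decision(response)
    if decision.action == "finish":
        return {"next_worker": None, "messages": [response]}
    return {"next_worker": decision.worker, "messages": [response]}

async def call_worker(state: State):
    """Run the worker the supervisor picked and store its output under its name."""
    name = state["next_worker"]
    worker = WORKERS[name]
    output = await worker.ainvoke({"context": state["artifacts"], "task": state["messages"][-1]})
    return {
        "artifacts": {name: output},
        "completed": [name],
    }

def route(state: State):
    # Stop when the supervisor finishes or the step budget is spent.
    if state["next_worker"] is None or len(state["completed"]) >= MAX_WORKER_CALLS:
        return END
    return "call_worker"

graph = StateGraph(State)
graph.add_node("supervisor", supervisor)
graph.add_node("call_worker", call_worker)
graph.add_edge(START, "supervisor")
graph.add_conditional_edges("supervisor", route, ["call_worker", END])
graph.add_edge("call_worker", "supervisor")
app = graph.compile()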

Things to notice:

  • The supervisor is small. It plans, routes, and finishes. It doesn’t do the work itself. Keeping it small keeps its prompt short and its decisions fast.
  • Workers don’t see each other. They take a context and a task; they return a result. The supervisor mediates.
  • Step budget at the supervisor level. MAX_WORKER_CALLS is the cap on how many times the supervisor can dispatch before ending. Prevents loops where the supervisor keeps re-routing.
  • Artifacts live in shared state. Each worker’s output is keyed and persisted; subsequent workers can read prior workers’ results from state["artifacts"].
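
The post does not show `parse_decision`, so here is one way it could look. This is a minimal sketch under assumptions: the supervisor is prompted to answer with a small JSON object, the `Decision` type and the `KNOWN_WORKERS` set are hypothetical names, and the function takes the raw text of the reply (for a chat model that would be something like `response.content`).

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical registry; the names mirror the diagram, not a real API.
KNOWN_WORKERS = {"research", "code-gen", "summarize", "verify"}


@dataclass(frozen=True)
class Decision:
    action: str                    # "dispatch" or "finish"
    worker: Optional[str] = None   # set only when action == "dispatch"
    reason: Optional[str] = None   # set when the reply had to be rejected


def parse_decision(raw: str) -> Decision:
    """Turn the supervisor's raw reply into a validated Decision.

    Assumes replies shaped like {"action": "dispatch", "worker": "research"}
    or {"action": "finish"}. Anything malformed, or naming a worker that is
    not registered, becomes a "finish" with a reason attached, so the graph
    never tries to route to a worker that does not exist.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return Decision("finish", reason="unparseable supervisor output")
    if not isinstance(data, dict):
        return Decision("finish", reason="supervisor output is not an object")

    action = data.get("action")
    if action == "finish":
        return Decision("finish")

    worker = data.get("worker")
    if action == "dispatch" and isinstance(worker, str) and worker in KNOWN_WORKERS:
        return Decision("dispatch", worker=worker)
    return Decision("finish", reason=f"unknown action or worker: {action!r}/{worker!r}")
```

Finishing on a bad reply is only one option; a real system might prefer to re-prompt the supervisor once before giving up.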

Worker design

Workers are agents themselves. Three rules that make the system maintainable:

  • Each worker has one job. “Research a topic,” “summarize a document,” “verify a claim.” If you can’t describe the worker in one short sentence, split it.
  • Workers are stateless across calls. Each invocation reads context from the input, produces output. No worker holds memory across supervisor dispatches.
  • Workers can fail predictably. A worker that can’t complete returns a structured {"status": "failed", "reason": "..."} rather than throwing. The supervisor handles failures explicitly.
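
The third rule can be enforced mechanically instead of trusting every worker author to remember it. A minimal sketch, assuming workers are plain async functions that take a payload dict; the `fails_predictably` decorator and the `summarize` worker are hypothetical names used for illustration.

```python
import asyncio
import functools
from typing import Awaitable, Callable

WorkerFn = Callable[[dict], Awaitable[dict]]


def fails_predictably(worker: WorkerFn) -> WorkerFn:
    """Wrap a worker so exceptions come back as structured failure results."""

    @functools.wraps(worker)
    async def wrapped(payload: dict) -> dict:
        try:
            result = await worker(payload)
        except Exception as exc:  # deliberately broad: nothing should escape
            return {"status": "failed", "reason": f"{type(exc).__name__}: {exc}"}
        return {"status": "success", "result": result}

    return wrapped


@fails_predictably
async def summarize(payload: dict) -> dict:
    # Hypothetical worker body; a real one would call a model here.
    text = payload["context"]["document"]  # raises KeyError if missing
    return {"summary": text[:40]}


if __name__ == "__main__":
    print(asyncio.run(summarize({"task": "summarize", "context": {}})))
```

Task cancellation still propagates, because `asyncio.CancelledError` does not derive from `Exception`; that is usually what you want when the supervisor aborts a run.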

The contract between supervisor and worker is the most important interface in the system. Use a typed schema:

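from typing import Literal

from pydantic import BaseModel, Field

class WorkerInput(BaseModel):
    task: str
    context: dict
    constraints: dict = Field(default_factory=dict)

class WorkerOutput(BaseModel):
    status: Literal["success", "failed", "needs_clarification"]
    result: dict | None = None
    reason: str | None = None  # explanation when status is not "success"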

Patterns that hold up

Plan-then-execute

The supervisor’s first action is to lay out a plan: which workers, in what order, and what each produces. The plan is stored in state. Subsequent supervisor calls track progress against the plan rather than just asking “what next?” This catches runs that go off the rails, because the agent’s later decisions visibly diverge from the stored plan.

async def supervisor(state):
    if "plan" not in state["artifacts"]:
        # First call: ask the model for a plan and parse it before using it.
        response = await llm.ainvoke(PLAN_PROMPT.format(task=state["messages"][-1]))
        plan = parse_plan(response)
        return {"artifacts": {"plan": plan}, "next_worker": plan.steps[0].worker}
    # Subsequent calls: check progress vs. plan, dispatch next step or replan.
    ...
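
The "check progress vs. plan" step elided above could be as simple as comparing the workers called so far against the plan's prefix. The following is a sketch under assumptions, not the post's implementation: `Step` and `check_progress` are hypothetical names, and a plan is taken to be an ordered list of steps that each name one worker.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    worker: str    # which worker runs this step
    produces: str  # artifact key the step is expected to write


def check_progress(
    plan: list[Step], completed: list[str]
) -> tuple[str, Optional[Step]]:
    """Compare the workers called so far against the stored plan.

    Returns ("dispatch", next_step) while execution matches the plan,
    ("finish", None) once every step has run, and ("replan", None) as soon
    as the call history stops being a prefix of the plan.
    """
    expected = [step.worker for step in plan[: len(completed)]]
    if completed != expected:  # also true when more calls ran than planned
        return ("replan", None)
    if len(completed) == len(plan):
        return ("finish", None)
    return ("dispatch", plan[len(completed)])


plan = [
    Step("research", "notes"),
    Step("summarize", "summary"),
    Step("verify", "verdict"),
]
print(check_progress(plan, ["research"]))              # matches the plan so far
print(check_progress(plan, ["research", "code-gen"]))  # diverged from the plan
```

A strict prefix check is the bluntest possible divergence test. It flags any deviation, including harmless retries, so a production version would likely tolerate repeated calls to the current step.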

Specialist + reviewer

Pair every doing worker with a reviewing worker. The doer produces a draft; the reviewer scores it; the supervisor decides whether to ship, retry, or escalate. This is the multi-agent version of pair programming, and it catches a lot of bad-faith model behavior.
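
The ship / retry / escalate decision can be sketched as a short loop. Everything here is an assumption made for illustration: the doer and reviewer are plain async callables, the reviewer returns a score between 0 and 1, and the threshold and attempt cap are arbitrary defaults.

```python
import asyncio
from typing import Awaitable, Callable

Doer = Callable[[str, str], Awaitable[str]]        # (task, feedback) -> draft
Reviewer = Callable[[str, str], Awaitable[float]]  # (task, draft) -> score in [0, 1]


async def do_and_review(
    task: str,
    doer: Doer,
    reviewer: Reviewer,
    threshold: float = 0.8,
    max_attempts: int = 3,
) -> dict:
    """Run doer then reviewer until a draft passes or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        draft = await doer(task, feedback)
        score = await reviewer(task, draft)
        if score >= threshold:
            return {"decision": "ship", "draft": draft, "score": score, "attempts": attempt}
        # Retry: hand the reviewer's verdict back to the doer.
        feedback = f"Previous draft scored {score:.2f}; revise it."
    # Out of attempts: hand the last draft to a human or a stronger model.
    return {"decision": "escalate", "draft": draft, "score": score, "attempts": max_attempts}


# Stand-in workers so the sketch runs without a model.
async def demo_doer(task: str, feedback: str) -> str:
    return f"{task} (revised)" if feedback else task


async def demo_reviewer(task: str, draft: str) -> float:
    return 0.9 if draft.endswith("(revised)") else 0.4


if __name__ == "__main__":
    print(asyncio.run(do_and_review("summarize the brief", demo_doer, demo_reviewer)))
```

With the stand-in workers the first draft is rejected and the second one ships. In a real system the reviewer would ideally use a different prompt or model than the doer, so that both do not share the same blind spots.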

Bounded recursion

Workers can themselves be hierarchical (a worker is a supervisor over its own narrower set of workers). Useful for genuinely deep tasks. Risky if you don’t cap the recursion — without a depth limit, you can produce a fractal of supervisors.
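
A depth cap is easiest to enforce when depth is an explicit argument passed down each level. This is a minimal sketch with assumed names: `MAX_DEPTH` is an arbitrary limit, and the two-way split stands in for whatever decomposition a real sub-supervisor would ask a model for.

```python
import asyncio

MAX_DEPTH = 2  # assumption: two levels of sub-supervisors is the limit


async def run(task: str, depth: int = 0) -> list[str]:
    """Supervise `task`, delegating to sub-supervisors until MAX_DEPTH."""
    if depth >= MAX_DEPTH:
        # At the cap, do the work directly instead of spawning another level.
        return [f"leaf:{task}"]
    # Stand-in for an LLM decomposition: always split into two subtasks.
    # Without the cap above, this recursion would never terminate.
    subtasks = [f"{task}.{i}" for i in range(2)]
    results = await asyncio.gather(*(run(sub, depth + 1) for sub in subtasks))
    return [leaf for group in results for leaf in group]


if __name__ == "__main__":
    print(asyncio.run(run("t")))
```

With a cap of 2 and a two-way split, one top-level task fans out to four leaf calls. The count grows exponentially with depth, which is why the cap also works as a cost control.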

Worker pool with load balancing

For high throughput, multiple instances of the same worker run in parallel. The supervisor dispatches; a queue picks an available worker. Looks like classic worker-pool architecture because that’s exactly what it is.
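
In-process, the same idea fits in a few lines of `asyncio`. This is a sketch only: the instance names and the `run_pool` helper are hypothetical, and `asyncio.sleep(0)` stands in for the model call. A production system would more likely use an external queue.

```python
import asyncio


async def worker_instance(name: str, queue: asyncio.Queue, results: dict) -> None:
    """One instance of a worker: pull jobs until cancelled."""
    while True:
        job_id, task = await queue.get()
        try:
            await asyncio.sleep(0)  # stand-in for the actual model call
            results[job_id] = f"{name} handled {task}"
        finally:
            queue.task_done()


async def run_pool(tasks: list[str], pool_size: int = 3) -> list[str]:
    """Dispatch tasks to whichever instance is free; return results in task order."""
    queue: asyncio.Queue = asyncio.Queue()
    results: dict[int, str] = {}
    instances = [
        asyncio.create_task(worker_instance(f"summarize-{i}", queue, results))
        for i in range(pool_size)
    ]
    for job_id, task in enumerate(tasks):
        queue.put_nowait((job_id, task))
    await queue.join()  # wait until every job has been processed
    for instance in instances:
        instance.cancel()
    await asyncio.gather(*instances, return_exceptions=True)
    return [results[i] for i in range(len(tasks))]


if __name__ == "__main__":
    for line in asyncio.run(run_pool(["a", "b", "c", "d", "e"], pool_size=2)):
        print(line)
```

Which instance handles which task is not fixed. The supervisor should depend only on the results, keyed by job, and never on the identity of the instance that produced them.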

Failure modes specific to hierarchical systems

These are the failure modes that hierarchy adds on top of ordinary single-agent debugging:

  • Supervisor hallucinated plan. The supervisor produces a plan referencing workers that don’t exist or capabilities they don’t have. Mitigation: enumerate available workers in the supervisor’s prompt; reject plans referencing unknown workers; constrain via structured output.
  • Workers diverge from supervisor’s intent. The supervisor delegated “summarize this technical brief,” the worker produced a marketing rewrite. Mitigation: tighten worker instructions; have the supervisor inspect outputs against intent.
  • Re-route loops. The supervisor keeps re-trying the same worker that keeps failing in the same way. Mitigation: track per-worker failure counts; force escalation after N failures.
  • Distributed context drift. Each worker’s output is summarized for the next worker, and summaries lose detail. Five hops in, the original task is unrecognizable. Mitigation: keep the original task verbatim in shared state; pass it to every worker.
  • Cost compounding. N worker calls × M tokens each. A hierarchical system can cost an order of magnitude more than a single agent doing the same work. Track cost per task; many “multi-agent” systems would be cheaper as well-designed single agents.
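
The re-route loop mitigation above is small enough to sketch. This assumes workers return the structured `status` field described earlier; `FailureTracker` and the limit of 2 are hypothetical, and the sketch counts consecutive failures, which is one reasonable reading of "N failures".

```python
from collections import Counter

MAX_FAILURES_PER_WORKER = 2  # assumed limit; tune per system


class FailureTracker:
    """Count consecutive failures per worker and force escalation at a limit."""

    def __init__(self, limit: int = MAX_FAILURES_PER_WORKER) -> None:
        self.limit = limit
        self.failures: Counter = Counter()

    def record(self, worker: str, output: dict) -> str:
        """Return "continue", "retry", or "escalate" for one worker result."""
        if output.get("status") == "success":
            self.failures[worker] = 0  # a success resets the streak
            return "continue"
        self.failures[worker] += 1
        if self.failures[worker] >= self.limit:
            return "escalate"
        return "retry"


tracker = FailureTracker()
failed = {"status": "failed", "reason": "timeout"}
print(tracker.record("research", failed))  # first failure
print(tracker.record("research", failed))  # second failure hits the limit
```

The supervisor would consult the tracker after each worker call and stop re-dispatching once it answers "escalate", regardless of what the model wants to try next.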

When NOT to use hierarchy

The pragmatic checks before going hierarchical:

  • The task fits in one agent’s planning capacity. Frontier models in 2026 can plan and execute 10-step tasks coherently. If your task is 4 steps, hierarchy is overhead.
  • The specializations are real. “Researcher” and “Writer” being separate agents only earns its keep if the researcher knows things the writer doesn’t (different tools, different prompts, different models). Otherwise it’s two LLM calls to do one LLM’s work.
  • You can debug it. Multi-agent traces are harder to read. If your observability isn’t already strong (Post 12), adding multi-agent complexity will hurt.

A single well-instrumented agent often beats a poorly-instrumented multi-agent system. The bar for going hierarchical is “I have a single agent that works, I’ve measured its limits, and I can articulate which specific limit a hierarchy would lift.”

Cross-framework hierarchies

A useful 2026 capability: the supervisor can dispatch to workers in different frameworks via A2A. A LangGraph supervisor calling CrewAI crews. A Google ADK supervisor calling a worker hosted as a Claude Managed Agent. The Post 10 protocols make this practical.

The pattern that works: workers are A2A endpoints; the supervisor doesn’t know or care what framework implements them. Same Agent Card, same call shape, regardless of internal implementation. This is the integration story for organizations that picked different frameworks for different teams and now want to compose them.

What “good” looks like

A 2026 hierarchical agent system worth shipping:

  • Small, focused supervisor with a clear plan-then-execute shape.
  • Workers with one job each and typed I/O.
  • Shared state with versioned artifacts; every worker’s output keyed and retrievable.
  • Step budget at the supervisor level; failure counts at the worker level.
  • Trace propagation across supervisor and worker boundaries.
  • Cost-per-task and worker-utilization dashboards.
  • A “single-agent baseline” you compare against, to keep the complexity earning its keep.

Next week we put the major frameworks side by side — LangGraph vs CrewAI vs ADK vs Managed Agents — with the honest comparison table.
