The coordination ceiling
A single agent scales until it runs out of context, capability, or wall-clock time. The answer is decomposition: break the work across multiple agents running in parallel or in sequence. The problem is that decomposition doesn't eliminate complexity — it relocates it. The boundary between agents is where complexity concentrates.
At that boundary, an agent must hand off state, trust a peer's output, or wait on a resource it does not own. None of those operations are automatic. None of them are safe by default. And in production systems, the failure mode is almost never a crash — it is a silent wrong answer returned with full confidence.
The engineering community has spent years cataloguing single-agent failure modes: hallucination, context drift, prompt injection. The multi-agent failure modes are less documented and more dangerous, because they compound. A race condition between two agents doesn't produce a model error — it produces a data error that looks like a model error until someone traces the write sequence.
Five patterns, named
Production multi-agent systems converge on a small set of coordination primitives. The in-depth companion works through each one in detail. Here they are named:
Concurrency caps. Unbounded fan-out kills the host. A system that spawns one agent per task with no cap will exhaust memory, hit rate limits, or starve the orchestrator of context budget before the first result returns. The cap is not a performance optimization — it is a correctness constraint.
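As a minimal sketch of a concurrency cap, assuming agents can be launched as asyncio coroutines: a semaphore bounds how many run at once, however many tasks are queued. The names run_agent, run_all, and MAX_CONCURRENT_AGENTS are hypothetical placeholders, and the cap of 4 is an arbitrary illustration, not a recommendation.

```python
import asyncio

MAX_CONCURRENT_AGENTS = 4  # assumed value; size it to memory and rate limits


async def run_agent(task_id: int) -> str:
    """Hypothetical stand-in for launching one agent and awaiting its result."""
    await asyncio.sleep(0.01)
    return f"task-{task_id}: done"


async def run_all(task_ids: list[int], cap: int = MAX_CONCURRENT_AGENTS) -> tuple[list[str], int]:
    """Run every task, never more than `cap` at once. Returns (results, peak concurrency)."""
    semaphore = asyncio.Semaphore(cap)
    in_flight = 0
    peak = 0

    async def bounded(task_id: int) -> str:
        nonlocal in_flight, peak
        async with semaphore:  # waits here while `cap` agents are already running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await run_agent(task_id)
            finally:
                in_flight -= 1

    # gather returns results in the same order as task_ids
    results = await asyncio.gather(*(bounded(t) for t in task_ids))
    return list(results), peak


if __name__ == "__main__":
    results, peak = asyncio.run(run_all(list(range(20))))
    print(len(results), "results; peak concurrency", peak)
```

All twenty tasks are accepted up front, but the semaphore is what decides how many agents exist at any moment, so the fan-out stays bounded regardless of queue length.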
Orchestrator/worker hierarchy. Every multi-agent system needs a single agent that owns the plan and routes work. Workers execute; the orchestrator decides. When that boundary blurs — when workers start making routing decisions — the system loses coherence and the failure is invisible until a deadline passes.
Supervisor gates. Some transitions require human sign-off before the next agent fires. Not because agents are unreliable, but because the downstream action is irreversible. A supervisor gate is not a bottleneck — it is the only defense against an autonomous chain that can't be recalled.
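One way to picture a supervisor gate, sketched under the assumption that approvals are recorded somewhere the orchestrator can check before firing an action. SupervisorGate, ApprovalRequired, and the action IDs are hypothetical names for illustration; a real gate would persist approvals and authenticate the reviewer.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class ApprovalRequired(Exception):
    """Raised when an irreversible action is attempted without sign-off."""


@dataclass
class SupervisorGate:
    approved: set[str] = field(default_factory=set)

    def approve(self, action_id: str, reviewer: str) -> None:
        # Assumption: called by a human-facing review tool, never by an agent.
        if not reviewer:
            raise ValueError("an approval needs a named reviewer")
        self.approved.add(action_id)

    def run(self, action_id: str, irreversible: bool, action: Callable[[], Any]) -> Any:
        if irreversible:
            if action_id not in self.approved:
                raise ApprovalRequired(f"{action_id} needs human sign-off")
            # One approval authorizes exactly one execution.
            self.approved.remove(action_id)
        return action()
```

Reversible actions pass straight through; only the irreversible ones block. Consuming the approval on use is a design choice in this sketch, so that a retry of the same action needs a fresh sign-off.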
File and resource reservations. Two agents editing the same file produce corruption, not collaboration. Advisory leases — reservations that claim a resource before work begins and release it on completion — are the multi-agent equivalent of a database transaction. Skipping them turns parallel execution into a race condition.
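A minimal sketch of an advisory lease, assuming all agents share a filesystem and agree to call the same helper before writing. It relies on os.open with O_CREAT | O_EXCL, which fails if the lease file already exists, so only one agent can claim a path at a time. The reserve helper and the ".lease" suffix are hypothetical conventions, not part of any particular framework.

```python
import os
from contextlib import contextmanager


class ReservationHeld(Exception):
    """Raised when another agent already holds the lease on a path."""


@contextmanager
def reserve(path: str, agent_id: str):
    """Claim `path` before work begins and release the claim on completion."""
    lease_path = path + ".lease"  # hypothetical naming convention
    try:
        # O_EXCL makes creation fail if the lease file already exists.
        fd = os.open(lease_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise ReservationHeld(f"{path} is already reserved") from None
    try:
        with os.fdopen(fd, "w") as lease:
            lease.write(agent_id)  # record the holder to help debugging
        yield lease_path
    finally:
        os.remove(lease_path)  # release on success or on error
```

The lease is advisory: it protects only against agents that also call reserve. This sketch has no expiry, so an agent that is killed mid-task leaves a stale lease behind; a production version would likely add a timeout or heartbeat.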
Workspace isolation. Agents running in the same environment will step on each other's state unless their working contexts are explicitly separated. CMUX_WORKSPACE_ID, per-agent temp dirs, scoped environment variables — isolation is infrastructure, not convention.
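To make the isolation idea concrete, here is a sketch that gives each agent process its own working directory, temp directory, and workspace ID. It assumes agents run as subprocesses and that CMUX_WORKSPACE_ID is simply read from the environment by whatever consumes it; how that variable is actually used is not specified here. AGENT_WORKDIR, isolated_env, and run_in_workspace are hypothetical names.

```python
import os
import subprocess


def isolated_env(agent_id: str, base_dir: str) -> dict[str, str]:
    """Build a per-agent environment with its own workdir, temp dir, and workspace ID."""
    workdir = os.path.join(base_dir, agent_id)  # assumes agent_id is a safe directory name
    tmpdir = os.path.join(workdir, "tmp")
    os.makedirs(tmpdir, exist_ok=True)
    env = dict(os.environ)  # copy, so the parent's environment is never mutated
    env.update({
        "CMUX_WORKSPACE_ID": agent_id,
        "TMPDIR": tmpdir,
        "AGENT_WORKDIR": workdir,  # hypothetical variable, for illustration only
    })
    return env


def run_in_workspace(agent_id: str, base_dir: str, argv: list[str]) -> subprocess.CompletedProcess:
    """Run one agent command inside its own directory with its own scoped variables."""
    env = isolated_env(agent_id, base_dir)
    return subprocess.run(
        argv,
        env=env,
        cwd=env["AGENT_WORKDIR"],
        capture_output=True,
        text=True,
        check=True,
    )
```

Calling run_in_workspace("agent-a", base, [...]) and run_in_workspace("agent-b", base, [...]) gives two processes that cannot see each other's relative paths or temp files by accident. A stricter setup might start from an empty environment and allow only named variables through, rather than copying the parent's.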
What breaks and why it matters
Skip the concurrency cap and the host runs out of memory (OOM). Skip the hierarchy and workers start countermanding each other's decisions. Skip supervisor gates and an irreversible action completes before anyone knew it was in flight. Skip reservations and a parallel write silently overwrites a peer's output. Skip workspace isolation and agents sharing an environment step on each other's state. These are the kinds of bugs that only surface in post-mortems.
The most insidious failure is trusting a peer agent's self-report. An orchestrator that asks a worker "did you finish?" and accepts "yes" without independent verification has no actual guarantee. Agents can report completion while their side effects are still in flight, or report success on a path that silently fell through an error handler. Verification must be deterministic — a byte-level check, a row count, a build result — not a conversational confirmation.
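A byte-level check of this kind can be sketched as follows, assuming the orchestrator knows the expected SHA-256 digest of the artifact in advance (for example, for a file copy or a reproducible build). The function names and the self_report parameter are hypothetical; the point is that the worker's own answer never enters the decision.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_completion(self_report: str, artifact: Path, expected_sha256: str) -> bool:
    """Decide whether a task finished, using only what is on disk.

    `self_report` is accepted but deliberately ignored: a worker saying "yes"
    is not evidence that its side effects landed.
    """
    if not artifact.is_file():
        return False
    return sha256_of(artifact) == expected_sha256
```

Where the expected bytes are not known ahead of time, the same shape applies with a different deterministic probe, such as a row count or a build exit status.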
These are not hypothetical failure modes. They are the failure modes of systems built without the patterns above. The patterns are not complex. What is complex is discovering them the first time in a production outage rather than in a design review.
Where to go from here
The in-depth companion works through each pattern in full: concrete implementation schemas, the failure mode each pattern prevents, and the stakes when coordination fails in a regulated industry such as banking, healthcare, or aviation, where an invisible wrong answer has legal consequences.
If you are building a multi-agent system and none of the five patterns above are in your design doc, that is the finding.