Before you wire three agents together because “agents are powerful,” answer one question: have you demonstrated that a single agent or plain automation cannot do the job? Most multi-agent designs fail this test, and every agent you add multiplies coordination complexity rather than capability.

A multi-agent system is a set of semi-autonomous components, each with its own goal, tools, and decision loop, that coordinate to solve a problem no single component owns end to end. The appeal is obvious: decompose a hard task into specialists, let a planner route between them, and watch the whole exceed the sum. The reality is that coordination is itself a hard engineering problem, and the cost of getting it wrong scales faster than the benefit of getting it right.

This is a methodology article, not a survey of frameworks. The question we keep returning to with teams building agentic systems is not “which orchestration library?” but “should this problem be multi-agent at all, and if so, how do we keep the coordination layer from becoming the thing that fails?”

When Does a Problem Genuinely Need Multi-Agent Architecture?

The divergence between the disciplined and the naive approach happens at a single decision: committing to multi-agent before proving that simpler designs are insufficient.

The default should be the opposite of what most demos suggest. Plain deterministic automation handles a surprising share of “agentic” use cases. A single well-scoped agent with a good tool set handles most of the rest. Multi-agent earns its keep only when the problem genuinely requires distributed reasoning: independent sub-problems that need different context, different tools, or genuinely concurrent progress.

Three structural signals justify the jump:

- Irreducible context separation. Two sub-tasks need such different working context that stuffing both into one agent’s prompt degrades both. A code-review agent and a security-audit agent reason over the same diff but with incompatible attention budgets.
- Genuine concurrency. Sub-tasks can make progress in parallel and the latency win is real, not “we ran them in parallel and then waited for the slowest anyway.”
- Independent failure and retry domains. One specialist can fail, retry, or be swapped without re-running the others. If a failure anywhere forces a full restart, you have one logical agent wearing three costumes.

If none of these hold, the honest answer is that you have a single-agent problem with extra coordination tax. We treat this as a feasibility gate, the same way we’d assess any generative AI use case for technical feasibility before committing architecture. The boundary matters because each additional agent adds coordination complexity: the ROI of multi-agent is only positive when the problem genuinely demands it.

Multi-Agent Justification Checklist

Run a candidate design through this before committing. Two or more “no” answers is a strong signal to collapse the design.

| Question | If “no” |
| --- | --- |
| Have you built and measured a single-agent baseline? | Build it first; you have no comparison point |
| Do sub-tasks need irreducibly separate context? | Merge them into one agent |
| Is there real concurrency, with a measured latency win? | Sequence the calls in one agent |
| Can one agent fail and retry without restarting the others? | Your “agents” share a failure domain; collapse them |
| Can you monitor each agent’s decisions independently? | You can’t debug it in production; redesign or simplify |
| Does the coordination logic fit in a diagram a new engineer reads in five minutes? | The orchestration is the risk; simplify before shipping |

This is an observed pattern across the agentic builds we review, not a benchmarked threshold, but the asymmetry is consistent: teams almost never regret collapsing an over-decomposed system, and frequently regret the reverse.

How Do the Agents Actually Coordinate?

Coordination is where the design lives or dies, and it reduces to two questions: how is responsibility decomposed, and how do agents communicate?

Responsibility decomposition assigns each agent a bounded job with a clear contract: inputs it accepts, outputs it guarantees, and, explicitly, the decisions it is not allowed to make. Vague responsibility is the root cause of most coordination failures we see. When two agents both believe they own “deciding when the task is done,” you get either premature termination or an infinite handoff loop.

Inter-agent communication comes in a few recognizable shapes, and the choice constrains everything downstream:

- Orchestrator-worker: a central planner decomposes the task, dispatches to specialists, and integrates results. Easiest to monitor because all decisions route through one place; the orchestrator is also the single point of failure and the latency bottleneck.
- Sequential pipeline: agents form a chain, each consuming the previous one’s output. Predictable and debuggable, but errors compound down the chain and there’s no concurrency.
- Blackboard / shared state: agents read and write a common workspace and act when conditions are met. Flexible, but the shared state becomes a coordination hazard: race conditions, stale reads, and emergent behaviour that’s hard to reproduce.
- Peer-to-peer negotiation: agents talk directly and reach agreement. Maximally flexible, minimally observable; reserve it for problems that genuinely cannot be centrally planned.

In practice the orchestrator-worker pattern is the right default for production LLM-based systems precisely because observability is built into the topology. You can read every routing decision in one log. The more decentralized the communication, the more emergent the behaviour, and the harder it is to debug when coordination drifts.

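To make the orchestrator-worker topology concrete, here is a minimal sketch in Python. Everything in it is hypothetical and chosen for illustration: the `Orchestrator` class, the fixed `plan` method, and the workers, which are plain functions standing in for LLM-backed agents. It shows only the property the text relies on: every dispatch and result passes through one place and lands in one log under a single request ID.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in: a real worker would wrap an LLM call and its tools.
Worker = Callable[[str], str]


@dataclass
class Orchestrator:
    """Central planner: dispatches sub-tasks and records every routing decision."""

    workers: Dict[str, Worker]
    log: List[dict] = field(default_factory=list)

    def plan(self, task: str) -> List[Tuple[str, str]]:
        # Assumption: a fixed plan that sends the task to every worker in order.
        # A real planner would decide per task which workers to call.
        return [(name, task) for name in self.workers]

    def run(self, task: str) -> Dict[str, str]:
        request_id = uuid.uuid4().hex  # links all log entries for this request
        results: Dict[str, str] = {}
        for name, subtask in self.plan(task):
            self.log.append(
                {"request_id": request_id, "event": "dispatch", "worker": name}
            )
            output = self.workers[name](subtask)
            self.log.append(
                {"request_id": request_id, "event": "result", "worker": name,
                 "chars": len(output)}
            )
            results[name] = output
        return results
```

Calling `Orchestrator(workers={...}).run("some task")` returns one result per worker, and `log` then holds a dispatch and a result entry for each of them. The sketch runs workers sequentially for simplicity; concurrency, retries, and result integration are deliberately left out.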
The choice of topology connects directly to the broader agentic AI design that separates planning from execution: a multi-agent system is agentic architecture with the planning loop made explicit across components.

How Do Multi-Agent Systems Break in Production?

The failure modes are specific, and they are not the same as single-agent failures. A single agent fails by hallucinating or looping. A multi-agent system fails by coordinating wrongly: the individual agents can each be behaving correctly while the system as a whole goes off the rails.

- Failure cascades. Agent A produces a subtly wrong output, agent B treats it as ground truth and amplifies it, agent C builds on B. By the time the error surfaces, three agents have compounded it and the root cause is buried. Without explicit validation at each handoff, confidence propagates faster than correctness.
- Deadlocks and livelocks. Two agents each wait for the other to act (deadlock), or they pass a task back and forth without converging (livelock). This is almost always a responsibility-decomposition defect: overlapping ownership of a termination decision.
- Behavioural drift. The system worked in testing and slowly degrades in production. The usual cause is that one agent’s behaviour shifted (a model version was bumped, a prompt was tuned, a tool’s output format changed) and the change rippled through the coordination layer in ways nobody traced. This is a coordination-specific instance of the broader GenAI failure patterns that sink generative AI projects, amplified because the blast radius is the whole system rather than one component.
- Cost and latency blowup. Every agent call is an inference call. A naive multi-agent loop can fan out into dozens of LLM calls per request, and the cost is multiplicative, not additive. We’ve seen designs where adding a “reviewer agent” tripled per-request cost for a marginal quality gain, a textbook over-engineering outcome.

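Two of the failure modes above, cascades and runaway fan-out, can be contained with very little code. The sketch below is an illustration under stated assumptions, not a reference implementation: `validate_handoff`, `CallBudget`, `run_pipeline`, and both exception types are hypothetical names, agents are plain functions, and a contract is reduced to required field names and types. Each agent’s output is checked before the next agent sees it, and each call is charged against a per-request cap.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


class HandoffError(Exception):
    """An agent's output violates the contract expected by the next agent."""


class BudgetExceeded(Exception):
    """A request tried to make more agent calls than its budget allows."""


@dataclass
class CallBudget:
    max_calls: int
    used: int = 0

    def charge(self) -> None:
        # Refuse the call before it is made, so fan-out stops at the cap.
        if self.used >= self.max_calls:
            raise BudgetExceeded(f"call budget of {self.max_calls} exhausted")
        self.used += 1


def validate_handoff(payload: Dict, required: Dict[str, type], stage: str) -> Dict:
    """Check that every required field is present with the expected type."""
    for key, expected in required.items():
        if key not in payload:
            raise HandoffError(f"{stage}: missing field '{key}'")
        if not isinstance(payload[key], expected):
            raise HandoffError(f"{stage}: field '{key}' is not {expected.__name__}")
    return payload


# A stage is (name, agent function, contract for the agent's output).
Stage = Tuple[str, Callable[[Dict], Dict], Dict[str, type]]


def run_pipeline(stages: List[Stage], initial: Dict, budget: CallBudget) -> Dict:
    payload = initial
    for name, agent, contract in stages:
        budget.charge()
        # Fail at the stage that produced the bad output, not downstream.
        payload = validate_handoff(agent(payload), contract, name)
    return payload
```

Because the error names the stage that violated its contract, a cascade is reported at its origin rather than at the agent where the symptom finally appears. A production version would likely add token and wall-clock limits alongside the call count, and semantic checks beyond field types.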
The defenses are structural, not heroic: validate at every handoff rather than trusting upstream outputs, give every agent a hard budget (token, call-count, and wall-clock), make termination a single agent’s explicit responsibility, and instrument each agent’s decisions so a coordination failure is visible before it becomes a customer-facing one.

How Do You Monitor a Multi-Agent System in Production?

You cannot monitor a multi-agent system the way you monitor a single service. The unit of observability is the interaction trace, not the individual call. Standard application monitoring tells you an agent responded in 800ms; it does not tell you the orchestrator dispatched the same sub-task three times because a worker kept returning malformed JSON.

A workable monitoring baseline tracks four things per request:

- The full coordination trace: every dispatch, handoff, and decision, linked by a single request ID, so you can replay how the agents reached a result.
- Per-agent decision logs: what each agent was asked, what it decided, and why, captured well enough to detect drift when behaviour shifts.
- Budget consumption: calls, tokens, and latency per agent, with alerts when a request exceeds its envelope (the early signal of a livelock or runaway fan-out).
- Handoff validation outcomes: how often each inter-agent contract is violated, which localizes cascades to their origin rather than their symptom.

Building this observability layer is most of the real engineering work in a production multi-agent system, and it’s why moving from prototype to production is a project in its own right, the same gap we describe in what it takes to move a generative AI prototype into production. A demo coordinates happily on the happy path. Production is where you discover that 5% of requests trigger a coordination edge case nobody traced.

How Does Multi-Agent Reinforcement Learning Differ from LLM Orchestration?

These are different disciplines that share a name, and conflating them causes real confusion.

Multi-agent reinforcement learning (MARL) is about agents that learn coordination policies through reward over many episodes; the agents’ behaviour is a trained artifact, optimized for a reward signal in environments like games, robotics, or market simulation. LLM-based multi-agent orchestration is about agents whose behaviour is specified through prompts, tools, and control flow; coordination is engineered, not learned.

The practical implication: MARL gives you emergent, optimized coordination at the cost of interpretability and a training apparatus, while LLM orchestration gives you inspectable, hand-designed coordination that you debug like software. For the production GenAI use cases most teams face, orchestration is the relevant frame: you want to read why the system did something, not infer it from a learned policy. MARL becomes relevant when the coordination strategy itself is the thing that needs optimizing and you have an environment to train in. Choosing between them is a foundational architecture decision, not an implementation detail.

Choosing the Orchestration Approach

The framework decision follows the architecture decision, not the other way around. Once you know your coordination topology, the choice between an established framework and a hand-built orchestrator turns on how much your control flow deviates from what a framework assumes. We treat this as its own decision and cover it in depth in how to choose an AI agent framework for production, including when building your own thin orchestrator beats adopting a heavyweight one.

The throughline across both decisions: complexity must be justified by the problem, not by the tooling. A framework that makes it trivial to add agents also makes it trivial to over-engineer. The discipline is the same one we apply across every generative AI engagement: start from the simplest design that could work, and add coordination only when you’ve proven you need it.

FAQ

What is a multi-agent system, and how do its agents coordinate?

A multi-agent system is a set of semi-autonomous components, each with its own goal, tools, and decision loop, that coordinate to solve a problem no single component owns end to end. Coordination happens through one of a few topologies (orchestrator-worker, sequential pipeline, shared blackboard, or peer-to-peer negotiation), each trading flexibility against observability. The two design questions that determine success are how responsibility is decomposed among agents and how those agents communicate.

When does a problem genuinely require multi-agent architecture versus single-agent or plain automation?

Multi-agent is justified only when the problem requires distributed reasoning: irreducibly separate context per sub-task, genuine concurrency with a measured latency win, or independent failure and retry domains. If none of these hold, you have a single-agent problem carrying coordination tax, or a case for plain deterministic automation. The disciplined approach builds and measures a single-agent baseline first, because teams rarely regret collapsing an over-decomposed system and frequently regret the reverse.

How do multi-agent systems break in production?

They break by coordinating wrongly even when individual agents behave correctly. The specific modes are failure cascades (one agent’s subtle error amplified downstream), deadlocks and livelocks (overlapping ownership of a termination decision), behavioural drift (a model or prompt change rippling through coordination), and multiplicative cost and latency blowup. The defenses are structural: validate at every handoff, give each agent a hard budget, make termination one agent’s explicit job, and instrument every decision.

What design patterns govern inter-agent communication and responsibility decomposition?

Responsibility decomposition gives each agent a bounded contract (inputs accepted, outputs guaranteed, and decisions it is explicitly forbidden to make) to prevent overlapping ownership. Communication follows recognizable topologies: orchestrator-worker (easiest to monitor, single point of failure), sequential pipeline (predictable but error-compounding), blackboard/shared state (flexible but race-prone), and peer-to-peer (maximally flexible, minimally observable). Orchestrator-worker is the production default precisely because observability is built into the topology.

How do I monitor a multi-agent system in production and detect coordination failure early?

The unit of observability is the interaction trace, not the individual call. A workable baseline tracks the full coordination trace linked by request ID, per-agent decision logs to detect drift, budget consumption with alerts for runaway fan-out, and handoff validation outcomes to localize cascades to their origin. Building this observability layer is most of the real engineering work in a production multi-agent system.

How does multi-agent reinforcement learning differ from LLM-based multi-agent orchestration?

Multi-agent reinforcement learning trains coordination policies through reward over many episodes, so behaviour is a learned, optimized artifact at the cost of interpretability. LLM-based orchestration specifies coordination through prompts, tools, and control flow, so behaviour is engineered and inspectable, debugged like software. Orchestration is the relevant frame for most production GenAI use cases; MARL becomes relevant when the coordination strategy itself needs optimizing and you have a training environment.

What orchestration frameworks and patterns exist for coordinating multiple agents, and how do I choose between them?

The framework decision follows the architecture decision: once you know your coordination topology, the choice between an established framework and a hand-built orchestrator turns on how far your control flow deviates from what a framework assumes. A framework that makes adding agents trivial also makes over-engineering trivial, so the deciding discipline is whether complexity is justified by the problem rather than enabled by the tooling. We cover this decision in depth in our guide to choosing an AI agent framework for production.

Multi-agent architecture is a feasibility question before it is an engineering one: does this problem require distributed reasoning, and if it does, can you observe the coordination well enough to catch a cascade before your customer does? Answer those two before you wire the second agent. Every one you add after that is complexity you must justify, not capability you get for free.