You Don't Need a Multi-Agent System

Multi-agent systems look impressive in demos. In production, a single well-designed agent usually gets the job done with far less pain.

Every few months, a new architecture pattern takes over the AI engineering conversation. Right now, it's multi-agent systems.

The pitch sounds compelling: break your task into smaller pieces, assign each piece to a specialized agent, let them coordinate. Like a well-run team. Like good software design, even.

And then you actually build one.

What "multi-agent" means in practice

At its simplest, a multi-agent system is just multiple LLM calls coordinated together. One agent plans. One executes. One critiques. One summarizes. An orchestrator sits at the top and routes work between them.

Here's what a typical setup looks like:

async def run_multi_agent_pipeline(task: str):
    # Planner breaks the task down
    plan = await planner_agent.run(f"Break this into steps: {task}")

    # Executor runs each step
    results = []
    for step in parse_steps(plan):  # parse_steps (not shown) splits the plan text into steps
        result = await executor_agent.run(step)
        results.append(result)

    # Critic reviews
    critique = await critic_agent.run(f"Review these results: {results}")

    # If critic is unhappy, start over (note: nothing bounds this recursion)
    if "needs improvement" in critique.lower():
        return await run_multi_agent_pipeline(task)

    return await summarizer_agent.run(f"Summarize: {results}")

That's four separate LLM calls for one task. Four round trips. Four places where something can go wrong. Four prompts to maintain. And I haven't shown the error handling, the state management, or the part where the critic decides everything "needs improvement" and you're stuck in a recursive loop at 2am wondering what you've done.

The actual problems this creates

Debugging becomes distributed systems debugging. When a single-agent setup gives you a bad output, you look at the prompt and the response. That's it. With a multi-agent system, the bad output could have come from any agent in the chain. Or from the orchestrator misrouting. Or from state that got dropped between hops. You're now diagnosing a distributed system where every node is a black box.

Latency adds up. Each LLM call takes 1 to 3 seconds, sometimes more. Four agents means 4 to 12 seconds minimum before you get a response, assuming nothing retries. Add a reflection loop and you're looking at 30+ seconds for something a single agent handles in 3.

Cost compounds. You pay per token on every agent call. The planner reads the full task. The executor reads the plan plus the task. The critic reads everything. The summarizer reads it all again. A 1000-token task can easily burn 8000 tokens across all the agents. That's 8x the cost for roughly equivalent quality.
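That 8x figure is easy to sanity-check with back-of-envelope arithmetic. Every token count below is an illustrative assumption, not a measurement, but the shape of the problem holds: each downstream agent re-reads what the upstream agents produced.

```python
# Back-of-envelope token accounting for the four-agent pipeline.
# All counts are illustrative assumptions, not measured values.
TASK = 1_000      # tokens in the original task
PLAN = 400        # plan the planner writes (assumed)
RESULTS = 1_200   # combined executor output (assumed)
CRITIQUE = 200    # critic's review (assumed)
SUMMARY = 300     # final summary (assumed)

planner = TASK + PLAN                      # reads the task, writes the plan
executor = PLAN + TASK + RESULTS           # reads plan + task, writes results
critic = TASK + PLAN + RESULTS + CRITIQUE  # reads everything, writes critique
summarizer = RESULTS + SUMMARY             # reads results, writes summary

total = planner + executor + critic + summarizer
print(total)  # 8300 — roughly 8x the 1,000-token task
```

The exact numbers will vary with your prompts, but the re-reading pattern is structural: context accumulates at every hop, and you pay for it at every hop.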

Coordination becomes the product. I've watched teams spend more engineering time on the orchestration layer than on the AI logic itself. That's usually a sign you've chosen the wrong abstraction for the problem.

What a single agent actually looks like

Here's the same kind of task, handled by one agent with tools:

from anthropic import Anthropic

client = Anthropic()

tools = [
    {
        "name": "search_codebase",
        "description": "Search for relevant code in the repository",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"}
            },
            "required": ["query"]
        }
    },
    {
        "name": "read_file",
        "description": "Read the contents of a file",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"}
            },
            "required": ["path"]
        }
    },
    {
        "name": "write_file",
        "description": "Write or update a file",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"}
            },
            "required": ["path", "content"]
        }
    }
]

def run_agent(task: str):
    messages = [{"role": "user", "content": task}]

    while True:
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=4096,
            tools=tools,
            messages=messages
        )

        # Anything other than a tool call means the model is done
        # (this also covers max_tokens, which would otherwise spin forever)
        if response.stop_reason != "tool_use":
            text_blocks = [b.text for b in response.content if b.type == "text"]
            return "".join(text_blocks)

        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result
                })

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

One model. One loop. The agent decides when it's done, calls whatever tools it needs, and self-corrects mid-execution. No orchestrator needed because the model itself is doing the orchestration. When something goes wrong, there's exactly one place to look.
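The loop above calls an execute_tool helper that isn't shown. A minimal sketch of what it could look like — the handler functions here, including the stubbed search_codebase, are hypothetical stand-ins for real implementations:

```python
from pathlib import Path

def search_codebase(query: str) -> str:
    # Hypothetical stub: a real version might shell out to grep or ripgrep
    return f"matches for {query!r}: (none in this sketch)"

def read_file(path: str) -> str:
    return Path(path).read_text()

def write_file(path: str, content: str) -> str:
    Path(path).write_text(content)
    return f"wrote {len(content)} chars to {path}"

# Dispatch table mapping tool names (as declared in the schema) to handlers
TOOL_HANDLERS = {
    "search_codebase": lambda args: search_codebase(args["query"]),
    "read_file": lambda args: read_file(args["path"]),
    "write_file": lambda args: write_file(args["path"], args["content"]),
}

def execute_tool(name: str, args: dict) -> str:
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return f"error: unknown tool {name!r}"
    try:
        return handler(args)
    except Exception as exc:
        # Surface failures as text so the model can see them and adjust
        return f"error: {exc}"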

[Diagram: single-agent vs. multi-agent architecture]

When multiple agents actually make sense

I'm not saying multi-agent is never the right call. There are real scenarios where it's worth the complexity.

If your tasks are genuinely parallel and independent, running multiple agents concurrently can cut wall-clock time significantly. If you need one agent to verify another's output in a high-stakes domain where hallucinations are costly, the overhead is justified. If different parts of your pipeline need different models with different cost and capability trade-offs, a multi-agent setup can make that cleaner.
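For the genuinely-parallel case, the coordination layer can stay thin. A sketch using asyncio.gather, where agent_call is a placeholder for a real LLM invocation:

```python
import asyncio

async def agent_call(task: str) -> str:
    # Placeholder for a real LLM call; sleep simulates network latency
    await asyncio.sleep(0.1)
    return f"done: {task}"

async def run_parallel(tasks: list[str]) -> list[str]:
    # Independent tasks run concurrently, so wall-clock time is
    # roughly one call's latency, not the sum of all of them
    return await asyncio.gather(*(agent_call(t) for t in tasks))
```

Three 100ms calls finish in roughly 100ms of wall-clock time, not 300ms — but the win only materializes when the tasks truly don't depend on each other.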

But those are specific problems with specific justifications. Most developers I've seen reach for multi-agent architecture because it looks good in a diagram, not because their single agent actually failed.

The question worth asking first

Before you build an orchestration layer, ask: has my single agent actually failed at this?

Not "could it theoretically be improved with multiple agents?" but "has it demonstrably failed and do I understand why?"

If you can't say yes to both, you're adding complexity to solve a problem you don't have. That's the oldest mistake in software engineering. It just looks shinier now because we dress it up with agent names and flow diagrams.

Start with one agent. Give it good tools. Write a clear system prompt. Let it run. See how far it gets before you reach for something more complicated.

The best architecture is the one you can debug at midnight. More often than not, that's the simple one.