Why AI Agents Fail in Production

TL;DR: “The agent did something wrong” usually gets diagnosed as a model problem. Most of the time it isn’t. Capability failures (wrong reasoning) and governance failures (no structure to catch wrong reasoning) look identical from the outside but need completely different fixes. This post is about telling them apart, and why most teams are currently solving the wrong one.

The agent said the feature was done. I asked for the commit SHA. There wasn’t one. When I checked the branch, two of the three modules it described implementing hadn’t changed.

The instinct in that moment is to reach for a better prompt, a smarter model, maybe a different tool call. That instinct is usually wrong.

What happened wasn’t a reasoning failure. The agent completed exactly the task it was given, interpreted through its own completion criterion, because no explicit one existed. There was no audit trail to check what it actually did. There was no scope boundary to constrain what “done” even meant. The model behaved correctly inside a system that gave it no structure to behave correctly toward.

That’s a governance failure, not a capability failure. And the fix is not a better model.

Two failure modes that look the same

When an agent produces bad output, the failure is almost always categorized as one thing: the AI got it wrong. Which leads to one solution category: better AI.

The problem is that “the AI got it wrong” conflates two distinct failure modes that have nothing to do with each other.

Capability failure: the model reasoned incorrectly. It missed a constraint, hallucinated a fact, drew a wrong inference. The fix lives in the model layer: better prompt, better retrieval, better fine-tuning, sometimes a more capable model.

Governance failure: the system had no invariant to catch or prevent what the agent did. The agent may have reasoned perfectly well and still produced a wrong outcome, because the surrounding structure gave it nothing to constrain against.

There’s a useful diagnostic test: would a smarter model have prevented this?

If yes, if the failure was clearly about incorrect reasoning or a factual miss, that’s a capability failure.

If no, if a brilliant expert given the same underspecified task would have made the same wrong choice, or a different wrong choice, because the task itself had no defined success condition, that’s a governance failure. Upgrading the model doesn’t help.

Most of the “unpredictable agent” complaints I’ve seen are governance failures. The problem gets framed as model unreliability because that’s what’s visible. The actual cause is invisible: the absence of structure.

The five structural gaps

These are the governance gaps that show up repeatedly, not as edge cases, but as the default state of most agent deployments. The zh-TW companion post AI 代理常見痛點與我們的嘗試 goes deeper on each one with narrative examples. Here I want to name the structural invariant that’s missing in each case.

Output not verifiable → no completion criterion or audit trail. The agent says “done.” You have no artifact to check against. The agent’s word that something happened is not evidence that it happened. The missing invariant: every task completion requires an attached evidence artifact: a file path, a commit SHA, a test result, something external to the conversation.
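A minimal sketch of that invariant, in Python. The names here (`Evidence`, `TaskReport`, `accept_completion`) are hypothetical and chosen for illustration; they are not taken from Agentic OS. The only idea it encodes is that a completion claim with no attached artifact is rejected.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    kind: str   # hypothetical labels, e.g. "commit_sha", "file_path", "test_result"
    value: str  # the artifact itself, checkable outside the conversation


@dataclass
class TaskReport:
    task_id: str
    summary: str  # the agent's own claim about what it did
    evidence: list[Evidence] = field(default_factory=list)


def accept_completion(report: TaskReport) -> bool:
    """Accept "done" only if at least one non-empty artifact is attached."""
    return any(e.value.strip() for e in report.evidence)


claimed = TaskReport("export-fix", "Implemented all three modules.")
backed = TaskReport(
    "export-fix",
    "Implemented all three modules.",
    [Evidence("commit_sha", "abc123")],
)
print(accept_completion(claimed))  # False: a claim without an artifact
print(accept_completion(backed))   # True: there is something to check
```

This only checks that an artifact is present. A real gate would also verify it, for example by confirming that the commit exists on the branch.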

Steps skipped → no phase gate. Given a complex task, agents move toward output by the shortest path. Scope-setting, dependency mapping, impact analysis (anything that doesn’t look like “doing the thing”) gets skipped. The missing invariant: phases with entry and exit conditions that must be satisfied before proceeding. Pete Hodgson has written about this from an angle worth noting: when a problem has many valid solutions, the probability that an agent independently arrives at the one you actually wanted approaches zero. Pre-alignment isn’t overhead. It’s the phase gate that prevents redoing work.
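One way to picture a phase gate is a runner that refuses to start a phase until its entry condition holds, and refuses to move on until its exit condition holds. This is a sketch under assumptions: `Phase`, `GateError`, and `run_phases` are illustrative names, and the conditions are plain functions over a shared state dictionary.

```python
from dataclasses import dataclass
from typing import Callable

State = dict[str, object]
Check = Callable[[State], bool]


@dataclass
class Phase:
    name: str
    can_enter: Check  # must hold before the phase may start
    can_exit: Check   # must hold before the next phase may start


class GateError(RuntimeError):
    """Raised when a phase's entry or exit condition is not satisfied."""


def run_phases(phases: list[Phase], state: State,
               work: Callable[[str, State], None]) -> list[str]:
    """Run phases in order, stopping at the first gate that fails."""
    completed = []
    for phase in phases:
        if not phase.can_enter(state):
            raise GateError(f"entry condition failed for {phase.name!r}")
        work(phase.name, state)  # the agent does the phase's work here
        if not phase.can_exit(state):
            raise GateError(f"exit condition failed for {phase.name!r}")
        completed.append(phase.name)
    return completed


phases = [
    Phase("scope", lambda s: True, lambda s: "scope_doc" in s),
    Phase("implement", lambda s: "scope_doc" in s, lambda s: "diff" in s),
]
```

An agent that skips scope-setting never produces `scope_doc`, so the run stops at the first exit gate instead of proceeding to implementation.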

Cross-session amnesia → no state handoff mechanism. Every new conversation is a blank slate. Decisions made in session one are unknown in session two. The agent rediscovers problems you’ve already solved, proposes patterns you’ve already rejected, rebuilds context you’ve already paid to build. An IEEE Spectrum report on AI coding tools documented this concretely: in longer sessions, agents increasingly regenerated functions that already existed and ignored conventions established earlier in the same session. The missing invariant: a structured work log that carries decisions forward across session boundaries. The mechanism we use is stupid-simple. It’s essentially forcing the agent to keep a diary. That description isn’t flattering, but cross-session amnesia is real enough that stupid-simple works.
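The diary idea can be sketched as an append-only JSON Lines file that each new session reads before doing anything else. The file layout, field names, and function names below are assumptions made for illustration, not the format Agentic OS uses.

```python
import json
from pathlib import Path


def record_decision(log_path: Path, session: str, decision: str, reason: str) -> None:
    """Append one decision to the work log; earlier entries are never rewritten."""
    entry = {"session": session, "decision": decision, "reason": reason}
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")


def load_decisions(log_path: Path) -> list[dict]:
    """Read every recorded decision, oldest first."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def handoff_preamble(log_path: Path) -> str:
    """Build the text a new session would receive before it starts work."""
    decisions = load_decisions(log_path)
    if not decisions:
        return ""
    lines = [
        f"- [{d['session']}] {d['decision']} (why: {d['reason']})"
        for d in decisions
    ]
    return "Decisions already made:\n" + "\n".join(lines)
```

Recording the reason alongside the decision matters as much as the decision itself: it is what stops a later session from re-proposing a pattern that was already rejected.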

Unbounded token cost → no resource scoping. An agent given a large task will read everything it can find, activate every relevant capability, and use as much context as the task allows it to justify. Without resource scoping, costs are unpredictable and you have no way to set expectations before a task starts. The missing invariant: task classification that routes to appropriately sized execution paths before the task begins.
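Here is a rough sketch of classification-then-routing. Everything in it is hypothetical: the tier names, the classification rule, and especially the budget numbers, which are placeholders rather than measurements. The point is only that the ceiling is chosen before the task begins and enforced while it runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionPath:
    name: str
    token_budget: int    # hard ceiling agreed before the task starts
    max_files_read: int


# Hypothetical tiers; the numbers are placeholders, not measured values.
PATHS = {
    "quick_win": ExecutionPath("quick_win", token_budget=20_000, max_files_read=5),
    "feature": ExecutionPath("feature", token_budget=80_000, max_files_read=40),
}


def classify(touches_schema: bool, touches_auth: bool, files_expected: int) -> ExecutionPath:
    """Pick an execution path from coarse, up-front facts about the task."""
    if touches_schema or touches_auth or files_expected > 5:
        return PATHS["feature"]
    return PATHS["quick_win"]


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push usage past the path's ceiling."""


class Meter:
    """Tracks token usage against the budget of the chosen path."""

    def __init__(self, path: ExecutionPath):
        self.path = path
        self.used = 0

    def charge(self, tokens: int) -> None:
        if self.used + tokens > self.path.token_budget:
            raise BudgetExceeded(
                f"{self.path.name}: {self.used} + {tokens} exceeds {self.path.token_budget}"
            )
        self.used += tokens
```

With this shape, a date-format fix that starts reading the whole repository fails loudly at the budget line instead of silently becoming an expensive task.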

Scope creep → no capability boundary. This is the quietest failure mode. The agent does what you asked, and also reorganizes a module you didn’t ask it to touch, and also “helpfully” updates a config file while it was in the neighborhood. Security researcher Johann Rehberger (Embrace the Red) made this failure mode concrete in April 2025 when he spent $500 testing Devin AI’s response to embedded instructions in GitHub issues, then reported the results to Cognition: 84–85% of attacks succeeded in getting the agent to execute actions outside the intended scope. That’s an extreme case, but the everyday version of this (the agent quietly expanding what “done” means) is the same structural gap. The missing invariant: explicit capability boundaries that define what the agent is allowed to do, not just what it’s been asked to do.
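As an illustration of an explicit capability boundary, the sketch below allows writes only under a declared set of directories. The class and method names are hypothetical, and it covers file writes only. It treats paths as relative POSIX-style strings and rejects absolute paths and any `..` segment rather than trying to resolve them.

```python
from pathlib import PurePosixPath


class OutOfScope(PermissionError):
    """Raised when the agent tries to write outside its declared boundary."""


class CapabilityBoundary:
    def __init__(self, writable_roots: list[str]):
        # Directories the task is allowed to modify, relative to the repo root.
        self.roots = [PurePosixPath(r) for r in writable_roots]

    def allows(self, path: str) -> bool:
        p = PurePosixPath(path)
        # Refuse anything that could escape the roots instead of normalizing it.
        if p.is_absolute() or ".." in p.parts:
            return False
        return any(p == root or root in p.parents for root in self.roots)

    def check_write(self, path: str) -> None:
        if not self.allows(path):
            raise OutOfScope(f"write to {path!r} is outside the task boundary")


boundary = CapabilityBoundary(["src/export"])
print(boundary.allows("src/export/csv_writer.py"))  # True
print(boundary.allows("config/app.yaml"))           # False
```

The config file "in the neighborhood" is rejected here not because the edit is judged to be bad, but because it was never in scope.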

None of these gaps are model problems. A more capable model, given the same absent structure, makes the same category of errors, just more convincingly.

Engineering already solved these problems

These aren’t new problems with new solutions. They’re old problems that software engineering solved decades ago, applied to a different execution substrate.

Governance gap	Engineering equivalent
No completion criterion	CI gate: no merge without passing checks
No phase gate	PR review requirement: code doesn’t ship without sign-off
No state handoff	Audit log / ADR: decisions are recorded, not reconstructed
No resource scoping	Budget / SLA: bounded cost before work starts
No capability boundary	Principle of least privilege: access limited to what the task requires

The analogy isn’t decorative. These are the same structural mechanisms. When I built the evidence requirement for Agentic OS, what I was really doing was rebuilding CI gates and audit logs under different names. The insight wasn’t new. It was just late.

A CI gate doesn’t trust the developer’s word that the tests pass. It requires evidence. An audit log records decisions at the time they’re made, so they don’t need to be reconstructed from memory later. Least privilege limits what an agent can touch, not out of distrust, but to contain the blast radius when something goes wrong.

The AGENTS.md convention, now adopted across Claude Code, Cursor, and GitHub Copilot as a standard way for agents to load project context, is essentially a machine-readable project governance document. It’s the same idea as a team’s architecture decision record, but in a format the agent reads automatically. That’s not a coincidence. It’s the same structural need surfacing in a new context.

What’s missing in most agent deployments isn’t better AI. It’s the application of mechanisms that software engineering already knows work.

What governance actually costs

“Adding structure” sounds like adding overhead. It’s worth being concrete about the actual numbers.

We measured governance overhead across several task types in Agentic OS v1.1 (April 2026, using chars/4 as the token estimation formula; actual counts vary by ±10% depending on tokenizer). For a quick-win task (something like fixing a date format in a CSV export), the governance overhead came to 17,041 tokens. For a complex feature touching API design, authentication, and database schema, it came to 50,975 tokens.
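For reference, the chars/4 estimate and its ±10% band can be written out as below. The function names are mine, and rounding down to a whole token is an assumption; the measurements above only specify the chars/4 formula and the tolerance.

```python
def estimate_tokens(text: str) -> int:
    """Estimate token count as characters divided by four (rounded down)."""
    return len(text) // 4


def estimate_range(text: str, tolerance: float = 0.10) -> tuple[int, int]:
    """Return the (low, high) band implied by the stated tokenizer variance."""
    est = estimate_tokens(text)
    return round(est * (1 - tolerance)), round(est * (1 + tolerance))


sample = "x" * 400
print(estimate_tokens(sample))  # 100
print(estimate_range(sample))   # (90, 110)
```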

Those numbers sound large until you compare them to the cost of an ungoverned failure. A governance failure typically means: an undetected wrong completion that gets discovered later, a context restart, redone work, and scope cleanup. None of those costs are bounded or predictable.

The governance overhead is bounded. It scales with task complexity in a predictable way: the lightest path costs roughly 17K tokens; the heaviest measured scenario costs under 62K. The cost of recovering from a scope error or a missed completion criterion is not bounded. It depends on when you find it.

This isn’t an argument for any particular framework. It’s an argument for the structure itself: known, upfront cost versus unbounded, discovery-time cost. That trade-off is the same one CI gates resolved for software deployment twenty years ago.

The question to ask before the task starts

None of this requires a framework. The diagnostic test at the task level is simpler than that.

Before your next agent task: what artifact would prove this is done?

Not “what would it mean to be done.” That’s vague enough that the agent will fill in the answer. What artifact, specifically, would you point to afterward and say: here is the evidence this completed correctly?

If you can answer that question before the task starts, you have a completion criterion. If you can’t, you don’t, and the agent will invent one. That invented criterion is almost never the one you wanted. The everyday version doesn’t look like a security incident. It’s an agent that quietly refactored a module you didn’t mention, or updated a config file it found nearby. Its completion criterion included those things. Yours didn’t.

That’s the smallest possible governance structure. A definition of done, stated before work begins, tied to something observable.

The rest of the gaps (phase gates, state handoffs, resource scoping, capability boundaries) are the same logic applied at increasing scope. But they all start from the same place: deciding what “done” means before asking the agent to find out.

These observations are from building and using Agentic OS v1.1 (April 2026). The field moves fast. If a model capability has improved or a pattern here no longer holds, I want to know. The framework is open source and the issues are open: github.com/KbWen/agentic-os.

This post is part of a series on building real AI systems. Related reading: What Makes an AI Skill Different from a Prompt? covers the capability abstraction layer that sits below agent orchestration. The zh-TW companion post AI 代理常見痛點與我們的嘗試 covers the same failure catalogue with more narrative depth. Both build on Beyond Prompt: From Instructions to Building Systems.
