TL;DR: Governance tends to have a token cost you can put a ceiling on. Ungoverned work usually doesn’t; the recovery cost of an undetected error has no obvious upper bound. But the fix isn’t “split everything into the smallest possible pieces.” Caching changes the math: a fresh sub-context pays a cache-write premium and doesn’t inherit its parent’s cached prefix, so over-decomposition can cost more than the coarse version it replaced. The design target, at least with today’s caching, is the granularity that holds error cost and cache cost in check at the same time. It’s a moving target (pricing and model behavior change fast), so treat it as a way of thinking, not a fixed rule.


Most teams treat token spend the way they treat an electricity bill: it shows up after the fact, it’s mildly annoying, and it isn’t something they design around. That habit seems to be where a lot of the trouble starts. Token cost behaves less like a consequence of an agent setup and more like a property of it: decided, the way latency is, by the architecture rather than discovered in production.

Treating it as a design variable leads to two observations. One is reassuring: governance overhead is bounded. The other is the part that “just split the task” advice tends to skip past: structure has its own cost curve, and past a point, more of it makes the bill worse, not better. Neither of these is a law. Pricing and model behavior in this space move quickly, and some of the specifics below will date. The shape of the trade-off is what seems to hold.

Token cost is a design variable, not an afterthought

Most of what sets a task’s token cost is decided at intake, not during execution. How a task is classified (how much context it loads, how many skills it probes, whether it spawns sub-agents) is largely fixed before any real work happens. By the time the number shows up, most of it was already committed.

That’s probably why “optimize tokens later” rarely gets far. The expensive choices tend to be made up front: reading every file instead of probing an index, loading a full skill definition instead of its metadata, running a continuation as a cold start. Those are hard to optimize away afterward; mostly you just classify better next time. The case for governing agents with prompts and skills alone is partly an argument that the cheapest governance, a CLAUDE.md and one completion question, costs almost nothing, precisely because it moves the decision to intake.
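One way to picture "decided at intake" is a plan object whose token commitment is computable before any work starts. The sketch below is illustrative only: the task classes, field names, and every number in it are hypothetical, not taken from the benchmark or from any framework.

```python
from dataclasses import dataclass
from enum import Enum


class TaskClass(Enum):
    QUICK_FIX = "quick_fix"
    FEATURE = "feature"
    ARCHITECTURE = "architecture"


@dataclass(frozen=True)
class IntakePlan:
    """What a classification commits to before execution begins."""
    context_tokens: int        # context loaded up front
    skill_probes: int          # number of skills probed
    tokens_per_probe: int      # metadata probe is cheap, full definition is not
    sub_agents: int            # sub-agents spawned
    tokens_per_sub_agent: int  # context each sub-agent loads

    def committed_tokens(self) -> int:
        """Tokens already spoken for at intake, before any real work."""
        return (
            self.context_tokens
            + self.skill_probes * self.tokens_per_probe
            + self.sub_agents * self.tokens_per_sub_agent
        )


# Hypothetical plans; the numbers only illustrate the shape of the decision.
PLANS = {
    TaskClass.QUICK_FIX: IntakePlan(2_000, 1, 200, 0, 0),
    TaskClass.FEATURE: IntakePlan(8_000, 3, 200, 0, 0),
    TaskClass.ARCHITECTURE: IntakePlan(15_000, 5, 200, 3, 10_000),
}

if __name__ == "__main__":
    for task_class, plan in PLANS.items():
        print(f"{task_class.value}: {plan.committed_tokens():,} tokens committed at intake")
```

Misclassifying a quick fix as an architecture change in this toy table commits roughly twenty times the tokens before a single line is written, which is the sense in which the cost is a classification problem more than an execution problem.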

Bounded cost versus unbounded cost

The asymmetry underneath all of this is between two kinds of cost.

Governed work tends to have a ceiling. In the Agentic OS lifecycle benchmark, the heaviest scenario, an architecture change coordinated across parallel agents, lands around 61K tokens. You can argue about whether that’s high. It’s harder to argue that it isn’t bounded: it’s a number, it was knowable in advance, and it doesn’t keep growing while you look away.

Ungoverned work has no equivalent ceiling, which is where the public cost stories cluster. In one widely reported case, a company that rolled a coding agent out to thousands of engineers ran through its 2026 AI budget within about four months; a number of cost write-ups collect cases of the same shape. The recurring observation is that a coding agent is one of the first tools where spend isn’t bounded by user intent: it generates as much as you let it.

The first post in this series has the small-scale version of the same shape: an agent reported implementing three modules, two of which were never touched. The tokens it spent saying so were trivial. The unbounded part is everything downstream that trusted the false report. A missing-evidence failure is, oddly, one of the cheaper ones; you tend to find out fast. The costly failures are the ones that compound quietly.

Governance, in this framing, isn’t really about spending fewer tokens. It reads more like a way to move spend out of the unbounded column and into the bounded one.

Granularity has a cost too: caching changes the math

The tempting next step is to decompose everything into the smallest possible units. That runs into where the cost actually lives once caching is involved.

A cache read costs roughly a tenth of a normal input token; a cache write costs more than a normal input token, on the order of 1.25× for the short (five-minute) window and 2× for the longer (one-hour) one as of early 2026 (Anthropic’s prompt caching docs carry the current multipliers, and they do move). The discount only shows up when you read the same cached prefix repeatedly. The premium is paid every time you write a new one.

That detail complicates the “smaller is cheaper” instinct, mostly because of two ways decomposition interacts with the cache:

  • Sub-agents don’t inherit the parent’s cache by default. Each fresh sub-context typically pays its own cache write for the prefix it needs. (Some setups now offer a fork mode that reuses the parent cache, which is itself a sign the cost was real enough to engineer around.) Split a task five ways and you may have bought five cache writes instead of amortizing one across many cheap reads.
  • The cache has a TTL. The default window is short. Fragment work so the gap between steps runs past it (a sub-agent that takes too long to return, a fan-out that stalls) and the cache can expire, so the next step rebuilds it at full price.

So an over-decomposed task can end up costing more than the coarse one it replaced: more writes, fewer reads, more rebuilds. The same benchmark shows the other direction working: a continuation that loads context once and reads it from cache on later turns cut a feature’s execution cost by roughly half versus re-reading everything each time. Cache locality, more than task count, was doing the saving there.
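The arithmetic behind that comparison can be sketched in a few lines. This is a toy model under stated assumptions: it counts only the shared prefix, ignores output tokens and per-turn input, and uses the early-2026 multipliers quoted above (0.1× read, 1.25× short-window write). The function names are hypothetical.

```python
# Multipliers relative to one normal input token. These are the early-2026
# figures quoted in the text and are assumptions here; check current pricing.
CACHE_READ = 0.10
CACHE_WRITE = 1.25  # short-window write premium


def warm_context_cost(prefix_tokens: int, turns: int) -> float:
    """Prefix cost for one context: one cache write, then a read per later turn."""
    if turns < 1:
        raise ValueError("turns must be at least 1")
    return prefix_tokens * (CACHE_WRITE + (turns - 1) * CACHE_READ)


def fan_out_cost(prefix_tokens: int, sub_agents: int, turns_per_agent: int,
                 expired_rebuilds: int = 0) -> float:
    """Prefix cost when every sub-agent writes its own copy of the prefix.

    Each TTL expiry turns one cheap read into another cache write.
    """
    per_agent = warm_context_cost(prefix_tokens, turns_per_agent)
    rebuild_penalty = prefix_tokens * (CACHE_WRITE - CACHE_READ)
    return sub_agents * per_agent + expired_rebuilds * rebuild_penalty


if __name__ == "__main__":
    # Same prefix (10,000 tokens) and same total turns (10) in both layouts.
    coarse = warm_context_cost(10_000, turns=10)
    split = fan_out_cost(10_000, sub_agents=5, turns_per_agent=2)
    stalled = fan_out_cost(10_000, sub_agents=5, turns_per_agent=2,
                           expired_rebuilds=1)
    print(f"one warm context: {coarse:,.0f} token-equivalents")
    print(f"five sub-agents:  {split:,.0f} token-equivalents")
    print(f"with one expiry:  {stalled:,.0f} token-equivalents")
```

Under these assumptions the single warm context comes to about 21,500 token-equivalents and the five-way split to about 67,500, roughly three times as much for the same prefix and the same number of turns. A fork mode that reuses the parent cache would change the second figure, so treat the ratio as a property of this model, not of any particular tool.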

The design target, at least with today’s caching, doesn’t seem to be “maximally fine.” It looks more like the granularity that contains non-determinism (pieces small enough that the agent can’t wander far) while keeping cache locality (pieces large enough that you keep reading one warm prefix instead of writing many cold ones). Too coarse and error cost drifts toward unbounded; too fine and cache cost climbs. The workable range sits somewhere in between, and it shifts as caching behavior changes, which it does, often.

The SLA/SLO parallel

This rhymes with the reasoning behind service-level objectives. You don’t set an SLO to make a system fast; you set it to make its behavior predictable, turning an open-ended risk into a budget you provisioned for on purpose. You pay for a known amount of headroom and monitoring to guard against an outage whose cost you can’t predict.

Token budgeting for agents looks like the same move. The governance overhead (classification, an evidence check, scoped context) is the known cost you pay deliberately. What it fences in is the runaway: the silently compounding error, the cold-start rebuild loop, the helpful sub-agent that rewrote three files nobody asked about. The goal isn’t a smaller bill so much as a more predictable one.
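In code, the budget side of that parallel can be as small as a counter with a hard ceiling. The sketch below is a minimal illustration, not part of any framework; `TokenBudget` and `BudgetExceeded` are hypothetical names, and how token counts reach `charge` is left to whatever the agent runtime reports.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spend past the provisioned ceiling."""


class TokenBudget:
    """A hard ceiling on token spend for one task (hypothetical helper)."""

    def __init__(self, ceiling: int) -> None:
        if ceiling <= 0:
            raise ValueError("ceiling must be positive")
        self.ceiling = ceiling
        self.spent = 0

    def charge(self, tokens: int) -> int:
        """Record spend and return the remaining headroom.

        Refuses the charge, leaving spend unchanged, if it would exceed
        the ceiling.
        """
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.spent + tokens > self.ceiling:
            raise BudgetExceeded(
                f"charge of {tokens} would exceed ceiling {self.ceiling} "
                f"(already spent {self.spent})"
            )
        self.spent += tokens
        return self.ceiling - self.spent


if __name__ == "__main__":
    budget = TokenBudget(ceiling=1_000)
    print(budget.charge(400))  # remaining headroom after the first step
    try:
        budget.charge(700)
    except BudgetExceeded as exc:
        print(f"stopped: {exc}")
```

The point of raising rather than logging is the same as with an error budget: the runaway stops at a number chosen in advance instead of at whatever the invoice says.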

The cheapest place to start is also the easiest to overlook: a project memory file the model reads on every task. AGENTS.md (which began life in OpenAI’s Codex and is now read by tools like Cursor and GitHub Copilot) and CLAUDE.md (Anthropic’s project-memory convention) cost a few thousand tokens that the cache then serves cheaply for the rest of a session. Pair that with one question at completion (what artifact proves this is done?) and you have a usable floor of governance for almost nothing. Most of what sits above it is the same trade at larger scale.
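The completion question can be made mechanical for the simplest kind of artifact, a file the agent claims to have written. This is a minimal sketch assuming claims arrive as relative paths and that a modification time after the task started counts as evidence; `missing_evidence` is a hypothetical helper, and real evidence (tests passing, a diff) would need more than this.

```python
from pathlib import Path


def missing_evidence(claimed: list[str], root: Path, started_at: float) -> list[str]:
    """Return the claimed artifacts with no evidence behind them.

    An artifact counts as evidenced only if it exists as a file under `root`
    and was modified at or after `started_at` (a POSIX timestamp).
    """
    missing = []
    for rel in claimed:
        path = root / rel
        if not path.is_file() or path.stat().st_mtime < started_at:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "module_a.py").write_text("VALUE = 1\n")
        # The agent claims three modules; only one was actually written.
        claims = ["module_a.py", "module_b.py", "module_c.py"]
        print(missing_evidence(claims, root, started_at=0.0))
```

Run against a report like the one from the first post, three modules claimed and one touched, the check returns the two untouched paths, and the task stays open until that list is empty.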

None of this feels settled. The caching math especially is a moving target, and the right granularity a year from now may not look like today’s. What seems stable underneath is the shape: a known cost, paid on purpose, to fence in one you can’t predict.

This post is part of a series on building real AI systems. Earlier posts: “Why AI Agents Go Wrong: It’s Not the Model”; “Prior Art: What Distributed Systems Already Knows”; and “No Evidence, No Completion.” A Chinese companion piece, “Token 成本的真相:分級,但別分太細” (roughly, “The truth about token cost: tier it, but not too finely”), takes the same topic from a more first-person angle. The framework is open source at github.com/KbWen/agentic-os.