Prompt Caching on KbWen Blog

The Truth About Token Cost: Tier Your Tasks, but Not Too Finely (Token 成本的真相:分級,但別分太細)

KbWen — Mon, 25 May 2026 11:00:00 +0800

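TL;DR: Treat tokens as a design variable, not a bill you check at the end of the month. Ungoverned tasks have no cost ceiling. The reverse doesn't hold either, though: slicing tasks finer doesn't keep making them cheaper. Sub-agents don't share a cache, and once the TTL lapses the cache has to be rebuilt, so over-splitting ends up costing more. What you're really after is the right granularity: fine enough that the AI can't wander off, coarse enough that it keeps reading the same warm cache. These numbers will keep shifting with the tools, so read them as concepts.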

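For a while I didn't look at token spend at all. Then an automated refactoring task ran for most of a night, and when I checked usage the next day something clicked: the cost of this thing isn't something you "find out when it's done". It's something I had already decided before I started.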

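This post is the Chinese-language companion to the English one, but the angle is a bit different. The English post takes a comparative approach to why the cost of governance is predictable; this one is the version I worked out by trial and error, including one intuition I thought was right and later found was wrong.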

Token cost is decided before the task starts

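How many tokens a task will spend is mostly fixed the moment it gets classified: how much context to load, whether to go through every skill document, whether to spin up more sub-agents. By the time you see the number, it has already been spent.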

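That's why "save tokens later" usually doesn't do much. The expensive decisions all come early: reading the whole codebase instead of checking the index first, loading a full skill definition instead of its metadata, rerunning work that could have been continued as a brand-new cold start. These are hard to fix after the fact; what you can do is classify better next time.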

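In the earlier post on getting basic governance from prompts and skills alone (只用 Prompt 和技能,也能做到基本治理), I mentioned that the cheapest layer of governance, a CLAUDE.md plus one question, "what's the commit SHA?", costs almost nothing. This is the reason: it moves the decision to before the task starts.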

Ungoverned cost has no ceiling

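Something I worked out later is that token cost really comes in two kinds, and the difference isn't how much, it's whether there's a ceiling.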

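Governed work has a ceiling. In the Agentic OS benchmark, the heaviest scenario, an architecture change coordinated across multiple agents, lands at roughly 60K tokens. You can complain that it's high, but it's hard to say it "has no ceiling": it's a number, it's known in advance, and it doesn't grow on its own while you aren't looking.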

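Work left to itself has no such ceiling. When the AI says "done" and you have nothing in hand to check, the expensive part isn't the tokens it spent saying so; it's all the later work built on that false completion. The first post in this series has a concrete example: the AI said it had implemented three modules, and two of them had never been touched. Reporting completion cost very few tokens. What ran away was every downstream step trusting the false report. A missing-evidence failure is actually the cheap kind, because you find out quickly; the expensive ones accumulate quietly, and by the time you notice you've gone a long way off course.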

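Seen this way, governance isn't really about spending fewer tokens. It's more like moving spend from the "no ceiling" column to the "has a ceiling" column.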

I thought slicing finer would save money. It didn't.

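Once I'd worked that out, I naturally arrived at a conclusion: if structure can box the cost in, then slice tasks as finely as possible. That intuition broke as soon as it met caching.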

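The cache math goes roughly like this (figures from the first half of 2026, and they keep moving): a cache read costs about a tenth of a normal input token, while a cache write costs more than a normal one, about 1.25× for the short window and about 2× for the long one. In other words, the discount only appears when you repeatedly read the same cache; every time you write a new one, you pay a premium.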
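To make that arithmetic concrete, here is a rough sketch using the multipliers quoted above, counted in units of one normal input token. The function names are mine and the multipliers are the early-2026 figures, so treat both as assumptions rather than anything a tool guarantees.

```python
# Multipliers relative to one normal (uncached) input token.
# Early-2026 figures as quoted in the text; assume they will change.
READ_MULT = 0.1
WRITE_SHORT_MULT = 1.25
WRITE_LONG_MULT = 2.0


def uncached_cost(prefix_tokens: int, uses: int) -> float:
    """Cost of sending the same prefix `uses` times with no caching."""
    return float(prefix_tokens * uses)


def cached_cost(prefix_tokens: int, uses: int,
                write_mult: float = WRITE_SHORT_MULT) -> float:
    """One cache write on the first use, then cache reads for the rest."""
    if uses <= 0:
        return 0.0
    return prefix_tokens * (write_mult + READ_MULT * (uses - 1))


def break_even_uses(write_mult: float) -> int:
    """Smallest number of uses at which caching beats not caching."""
    uses = 1
    while cached_cost(1000, uses, write_mult) >= uncached_cost(1000, uses):
        uses += 1
    return uses
```

Under these assumptions a prefix that is used only once is strictly more expensive when cached; the short window starts paying off on the second use and the long window on the third. The discount comes from the reads that follow a write.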

This flips "smaller is cheaper" on its head, because splitting and caching interact in two unfriendly ways:

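Sub-agents don't inherit the parent agent's cache by default. Each new sub-task usually has to pay again for a write of the prefix it needs. Split a task five ways and you may have bought five writes rather than amortizing one write across many cheap reads. (Some tools now offer a fork mode that lets a sub-agent reuse the parent's cache, which itself shows the cost is real enough to be worth engineering around.)
The cache has a TTL. The default window is short. If you slice the work so that the gap between steps exceeds it (a sub-agent that takes too long to return, a fan-out that stalls), the cache expires and the next step has to rebuild it at full price.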
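The first interaction can be sketched as a comparison between one context that keeps reading its own prefix and several sub-agents that each write a private copy. This is a simplified model under stated assumptions: no fork mode, every sub-agent needs the same prefix, and the steps divide evenly. The names and the multipliers are illustrative.

```python
READ_MULT = 0.1    # cache read, relative to a normal input token
WRITE_MULT = 1.25  # short-window cache write


def single_context_cost(prefix_tokens: int, steps: int) -> float:
    """One context: write the prefix once, read it on every later step."""
    return prefix_tokens * (WRITE_MULT + READ_MULT * (steps - 1))


def split_cost(prefix_tokens: int, steps: int, subagents: int) -> float:
    """Same steps spread evenly over sub-agents, each writing its own prefix.

    Assumes no cache sharing between sub-agents and that `steps` divides
    evenly by `subagents` with at least one step each.
    """
    steps_each = steps // subagents
    return subagents * single_context_cost(prefix_tokens, steps_each)
```

With a hypothetical 20,000-token prefix and ten steps, the single context comes to about 43,000 token-equivalents, while five sub-agents doing two steps each come to about 135,000. The work is the same; the difference is five writes instead of one.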

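So a task sliced too finely can end up costing more than the coarse version it replaced: more writes, fewer reads, more rebuilds. The same benchmark also confirms the opposite direction: a continuation model that loads once and reads from cache afterwards saved about half the execution cost compared with re-reading everything each time. What does the saving is cache continuity, not how finely the task is sliced.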

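What I do now isn't chase the finest split but look for the granularity that's just right: fine enough that the AI can't wander around inside it (non-determinism is boxed in), and coarse enough that I keep reading the same warm cache (rather than writing new cold ones). Too coarse, and error cost drifts toward unbounded; too fine, and cache cost climbs. The usable range is somewhere in the middle, and it moves as cache behavior changes, which happens fast.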

How I actually tier tasks

Honestly, the tiers in my head aren't very scientific. They go roughly like this:

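For a one-line change or a typo fix, I just do it, and for evidence I ask a single question. Only for a feature that touches several modules do I spell out the scope and have it plan before acting. Only when the architecture has to change, or several files are entangled with each other, do I consider multiple agents, and even then I stop first and ask myself whether the task can really run in parallel or whether I just want to look like I'm "dividing the labor".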
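Written down as code, those three tiers might look something like the sketch below. Everything here is hypothetical: the `Task` fields are intake signals I made up for illustration, not part of Agentic OS or any other tool, and the thresholds are as unscientific as the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    DIRECT = "just do it, ask one evidence question"
    PLANNED = "state the scope, plan before acting"
    MULTI_AGENT = "consider multiple agents"


@dataclass
class Task:
    # Hypothetical intake signals, filled in before any context is loaded.
    modules_touched: int
    changes_architecture: bool = False
    files_interdependent: bool = False
    truly_parallelizable: bool = False


def classify(task: Task) -> Tier:
    """Pick a tier at intake, before the expensive decisions get made."""
    heavy = task.changes_architecture or task.files_interdependent
    if heavy and task.truly_parallelizable:
        return Tier.MULTI_AGENT
    if heavy or task.modules_touched > 1:
        # Heavy but not parallel: a plan helps, extra agents would not.
        return Tier.PLANNED
    return Tier.DIRECT
```

The `truly_parallelizable` flag is the "stop and ask myself" step: an architecture change that can't actually run in parallel falls back to the planned tier instead of fanning out.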

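Most tasks actually land in the lightest tier. Trouble usually comes when I take a task that's really quite simple and over-package it because I "wanted to use the framework".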

Not settled yet

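Everything in this post is something I'm still adjusting, especially the caching part, which is a moving target: today's right granularity may look different a year from now. What I'm more sure of is the shape underneath: pay a cost you can calculate to fence in one you can't. As for where to draw that line, I'm still experimenting.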

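In the next post I want to talk about memory. Work Log: a cross-session memory mechanism (Work Log:跨 session 的記憶機制) covers how to keep state around when a task spans multiple sessions and the context keeps starting over.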

Agentic OS is an open-source project: github.com/KbWen/agentic-os

Token Economics of AI Agent Governance

KbWen — Mon, 25 May 2026 10:00:00 +0800

TL;DR: Governance tends to have a token cost you can put a ceiling on. Ungoverned work usually doesn’t; the recovery cost of an undetected error has no obvious upper bound. But the fix isn’t “split everything into the smallest possible pieces.” Caching changes the math: a fresh sub-context pays a cache-write premium and doesn’t inherit its parent’s cached prefix, so over-decomposition can cost more than the coarse version it replaced. The design target, at least with today’s caching, is the granularity that holds error cost and cache cost in check at the same time. It’s a moving target (pricing and model behavior change fast), so treat it as a way of thinking, not a fixed rule.

Most teams treat token spend the way they treat an electricity bill: it shows up after the fact, it’s mildly annoying, and it isn’t something they design around. That habit seems to be where a lot of the trouble starts. Token cost behaves less like a consequence of an agent setup and more like a property of it: decided, the way latency is, by the architecture rather than discovered in production.

Treating it as a design variable leads to two observations. One is reassuring: governance overhead is bounded. The other is the part that “just split the task” advice tends to skip past. Structure has its own cost curve, and past a point, more of it makes the bill worse, not better. Neither of these is a law. Pricing and model behavior in this space move quickly, and some of the specifics below will date. The shape of the trade-off is what seems to hold.

Token cost is a design variable, not an afterthought

Most of what sets a task’s token cost is decided at intake, not during execution. How a task is classified (how much context it loads, how many skills it probes, whether it spawns sub-agents) is largely fixed before any real work happens. By the time the number shows up, most of it was already committed.

That’s probably why “optimize tokens later” rarely gets far. The expensive choices tend to be made up front: reading every file instead of probing an index, loading a full skill definition instead of its metadata, running a continuation as a cold start. Those are hard to optimize away afterward; mostly you just classify better next time. An earlier post, Governing agents with prompts and skills alone, is partly an argument that the cheapest governance, a CLAUDE.md and one completion question, costs almost nothing, precisely because it moves the decision to intake.

Bounded cost versus unbounded cost

The asymmetry underneath all of this is between two kinds of cost.

Governed work tends to have a ceiling. In the Agentic OS lifecycle benchmark, the heaviest scenario, an architecture change coordinated across parallel agents, lands around 61K tokens. You can argue about whether that’s high. It’s harder to argue that it isn’t bounded: it’s a number, it was knowable in advance, and it doesn’t keep growing while you look away.

Ungoverned work has no equivalent ceiling, which is where the public cost stories cluster. In one widely reported case, a company that rolled a coding agent out to thousands of engineers reportedly ran through its 2026 AI budget within about four months; a number of cost write-ups collect cases in the same shape. The recurring observation is that a coding agent is one of the first tools where spend isn’t bounded by user intent: it generates as much as you let it.

The first post in this series has the small-scale version of the same shape: an agent reported implementing three modules, two of which were never touched. The tokens it spent saying so were trivial. The unbounded part is everything downstream that trusted the false report. A missing-evidence failure is, oddly, one of the cheaper ones; you tend to find out fast. The costly failures are the ones that compound quietly.

Governance, in this framing, isn’t really about spending fewer tokens. It reads more like a way to move spend out of the unbounded column and into the bounded one.

Granularity has a cost too: caching changes the math

The tempting next step is to decompose everything into the smallest possible units. That runs into where the cost actually lives once caching is involved.

A cache read costs roughly a tenth of a normal input token; a cache write costs more than one, on the order of 1.25× for the short window and 2× for the longer one as of early 2026 (Anthropic’s prompt caching docs carry the current multipliers, and they do move). The discount only shows up when you read the same cached prefix repeatedly. The premium is paid every time you write a new one.

That detail complicates the “smaller is cheaper” instinct, mostly because of two ways decomposition interacts with the cache:

Sub-agents don’t inherit the parent’s cache by default. Each fresh sub-context typically pays its own cache write for the prefix it needs. (Some setups now offer a fork mode that reuses the parent cache, which is itself a sign the cost was real enough to engineer around.) Split a task five ways and you may have bought five cache writes instead of amortizing one across many cheap reads.
The cache has a TTL. The default window is short. Fragment work so the gap between steps runs past it (a sub-agent that takes too long to return, a fan-out that stalls) and the cache can expire, so the next step rebuilds it at full price.
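The TTL interaction can be sketched as a small simulation over step timestamps. It assumes that each use of the cache refreshes its window and that a step arriving after the window has lapsed pays a fresh cache write; the multipliers are the early-2026 figures quoted above, and the function name is hypothetical.

```python
from typing import Optional, Sequence

READ_MULT = 0.1    # cache read, relative to a normal input token
WRITE_MULT = 1.25  # short-window cache write


def session_cost(prefix_tokens: int, step_times: Sequence[float],
                 ttl_seconds: float) -> float:
    """Cost of one cached prefix across steps at the given times (seconds).

    Assumes every use refreshes the TTL, and that a step arriving after
    the TTL has lapsed has to write the prefix to the cache again.
    """
    cost = 0.0
    last_use: Optional[float] = None
    for t in step_times:
        expired = last_use is None or (t - last_use) > ttl_seconds
        cost += prefix_tokens * (WRITE_MULT if expired else READ_MULT)
        last_use = t
    return cost
```

With an illustrative 300-second window and a 1,000-token prefix, three steps a minute apart cost roughly 1,450 token-equivalents, while the same three steps spaced 400 seconds apart cost roughly 3,750, because every step becomes a rebuild.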

So an over-decomposed task can end up costing more than the coarse one it replaced: more writes, fewer reads, more rebuilds. The same benchmark shows the other direction working: a continuation that loads context once and reads it from cache on later turns cut a feature’s execution cost by roughly half versus re-reading everything each time. Cache locality, more than task count, was doing the saving there.

The design target, at least with today’s caching, doesn’t seem to be “maximally fine.” It looks more like the granularity that contains non-determinism (pieces small enough that the agent can’t wander far) while keeping cache locality (pieces large enough that you keep reading one warm prefix instead of writing many cold ones). Too coarse and error cost drifts toward unbounded; too fine and cache cost climbs. The workable range sits somewhere in between, and it shifts as caching behavior changes, which it does, often.

The SLA/SLO parallel

This rhymes with the reasoning behind service-level objectives. You don’t set an SLO to make a system fast; you set it to make its behavior predictable, turning an open-ended risk into a budget you provisioned for on purpose. You provision a known amount of headroom and monitoring against an outage whose cost you can’t predict in advance.

Token budgeting for agents looks like the same move. The governance overhead (classification, an evidence check, scoped context) is the known cost you pay deliberately. What it fences in is the runaway: the silently compounding error, the cold-start rebuild loop, the helpful sub-agent that rewrote three files nobody asked about. The goal isn’t a smaller bill so much as a more predictable one.
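One way to picture the "known cost, paid on purpose" side of that trade is a hard ceiling that a task charges against, much like an error budget. This is a minimal sketch, not how Agentic OS or any particular tool implements it; `TokenBudget` and `BudgetExceeded` are hypothetical names.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spend past the ceiling."""


class TokenBudget:
    """A hard ceiling on token spend for one task (hypothetical helper)."""

    def __init__(self, ceiling: int) -> None:
        self.ceiling = ceiling
        self.spent = 0

    def charge(self, tokens: int) -> None:
        """Record spend; refuse the charge if it would cross the ceiling."""
        if self.spent + tokens > self.ceiling:
            raise BudgetExceeded(
                f"{self.spent + tokens} tokens would exceed ceiling {self.ceiling}"
            )
        self.spent += tokens

    @property
    def remaining(self) -> int:
        """Tokens still available before the ceiling is reached."""
        return self.ceiling - self.spent
```

The point of raising instead of logging is the same as with an SLO: the runaway case stops at a number chosen in advance, rather than at whatever the agent happened to generate.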

The cheapest place to start is also the easiest to overlook: a project memory file the model reads on every task. AGENTS.md (which began life in OpenAI’s Codex and is now read by tools like Cursor and GitHub Copilot) and CLAUDE.md (Anthropic’s project-memory convention) cost a few thousand tokens that the cache then serves cheaply for the rest of a session. Pair that with one question at completion (what artifact proves this is done?) and you have a usable floor of governance for almost nothing. Most of what sits above it is the same trade at larger scale.
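The completion question can be made mechanical in a very small way: refuse to accept "done" unless the report carries an artifact that could be checked. The sketch below uses a commit SHA as that artifact and only validates its shape; the `CompletionReport` type and `accept` function are hypothetical, and a real check would also confirm that the commit exists in the repository.

```python
import re
from dataclasses import dataclass
from typing import Optional

# 7 to 40 lowercase hex characters: the shape of an abbreviated or full SHA-1.
SHA_PATTERN = re.compile(r"[0-9a-f]{7,40}")


@dataclass
class CompletionReport:
    summary: str
    commit_sha: Optional[str] = None  # the artifact that proves the work


def accept(report: CompletionReport) -> bool:
    """Accept a completion claim only if it names a plausible commit SHA.

    This checks the shape of the evidence, not that the commit exists.
    """
    if report.commit_sha is None:
        return False
    return SHA_PATTERN.fullmatch(report.commit_sha) is not None
```

A report that says "implemented three modules" with no SHA is rejected before anything downstream gets to trust it, which is the cheap failure rather than the quiet one.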

None of this feels settled. The caching math especially is a moving target, and the right granularity a year from now may not look like today’s. What seems stable underneath is the shape: a known cost, paid on purpose, to fence in one you can’t predict.

This post is part of a series on building real AI systems. Earlier posts: Why AI Agents Go Wrong: It’s Not the Model, Prior Art: What Distributed Systems Already Knows, and No Evidence, No Completion. A Chinese companion piece, Token 成本的真相:分級,但別分太細, takes the same topic from a more first-person angle. The framework is open source at github.com/KbWen/agentic-os.