AI Systems on KbWen Blog

MCP 資安危機：問題不在協定，而在治理

KbWen — Mon, 25 May 2026 15:00:00 +0800

TL;DR： MCP 在一年多內，從 Anthropic 的內部實驗變成 AI 業界共通的介面。但進入 2026 年，資安研究員一個接一個把它拆開：官方 SDK 的 by-design RCE、tool poisoning、rug pull。我的看法是，這些漏洞大多不是協定的 bug，而是「把能力交出去、卻沒把治理一起交出去」的必然結果。現在大家急著補的那些東西，OAuth scope、人工確認、伺服器註冊表，其實就是治理被重新貼回協定上。

2026 年 4 月，資安團隊 OX Security 公布了一個發現：MCP 的官方 SDK（Python、TypeScript、Java、Rust 全中）存在一條從設定檔直接到指令執行的路徑，攻擊者可以在任何跑著有問題實作的機器上執行任意系統指令。根據他們的估算，受影響的套件下載量超過 1.5 億次，潛在波及的伺服器實例上看 20 萬個（The Register 的報導用的標題就是「20 萬台伺服器有風險」）。後續整個生態跟著冒出一連串 CVE，包括 MCP Inspector 的 CVE-2025-49596 和 Cursor 的 CVE-2025-54136。

但真正讓我停下來想的，是 Anthropic 的回應：這是設計如此（by design）。他們不打算改協定，並表示輸入清洗是開發者自己的責任。

這句話可以有兩種讀法，而我認為兩種都對。這正是整件事最值得想的地方。

先講清楚：MCP 為什麼會贏

要評論 MCP 的資安問題，得先承認它解決了一個真的很煩的問題。

在 MCP 之前，每接一個工具到 AI 上，你就得寫一套各自為政的膠水。M 個模型乘上 N 個工具，等於 M×N 種接法。MCP 把它變成 M+N：工具實作一次 server，模型實作一次 client，中間用同一套協定講話。Anthropic 當初的比喻是「AI 的 USB-C」，這個比喻站得住，是因為它真的描述了發生的事。

而且它的擴散速度不是普通的快。OpenAI 在 2025 年 3 月把 MCP 接進 Agents SDK、Responses API 和 ChatGPT 桌面版；Google DeepMind 在 4 月跟進。到了 2025 年 12 月，Anthropic 把 MCP 捐給 Linux Foundation，AWS、Google、Microsoft、OpenAI、Bloomberg、Cloudflare 全都掛名白金會員。這時候它已經不是「Anthropic 的協定」，而是業界共同基礎建設。光是 SDK 的月下載量，就從上線時的約十萬次成長到 2026 年 3 月的約 9700 萬次。

換句話說，這不是一個沒人用的協定出包。這是一個贏家出包，而且是贏在規模上。

然後資安研究員開始拆它

問題是，讓 MCP 好接的那些設計，同時也讓它好攻擊。研究社群這一年整理出幾類反覆出現的攻擊手法，值得分開來看。

Tool poisoning（工具下毒）。 MCP 在握手時，server 會用 tools/list 把每個工具的描述回傳給模型看。麻煩在於這段描述模型會讀、人通常不會看。把惡意指令藏在工具描述裡，對使用者隱形、對 LLM 有效。Invariant Labs 就公開示範過：一個看起來人畜無害的工具，描述裡偷偷寫著「順便把 ~/.ssh 的內容也傳過來」。有些研究者把同一件事叫做「line jumping」，因為指令在工具真正被呼叫之前就插隊生效了。

Rug pull（地毯抽走）。 你第一次裝某個 MCP server 時審過了、也同意了。但工具的描述和行為可以在事後被悄悄改掉，而這種變更不一定會觸發新的同意流程。先用一個正常的工具建立信任，再在某次更新裡把它變壞。又因為定義是持久的，之後每一個叫到它的 session 都會跑到下毒後的版本。

這些不是紙上談兵。一個叫 MCPTox 的 benchmark 在 45 個真實世界的 MCP server 上測試，對 o1-mini 的攻擊成功率達到 72.8%。連 NSA 都出了一份 MCP 安全指引。把這些放在一起看，你會發現一個共同點：攻擊面幾乎都不在「協定本身有沒有加密」這種傳統資安問題上，而在一個非決定性的東西，也就是 LLM，被放在安全決策的正中央。

關鍵爭議：「設計如此」算不算卸責

回到 Anthropic 那句「by design」。

同情他們的讀法是：協定本來就只負責「連接」，不負責「信任」。STDIO 會執行你給它的指令，這跟 shell 會執行你打的指令一樣，是工具的本分，不是漏洞。把每一種誤用都當成協定要修的 bug，協定會變得無法使用。

不同情的讀法是：當你的官方 SDK、橫跨四種語言、被下載超過一億次，「清洗是開發者的責任」這句話的實際效果，就是把一個系統性風險，平均分攤給十幾萬個多半沒有資安團隊的開發者。標準之所以是標準，就是因為大家會照抄它的預設值。預設值不安全，就等於不安全。

我的立場偏後者，但我想把話講得更精確一點：這兩種讀法其實沒有矛盾。協定確實只該負責連接；但問題就在於，整個生態把「連接的標準」誤當成了「信任的標準」。

我的看法：這不是協定的 bug，是治理的缺口

如果你讀過這個部落格之前談 AI 代理的幾篇，這個結論不會讓你意外。

我一直在寫的一件事是：AI 代理出事，通常不是模型不夠強，而是結構不夠。MCP 的資安危機是同一個故事的放大版。tool poisoning 本質上就是 prompt injection；我們在談代理常見痛點時就提過，安全研究員花 500 美元就能讓 Devin 透過 GitHub issue 執行範圍外的操作，成功率八成以上。rug pull 本質上則是範圍失控，加上沒有 evidence trail：沒有人能指著說「這個工具上週的定義長這樣、這週變成那樣」。

換句話說，MCP 沒有發明新的問題。它把幾個老問題（盲目信任工具輸出、範圍蔓延、缺乏可追溯性），用一個標準協定，以一億次下載的規模工業化了。

能力交出去了，治理沒有跟上。這就是缺口。

我在從「下指令」到「蓋系統」那篇講過，prompt 的問題不在 prompt 本身，而在它沒有結構。MCP 是同一個層級往上的版本：協定的問題不在協定本身，而在大家以為「有了標準介面」就等於「有了安全保證」。連接的標準，不是信任的標準。

那正在補的東西，其實就是治理

最能佐證這個判斷的，是大家現在拿來補洞的工具。

2026 年的 MCP 規格往哪個方向走？OAuth 2.1（強制 PKCE、禁掉 implicit grant），把 server 定位成 OAuth resource server；加上 incremental scope consent，讓 client 每次只要最小權限；用 resource indicator 綁定 token，避免被挪用；規格明文要求 human-in-the-loop，對破壞性操作要有風險標註和確認流程；以及用一個受治理的中央註冊表，讓工具註冊一次、政策定義一次。

把這串東西念一遍，你會發現它根本不是在修一個協定 bug。它是把最小權限、人工核可、可追溯性這些老牌的治理原則，一條一條重新貼回協定上。這正是我說「漏洞是治理缺口」的證據：補丁的形狀，就長得跟治理一模一樣。

這跟只用 Prompt 和技能也能做基本治理那篇的精神是一致的。治理不一定要很重，但它不能是零。

給實作者的幾條原則

如果你今天就在用 MCP（用 Claude Code、Cursor，或任何接了 MCP 的工具，你大概就在用），我會建議幾件事。

把每一個 MCP server 當成不可信的輸入來對待。這跟我們對待 LLM 輸出的態度應該一樣：預設不信，要有東西可查才信。一個第三方 server 的工具描述，跟一段使用者貼進來的文字，威脅等級是一樣的。

權限給到剛好夠用就好。如果一個查 wiki 的 server 要求能讀你整個檔案系統，那不是功能，是紅旗。

對第三方 server 釘版本、留審計。rug pull 的前提是「悄悄更新」，那就讓更新沒辦法悄悄：固定版本、記錄工具定義的變更。

破壞性操作一定留一個人在迴圈裡。刪檔、送外部請求、動資料庫這類不可逆的動作，不要讓 agent 自己決定。

能用第一方就用第一方。生態裡有上萬個社群 server，方便，但「方便」正是 rug pull 和 tool poisoning 的溫床。

最後

MCP 會留下來，這點我沒什麼懷疑。它解決的問題太真實，網路效應也已經成形。但我希望大家別把「它贏了」誤讀成「它安全了」。

一個協定能不能成為標準，看的是它好不好接；一個系統安不安全，看的是它有沒有治理。這是兩件事。2026 年這一連串的漏洞，與其說是 MCP 的失敗，不如說是一次提醒：當我們把越來越多能力交給 AI 去調用，真正要補上的從來不是更聰明的模型，而是更紮實的結構。

English version: MCP Security Isn’t a Protocol Bug. It’s a Governance Problem.

MCP Security Isn't a Protocol Bug. It's a Governance Problem.

KbWen — Mon, 25 May 2026 14:30:00 +0800

TL;DR: MCP went from an Anthropic side-project to the industry’s default agent-to-tool interface in about a year. Then 2026 brought a steady drip of disclosures: a by-design RCE in the official SDKs, tool poisoning, rug pulls. My read is that almost none of these are protocol bugs. They’re what happens when you ship capability without shipping governance, and the patches now landing (OAuth scopes, human-in-the-loop, registries) are just governance being bolted back on.

In April 2026, the security firm OX Security disclosed a path in MCP’s official SDKs (Python, TypeScript, Java, and Rust) that runs straight from configuration to command execution, letting an attacker run arbitrary OS commands on any machine hosting a vulnerable implementation. By their count it touched packages with over 150 million downloads and up to 200,000 server instances; The Register ran it as “200k servers at risk.” Downstream CVEs followed across the ecosystem, including CVE-2025-49596 in MCP Inspector and CVE-2025-54136 in Cursor.

Anthropic’s response is where it gets interesting: this is by design. They declined to change the protocol and said input sanitization is the developer’s responsibility.

That sentence has two valid readings. I think both are correct, and that’s the most interesting thing about this whole episode.

First, give MCP its due

You can’t fairly critique MCP’s security without admitting it solved a genuinely annoying problem.

Before MCP, every tool you wired into an AI needed its own bespoke glue. M models times N tools is M×N integrations. MCP turns that into M+N: each tool implements a server once, each model implements a client once, and they speak one protocol in between. Anthropic’s original pitch was “USB-C for AI,” and the analogy holds because it describes what actually happened.

And it spread fast. OpenAI wired MCP into its Agents SDK, Responses API, and ChatGPT desktop in March 2025; Google DeepMind followed in April. By December 2025 Anthropic had donated MCP to the Linux Foundation with AWS, Google, Microsoft, OpenAI, Bloomberg, and Cloudflare as platinum backers. At that point it stopped being “Anthropic’s protocol” and became shared infrastructure. SDK downloads went from roughly 100,000 a month at launch to about 97 million by March 2026. The case for why it won is well argued in The New Stack’s writeup.

So this isn’t an unused protocol failing in obscurity. It’s the winner failing, at scale.

Then the security researchers took it apart

The trouble is that the design choices that made MCP easy to adopt also made it easy to attack. A few attack classes kept recurring this year, and they’re worth separating:

Tool poisoning. During the handshake, an MCP server returns each tool’s description to the model via tools/list. The model reads that description; the human usually doesn’t. Hide an instruction in the description and it’s invisible to the user but live to the LLM. Invariant Labs demonstrated a benign-looking tool whose description quietly asked the agent to also send along the contents of ~/.ssh. Some researchers call the same move “line jumping,” because the instruction takes effect before the tool is ever actually called.

Rug pull. You reviewed and approved a server the first time you installed it. But a tool’s description and behavior can be changed afterward, and that change doesn’t necessarily trigger a fresh approval. Build trust with a clean tool, then turn it malicious in an update. Because the definition is persistent, every later session that calls it runs the poisoned version.

These aren’t hypothetical. A benchmark called MCPTox hit a 72.8% attack success rate against o1-mini across 45 real-world MCP servers. The NSA published its own MCP security guidance. Put it together and the common thread is clear: the attack surface isn’t really about whether the transport is encrypted. It’s about a non-deterministic actor, the LLM, sitting in the middle of security-critical decisions.

The real argument: is “by design” a cop-out?

Back to Anthropic’s “by design.”

The sympathetic reading: a protocol’s job is connection, not trust. STDIO executes the command you hand it the same way a shell executes what you type. That’s the tool doing its job, not a vulnerability. Treat every misuse as a protocol bug to fix and the protocol becomes unusable.

The unsympathetic reading: when your official SDK, across four languages, has been downloaded more than 150 million times, “sanitization is the developer’s responsibility” effectively spreads a systemic risk across a hundred-thousand-plus developers, most of whom have no security team. A standard becomes a standard precisely because people copy its defaults. If the default isn’t safe, it isn’t safe.

I lean toward the second, but let me be precise: the two readings don’t actually contradict each other. A protocol should only be responsible for connection. The problem is that the ecosystem mistook a standard for connection for a standard for trust.

My take: this is a governance gap, not a protocol bug

If you’ve read what I’ve written here about agents before, this won’t surprise you.

The thing I keep coming back to is that when AI agents go wrong, it’s usually not the model — it’s the missing structure. The MCP security story is that argument at scale. Tool poisoning is prompt injection. A rug pull is scope creep plus the absence of an evidence trail: nobody can point and say “here’s what this tool’s definition was last week, and here’s what it is now.”

MCP didn’t invent a new class of problem. It took a few old ones (blindly trusting tool output, scope creep, no traceability) and industrialized them behind a single standard, at a hundred million downloads.

The capability got handed over; the governance didn’t come with it.

This is the same shape as the no evidence, no completion principle: a thing isn’t trustworthy because it claims to be, it’s trustworthy because you can check it. A standard interface tells you how to connect. It tells you nothing about whether to trust what’s on the other end.

The patches landing now are just governance

The strongest evidence for that read is the toolkit people are reaching for to fix it.

Where is the 2026 MCP spec heading? OAuth 2.1 (mandatory PKCE, no implicit grant), with servers acting as OAuth resource servers. Incremental scope consent, so a client requests only the minimum access per operation. Resource indicators, so a token can’t be reused where it shouldn’t. An explicit human-in-the-loop requirement, with risk annotations and approval flows for destructive operations. And a governed central registry, so tools are registered once and policy is defined once (the MCP authorization spec tracks most of this).

Read that list back and none of it is a protocol patch. It’s least privilege, human approval, and traceability — old governance principles, being reattached to the protocol one at a time. When the remedy looks exactly like governance, that tells you what was missing in the first place.

A few rules if you’re shipping on MCP

If you’re using MCP today (and if you use Claude Code, Cursor, or anything with MCP wired in, you probably are), here’s what I’d do.

Treat every MCP server as untrusted input. Same posture as LLM output: don’t trust by default, trust when you can verify. A third-party server’s tool description deserves the same suspicion as text a user pasted in.

Grant the minimum scope that works. If a wiki-lookup server asks to read your whole filesystem, that’s not a feature, it’s a red flag.

Pin versions and audit third-party servers. A rug pull depends on a silent update, so don’t let updates be silent. Pin versions, log changes to tool definitions.

Keep a human in the loop for anything irreversible. Deletes, outbound requests, database writes: don’t let the agent decide those alone.

Prefer first-party servers. There are tens of thousands of community servers and they’re convenient. “Convenient” is exactly the soil rug pulls and tool poisoning grow in.

The takeaway

MCP is going to stick around. The problem it solves is too real and the network effects are already in place. But don’t read “it won” as “it’s safe.”

What makes a protocol a standard is how easy it is to connect to. What makes a system safe is whether it’s governed. Those are different questions. The 2026 disclosures aren’t really MCP failing so much as a reminder: as we hand more and more capability to AI to invoke on our behalf, the thing we actually need to add was never a smarter model. It’s sturdier structure.

中文版本：MCP 資安危機：問題不在協定，而在治理

Skill 邊界設計:從能力到合約

KbWen — Mon, 25 May 2026 13:00:00 +0800

TL;DR: 一個 skill 會多可預測,大概就看它的邊界劃得多清楚。把它當成「能力」(這個 skill 讓 AI 會做 X),它容易亂跑;把它當成「合約」(講好輸入、輸出、以及它不會碰什麼),它就比較像一個設計良好的 API。重點不是寫更多「要小心」,而是把它能碰的範圍框住。

我有個 skill,前一天還用得好好的,隔天就開始亂搞:我要它做 A,它順手把旁邊的 B 也「幫我」改了。我第一個反應是怪模型今天狀況不好。後來才想通,問題在我自己——當初寫這個 skill 的時候,我只說了它「會做什麼」,沒說它「不能碰什麼」。

這是英文版的中文對照,英文那篇用 API 設計的角度談;這篇是我自己怎麼從「能力」這個想法,慢慢搬到「合約」這個想法的過程。

能力清單 vs 合約

我們很習慣用「能力」來描述一個 skill:「這個 skill 讓 AI 會跑測試」「這個會幫我部署」。這種講法很自然,也很容易讓你之後被嚇到。因為「會跑測試」這句話沒有邊。它沒講會讀什麼、會動什麼、遇到不在預期內的狀況時會怎麼處理。

合約天生就有邊。它講清楚什麼東西進去、什麼東西出來、以及哪些地方它不會碰。框架文件裡有一句講得比我直接:skill 應該被當成「有版本、受政策約束、能力被框住的封裝」,而不是「一堆鬆散的 prompt 檔」。這跟一個 AI Skill 和 Prompt 到底差在哪講的是同一件事:重點不是 AI「能」做那件事,而是那件事被「講清楚」了。

合約先行:先想清楚它不該做什麼

我看過一個還算貼切的比喻:skill 像是你交給一位很強的廚師的食譜。食譜提供的是結構:材料、順序、限制;廚師提供的是判斷,什麼時候醬汁該再收一下、什麼時候可以換個材料。你不會因為廚師很強,就把食譜寫成「做一道好吃的菜」。

skill 也一樣。模型本身的判斷力很好,所以你要補的不是判斷,是結構。而結構裡最常被漏掉的,就是那條「不要碰」的線。我現在寫一個 skill,會先問自己一個問題:這個 skill 最不該做的事是什麼? 把那條線寫進去,比再多寫三條「請謹慎處理」都有效。

邊界鬆掉,其實是一次沒講的破壞性變更

框架裡有一個原則我覺得很受用:寧可在邊界上把關,也不要去微管模型怎麼想。一個會亂跑的 skill,你通常修不好它的「想法」;你能做的是把它能碰的範圍(哪些檔案、哪些工具、多少預算)框起來。

一個 skill 的範圍如果隨著時間悄悄變大,那其實是你發佈了一次「破壞性變更」卻沒有改版本號。而呼叫它的人(就是你自己)會用最經典的方式發現這件事:在出事的時候。這也是AI 代理常見痛點裡那個「能力邊界」缺口最貴的一種表現形式。範圍,只是它最容易爆出來的地方。

我把一個 skill 從能力改成合約的前後

回到開頭那個亂改 B 的 skill。它原本的描述大概是「整理這個模組的程式碼」。很開放,聽起來很厲害,結果就是它對「整理」的理解跟我不一樣。

我後來把它改寫成比較像合約的樣子:輸入是「指定的那幾個檔案」,輸出是「格式化後的同一批檔案 + 一份它改了什麼的清單」,然後明確寫上「不要新增或刪除檔案、不要碰指定範圍以外的東西」。改完之後它沒有變笨,只是不再自作主張。差別不在能力,在邊界。

順帶一提,邊界清楚的 skill 通常也比較便宜。skill 是漸進載入的:AI 先讀那一小段 metadata 判斷現在用不用得上,真的要用才載入完整內容。一個邊界小而清楚的 skill,光看它的「合約」就能被快速略過;一個什麼都做的 skill,得把整包拖進 context 才發現其實不該用它。這跟我在 Token 成本那篇講的是同一個方向:清楚的小合約,既好預測也比較省。

還在摸索

老實說,一個 skill 該有的合約長什麼樣,我很少一開始就想對。通常是看它在哪裡越界,再把線畫在那裡。但這個框架本身,先講好它承諾什麼、它不碰什麼,然後把對這兩者的任何更動都當成一次正式的改版,到目前為止站得住。

Agentic OS 是開源專案:github.com/KbWen/agentic-os

Skill Design as Interface Design

KbWen — Mon, 25 May 2026 12:00:00 +0800

TL;DR: An agent skill behaves predictably to about the degree its boundary is specified. Described as a capability (“the agent can now do X”), a skill tends to drift. Described as a contract (declared inputs, declared outputs, a scope it promises not to exceed), it behaves more like a well-designed API. The interface-design habits engineers already have (stable contracts, explicit scope, versioning) seem to transfer directly. The framework’s own direction points the same way: skills as versioned, capability-bounded packages, with boundary enforcement instead of micromanaging how the model reasons.

Every team that has shipped a public API has lived through the same lesson: an endpoint that does “roughly what you’d expect” is a liability. The contract (what it accepts, what it returns, what it won’t touch) is the product. The implementation behind it is replaceable. Agent skills seem to be arriving at the same lesson, just faster.

A skill, in Anthropic’s Agent Skills standard (now shared across Claude Code, Codex, and Cursor), is a SKILL.md file: instructions plus frontmatter that controls when it loads and who invokes it. That format makes it tempting to think of a skill as “a prompt the agent can reuse.” That framing is where a lot of the unpredictability starts.

A skill is a contract, not a capability list

The usual way to describe a skill is by capability: “this one runs the test suite,” “this one deploys.” It reads naturally and it sets you up to be surprised. A capability has no edges. “Can deploy” says nothing about what it will read, what it will change, or what it will do when the situation doesn’t match the happy path.

A contract has edges by construction. It names what goes in, what comes out, and what stays out of reach. The framework’s own architecture notes put it more bluntly than I would: skills should be treated as “versioned, policy-governed, capability-bounded packages,” not “loose prompt files.” The same idea runs through what makes a skill different from a prompt: the value is in the specification, not the raw capability.

The principles that transfer

If a skill is an interface, then the design vocabulary engineers already use mostly carries over:

Interface concept	Skill design equivalent
Request / response schema	Declared inputs and declared outputs
Endpoint scope (touches these resources, not those)	Tool, path, and network boundaries
Versioning (semver)	A skill version, so callers know what they’re getting
Breaking change	Scope drift: the skill quietly starts doing more
Deprecation	An explicit lifecycle state, not silent rot

None of this is novel to anyone who has designed an interface. The only move is recognizing that a skill is one, and that “be more careful” is not a substitute for a specified boundary—the same way “use the API responsibly” was never a substitute for a schema.

Scope drift is a breaking change

The most common failure I see isn’t a skill that can’t do its job. It’s a skill that does its job plus two things nobody asked for, because its scope was never drawn.

The framework’s design principle for this is worth borrowing: boundary enforcement over behavior micromanagement. You don’t fix a vague skill by adding more instructions about how to reason well. You constrain what it can reach—which files, which tools, which budget. A skill whose scope widens over time is shipping a breaking change with no version bump, and the caller (you) finds out the way callers always find out: in production. This is the same capability-boundary gap behind several of the common agent pitfalls; scope is just the place it shows up most expensively.

The boundary has a token price too

There’s a second reason to draw the edge tightly, and it connects to cost. Skills load progressively: the agent reads a skill’s small metadata to decide whether it’s relevant, and only pulls the full body when it is. A tightly-bounded skill is cheaper to ignore: its contract is small and quick to probe, where a sprawling one drags its whole body into context to find out it didn’t apply. Interface discipline and token discipline end up pointing in the same direction: a clear, small contract is both more predictable and less expensive.

A skill with a fuzzy boundary isn’t really a capability. It’s an undocumented API, and it will surprise you for the same reasons undocumented APIs always have. I don’t think the right contract for a given skill is obvious up front—I usually find it by watching where a skill oversteps and drawing the line there. But the framing has held up: define what it promises, define what it won’t touch, and treat a change to either as the breaking change it is.

This post is part of a series on building real AI systems. Related: What Makes an AI Skill Different from a Prompt? and Beyond Prompts: From Giving Instructions to Building Systems. A Chinese companion piece on skill boundaries is Skill 邊界設計:從能力到合約. The framework is open source at github.com/KbWen/agentic-os.

Token 成本的真相:分級,但別分太細

KbWen — Mon, 25 May 2026 11:00:00 +0800

TL;DR: 把 token 當成設計變數,不是月底才看的帳單。沒有治理的任務成本沒有上限;但反過來,把任務切得愈細也不會愈省——subagent 不共享快取、TTL 一過就重建,過度切分反而更貴。真正要找的是「對的顆粒度」:夠細到 AI 不會亂跑,夠粗到能一直讀同一份熱快取。這些數字會隨工具一直變,當概念看就好。

有一陣子我根本不看 token 花多少。直到某次一個自動重構的任務跑了大半個晚上,我隔天看用量才意識到一件事:這東西的成本不是「用完才知道」,是我在開始之前就決定好的。

這篇是英文版的中文對照,但角度不太一樣。英文那篇用比較分析的方式談「為什麼治理的成本是可預測的」;這篇是我自己一路試出來的版本,包括一個我原本以為對、後來發現錯的直覺。

Token 成本是開始任務前就決定的

一個任務會花多少 token,大部分在它「被分類」的那一刻就定了:要載入多少 context、要不要去翻所有技能文件、要不要再開 subagent。等你看到數字,數字早就花掉了。

所以「之後再來省 token」通常沒什麼用。貴的決定都在前面:把整個 codebase 讀一遍而不是先看索引、載入完整的技能定義而不是它的 metadata、把一個本來可以接續的工作當成全新的冷啟動重跑一次。這些事後很難補救,你能做的是下次分類分得好一點。

只用 Prompt 和技能,也能做到基本治理裡我提過,最便宜的那層治理,一個 CLAUDE.md 加一句「commit SHA 是什麼」,幾乎不花錢,原因就在這裡:它把決定提前到任務開始之前。

粗放的成本沒有上限

我後來想通的一件事是,token 成本其實分成兩種,差別不在多寡,在有沒有天花板。

有治理的工作有上限。Agentic OS 的 benchmark 裡最重的情境,一個跨多代理協作的架構變更,大約落在 6 萬 token 上下。你可以嫌它高,但很難說它「沒有上限」:它是個數字,事前就知道,而且你不看它的時候它不會自己長大。

放著不管的工作沒有這個上限。當 AI 說「好了」而你手上沒有任何可以查的東西,真正貴的不是它說那句話的 token,是後面所有建立在這個假完成上的工作。這個系列的第一篇有個具體例子:AI 說它實作了三個模組,其中兩個根本沒動過。它報告完成花的 token 很少,失控的是下游每一步都信了這份假報告。少了 evidence 的失敗反而是便宜的那種,你很快會發現;貴的是那種安靜地累積、等你發現已經繞很遠的。

治理在這個角度下,其實不是為了少花 token。它比較像是把花費從「沒上限」那一欄,搬到「有上限」那一欄。

我原本以為切細一點就省,結果不是

想通上面那點之後,我自然得出一個結論:既然結構能把成本框住,那就把任務切到最細吧。這個直覺在遇到快取之後就破了。

快取的算式大概是這樣(2026 上半的數字,而且它一直在動):讀快取大約是正常 input token 的十分之一,寫快取反而比正常還貴,短窗約 1.25 倍、長窗約 2 倍。也就是說,折扣只有在你「重複讀同一份快取」時才出現;每寫一份新的,你都在付溢價。

這就把「愈小愈省」整個翻過來,因為切分和快取有兩個很不友善的互動:

subagent 預設不繼承父代理的快取。 每開一個新的子任務,它通常要為自己需要的前綴重新付一次寫入。一個任務切五份,你可能是買了五次寫入,而不是把一次寫入攤平成很多次便宜的讀取。(現在有些工具開始提供 fork 模式讓子代理重用父快取——這件事本身就說明這個成本真實到值得工程繞過。)
快取有 TTL。 預設窗口很短。如果你把工作切到步驟之間的間隔超過它(某個 subagent 跑太久才回來、某個 fan-out 卡住),快取就過期了,下一步只能用全價重建。

所以一個切太細的任務,最後可能比它取代的那個粗版本更貴:更多寫入、更少讀取、更多重建。同一份 benchmark 也驗證了反方向:一個「載入一次、之後都從快取讀」的接續模型,比每次都整份重讀省了大約一半的執行成本。真正在省的是快取的連續性,不是任務切得多細。

我現在的做法不是追求「最細」,而是去找那個剛好的顆粒度:細到 AI 不會在裡面亂跑(非確定性被框住),又粗到我能一直讀同一份熱快取(不是一直寫新的冷快取)。太粗,錯誤成本會往沒上限的方向飄;太細,快取成本會爬上來。可用的範圍在中間某處,而且它會隨快取行為改變而移動——這部分變得很快。

我實際上怎麼分級

老實說,我心裡的分級沒有很科學,大概是這樣:

改一行、修個 typo 這種,我直接做,連 evidence 都只問一句。一個碰到多個模組的功能,我才會交代清楚範圍、要它先計畫再動手。需要動到架構、或要跨好幾個檔案彼此牽動的,我才會考慮多代理——而且會先停下來問自己,這個任務真的能平行嗎,還是我只是想看起來有在「分工」。

大部分任務其實落在最輕那一級。會出事的,通常是我把一個其實很單純的任務,因為「想用框架」而過度包裝的時候。

還沒定論

這篇講的東西我都還在調整,尤其快取那段,根本是個移動標靶——今天對的顆粒度,一年後可能就不一樣了。我比較有把握的是底下那個形狀:付一個你算得出來的成本,去框住一個你算不出來的成本。至於那條線畫在哪,我也還在試。

下一篇,我想談談記憶:Work Log:跨 session 的記憶機制講的就是當任務跨越多個 session、context 一直重來時,要怎麼把狀態留下來。

Agentic OS 是開源專案:github.com/KbWen/agentic-os

Token Economics of AI Agent Governance

KbWen — Mon, 25 May 2026 10:00:00 +0800

TL;DR: Governance tends to have a token cost you can put a ceiling on. Ungoverned work usually doesn’t; the recovery cost of an undetected error has no obvious upper bound. But the fix isn’t “split everything into the smallest possible pieces.” Caching changes the math: a fresh sub-context pays a cache-write premium and doesn’t inherit its parent’s cached prefix, so over-decomposition can cost more than the coarse version it replaced. The design target, at least with today’s caching, is the granularity that holds error cost and cache cost in check at the same time. It’s a moving target (pricing and model behavior change fast), so treat it as a way of thinking, not a fixed rule.

Most teams treat token spend the way they treat an electricity bill: it shows up after the fact, it’s mildly annoying, and it isn’t something they design around. That habit seems to be where a lot of the trouble starts. Token cost behaves less like a consequence of an agent setup and more like a property of it: decided, the way latency is, by the architecture rather than discovered in production.

Treating it as a design variable leads to two observations. One is reassuring: governance overhead is bounded. The other is the part that “just split the task” advice tends to skip past. Structure has its own cost curve, and past a point, more of it makes the bill worse, not better. Neither of these is a law. Pricing and model behavior in this space move quickly, and some of the specifics below will date. The shape of the trade-off is what seems to hold.

Token cost is a design variable, not an afterthought

Most of what sets a task’s token cost is decided at intake, not during execution. How a task is classified (how much context it loads, how many skills it probes, whether it spawns sub-agents) is largely fixed before any real work happens. By the time the number shows up, most of it was already committed.

That’s probably why “optimize tokens later” rarely gets far. The expensive choices tend to be made up front: reading every file instead of probing an index, loading a full skill definition instead of its metadata, running a continuation as a cold start. Those are hard to optimize away afterward; mostly you just classify better next time. Governing agents with prompts and skills alone is partly an argument that the cheapest governance, a CLAUDE.md and one completion question, costs almost nothing, precisely because it moves the decision to intake.

Bounded cost versus unbounded cost

The asymmetry underneath all of this is between two kinds of cost.

Governed work tends to have a ceiling. In the Agentic OS lifecycle benchmark, the heaviest scenario, an architecture change coordinated across parallel agents, lands around 61K tokens. You can argue about whether that’s high. It’s harder to argue that it isn’t bounded: it’s a number, it was knowable in advance, and it doesn’t keep growing while you look away.

Ungoverned work has no equivalent ceiling, which is where the public cost stories cluster. In one widely reported case, a company that rolled a coding agent out to thousands of engineers reportedly ran through its 2026 AI budget within about four months; a number of cost write-ups collect cases in the same shape. The recurring observation is that a coding agent is one of the first tools where spend isn’t bounded by user intent: it generates as much as you let it.

The first post in this series has the small-scale version of the same shape: an agent reported implementing three modules, two of which were never touched. The tokens it spent saying so were trivial. The unbounded part is everything downstream that trusted the false report. A missing-evidence failure is, oddly, one of the cheaper ones; you tend to find out fast. The costly failures are the ones that compound quietly.

Governance, in this framing, isn’t really about spending fewer tokens. It reads more like a way to move spend out of the unbounded column and into the bounded one.

Granularity has a cost too: caching changes the math

The tempting next step is to decompose everything into the smallest possible units. That runs into where the cost actually lives once caching is involved.

A cache read costs roughly a tenth of a normal input token; a cache write costs more than one, on the order of 1.25× for the short window and 2× for the longer one as of early 2026 (Anthropic’s prompt caching docs carry the current multipliers, and they do move). The discount only shows up when you read the same cached prefix repeatedly. The premium is paid every time you write a new one.

That detail complicates the “smaller is cheaper” instinct, mostly because of two ways decomposition interacts with the cache:

Sub-agents don’t inherit the parent’s cache by default. Each fresh sub-context typically pays its own cache write for the prefix it needs. (Some setups now offer a fork mode that reuses the parent cache, which is itself a sign the cost was real enough to engineer around.) Split a task five ways and you may have bought five cache writes instead of amortizing one across many cheap reads.
The cache has a TTL. The default window is short. Fragment work so the gap between steps runs past it (a sub-agent that takes too long to return, a fan-out that stalls) and the cache can expire, so the next step rebuilds it at full price.

So an over-decomposed task can end up costing more than the coarse one it replaced: more writes, fewer reads, more rebuilds. The same benchmark shows the other direction working: a continuation that loads context once and reads it from cache on later turns cut a feature’s execution cost by roughly half versus re-reading everything each time. Cache locality, more than task count, was doing the saving there.

The design target, at least with today’s caching, doesn’t seem to be “maximally fine.” It looks more like the granularity that contains non-determinism (pieces small enough that the agent can’t wander far) while keeping cache locality (pieces large enough that you keep reading one warm prefix instead of writing many cold ones). Too coarse and error cost drifts toward unbounded; too fine and cache cost climbs. The workable range sits somewhere in between, and it shifts as caching behavior changes, which it does, often.

The SLA/SLO parallel

This rhymes with the reasoning behind service-level objectives. You don’t set an SLO to make a system fast; you set it to make its behavior predictable, turning an open-ended risk into a budget you provisioned for on purpose. You provision a known amount of headroom and monitoring against an outage whose cost you can’t predict in advance.

Token budgeting for agents looks like the same move. The governance overhead (classification, an evidence check, scoped context) is the known cost you pay deliberately. What it fences in is the runaway: the silently compounding error, the cold-start rebuild loop, the helpful sub-agent that rewrote three files nobody asked about. The goal isn’t a smaller bill so much as a more predictable one.

The cheapest place to start is also the easiest to overlook: a project memory file the model reads on every task. AGENTS.md (which began life in OpenAI’s Codex and is now read by tools like Cursor and GitHub Copilot) and CLAUDE.md (Anthropic’s project-memory convention) cost a few thousand tokens that the cache then serves cheaply for the rest of a session. Pair that with one question at completion (what artifact proves this is done?) and you have a usable floor of governance for almost nothing. Most of what sits above it is the same trade at larger scale.

None of this feels settled. The caching math especially is a moving target, and the right granularity a year from now may not look like today’s. What seems stable underneath is the shape: a known cost, paid on purpose, to fence in one you can’t predict.

This post is part of a series on building real AI systems. Earlier posts: Why AI Agents Go Wrong: It’s Not the Model, Prior Art: What Distributed Systems Already Knows, and No Evidence, No Completion. A Chinese companion piece, Token 成本的真相:分級,但別分太細, takes the same topic from a more first-person angle. The framework is open source at github.com/KbWen/agentic-os.

No evidence, no completion

KbWen — Fri, 22 May 2026 20:00:00 +0800

TL;DR: “No evidence, no completion” is a single structural principle: a task isn’t done until the agent produces an artifact that exists outside the conversation and can be checked independently. It sounds trivial. In practice it closes most of the common agent failure modes in one rule, because the act of specifying what evidence looks like, before the task runs, forces you to define what “done” actually means.

In the previous post in this series I described an agent that said a feature was done (commit SHA requested, none existed, two of three modules unchanged). The failure had a name: no external completion criterion existed, so the agent supplied its own. That gap has a one-rule fix.

What “evidence” means here

Evidence is any artifact that exists outside the conversation and can be verified independently of what the agent said.

A commit SHA is evidence. A test output is evidence. A file path with a checksum is evidence. A screenshot of a passing CI run is evidence.

“I implemented it” is not evidence. “The feature is working” is not evidence. A description of what the agent did is not evidence: it’s the agent’s own assessment of its work, which is exactly what you’re trying to verify.

The distinction matters because conversation text is not auditable. It exists only within the session, can’t be pointed to by anyone who wasn’t there, and doesn’t prove the underlying state of the system. An artifact external to the conversation can be checked at any time, by anyone, against the actual state.

Why one rule covers so much

The first post in this series catalogued five structural gaps: no completion criterion, no phase gate, no state handoff, no resource scoping, no capability boundary. The evidence principle doesn’t replace all of them, but it forces the most important one: you cannot specify what evidence looks like without first deciding what “done” means.

If the evidence for a feature task is “passing tests + commit SHA on the feature branch,” you’ve implicitly defined the completion criterion, the scope boundary (the feature branch, not the main codebase), and a checkpoint for the phase gate. The evidence requirement is the handle that pulls the rest of the structure into place.

This is why the distributed systems framing maps so cleanly: delivery acknowledgment in a message queue is exactly this pattern. The queue doesn’t trust the worker’s internal state; it requires an external signal that the job completed. Decades of production systems run on that principle because systems without it fail in the same predictable way.

Before the task, not after

The principle works when it’s applied before the task starts, not as a review step after.

“What would prove this is done?” asked before the work begins forces a design decision. It’s not a check on the agent — it’s a check on the task specification. If you can’t answer it, the task isn’t specified well enough to run. If you can answer it but the answer is vague (“the feature works”), the vagueness is in your specification, not in the agent’s execution.

This is the mechanism Pete Hodgson’s analysis of AI coding tools points toward: when a problem has many valid solutions, the agent will pick one. That one will probably be valid. It probably won’t be the one you wanted. Specifying evidence before the task runs is a way of narrowing the solution space — the agent’s output has to satisfy the evidence criterion, which eliminates the paths that don’t.

In practice: “implement email verification” with no evidence criterion produces one kind of output. “Implement email verification — done when: (1) tests pass for OTP generation and expiry, (2) commit SHA on feat/email-verification” produces a different one. Same model. Different structure around it.

What good evidence looks like

Evidence should be:

External to the conversation. It can be retrieved or verified by someone who wasn’t in the session. A commit SHA can be looked up. A test output can be reproduced. A URL can be visited.

Specific enough to be falsifiable. “Tests pass” is weaker than “running npm test returns exit 0 with 47 tests passing.” The second can be false in a way that “tests pass” can’t — which is the point. If the evidence criterion can’t be falsified, it’s not doing the work.

Proportional to the task. A one-line bug fix doesn’t need a full audit trail. The evidence for a tiny fix is the commit SHA and a grep confirming the old string is gone. The evidence for a feature touching auth, API, and database schema is more involved: test output, migration SHA, API contract diff. The Agentic OS framework classifies tasks before they run partly to route to the appropriate evidence format: a quick-win task and an architecture-change task need different levels of proof.

The cost of specifying evidence

Specifying evidence costs something up front. It takes maybe two minutes to think through “what would prove this is done” before a task starts. That’s real overhead.

The comparison is with recovery cost. A governance failure (completing a task that didn’t actually complete, or completing it the wrong way) typically costs: discovering the error, rebuilding context, rerunning the work, and auditing scope. None of those costs are bounded. The two minutes up front is.

The Agentic OS v1.1 benchmark (April 2026, using chars/4 as the token estimation formula, ±10%) measured governance overhead for a quick-win task at roughly 17,000 tokens: the cost of the full structured lifecycle, evidence requirement included. For a complex feature spanning API design, auth, and database schema, it’s around 51,000 tokens. Those numbers are real costs. They’re also the ceiling. The cost of an undetected wrong completion has no ceiling — it depends on when you find it and how much work built on top of it.

The question to ask before your next task

Before you give an agent its next task: what artifact would prove this is done?

Not “what would it mean to be done” — that’s vague enough for the agent to fill in. What specific artifact, external to the conversation, would you point to afterward and say: here is the evidence this completed correctly.

If you have an answer, you have a completion criterion. If you don’t, you’re delegating the definition of “done” to the agent. It will define one. It almost never matches yours.

This post is part of a series on building real AI systems. The previous posts cover the two-failure taxonomy and the distributed systems prior art that motivates the evidence requirement. The framework is open source at github.com/KbWen/agentic-os.

Work Log：跨 session 的記憶機制

KbWen — Fri, 22 May 2026 18:00:00 +0800

TL;DR： Work Log 是一個很無聊的東西：一份 markdown 檔案，記錄這個任務做到哪裡、做了哪些決定、下個 session 要從哪裡繼續。它沒有解決 AI 的記憶問題，只是繞過它。但在我們找到更好的方法之前，它有效。

在那篇談治理基礎的文章裡，我說過 AI「只活在那一次的對話框裡」。這個說法的代價，在你真正開始用 AI 代理做持續開發的時候才會變得具體。

你跟 AI 花了一個小時討論這個 feature 要走哪個設計模式、為什麼不用另一個方案、資料庫的 schema 要怎麼調整。全部討論清楚了，開始實作。隔天開新對話，繼續做。AI 從頭來：哪個設計模式？資料庫？我不知道你說的是什麼。

這不是 Claude 的問題，也不是任何特定模型的問題。它就是這樣運作的。上一篇說到記憶檔案（CLAUDE.md、AGENTS.md）可以幫助 AI 記住專案的架構規則。但那解決的是「規則要記住」的問題，不是「這個任務做到哪裡」的問題。Work Log 是後者。

兩層記憶，兩個問題

先說清楚兩個東西的差別，因為我發現自己最初混在一起想。

專案記憶：這個專案的架構是什麼、用了哪些 ADR、活躍的任務清單在哪裡、哪些 skill 可以用。這是全域的、靜態的，跟任何一個具體任務無關。你不常去動它，但每次開新 session，AI 需要讀它才知道自己在什麼脈絡裡。

任務記憶（Work Log）：這個任務做到哪個 phase、做了哪些決定、下一個 session 要從哪裡繼續。它是動態的、per-task 的。一個任務一個檔案，在 Agentic OS 裡放在 .agentcortex/context/work/.md（完整結構見 repo）。

混在一起的後果是：要麼全域狀態被塞滿具體任務細節（之後沒人看得懂），要麼任務進度沒地方記（每次都從頭）。分開之後，兩個問題各有各的解。

Work Log 長什麼樣子

下面是一個簡化版的實際樣子（來自 github.com/KbWen/agentic-os）：

# Work Log: feat/email-verification

## Header
- Branch: feat/email-verification
- Classification: feature
- Current Phase: implement
- Checkpoint SHA: a3f9c12

## Task Description
新增 email OTP 驗證流程。使用者第一次登入後需完成驗證，
未驗證帳號只能讀取，不能寫入。

## Phase Sequence
| Phase     | Status      | Notes                    |
|-----------|-------------|--------------------------|
| bootstrap | completed   | 分類為 feature           |
| plan      | completed   | 確認走 OTP 不走 magic link |
| implement | in-progress | auth module 完成，email 發送待測 |
| review    | pending     |                          |

## Gate Evidence
- Gate: plan | Verdict: pass | At: 2026-05-10T14:00Z
- Gate: implement | Verdict: FAIL | Reason: email sending untested, scope not complete | At: 2026-05-11T09:00Z

## Phase Summary
- plan: 討論了 OTP vs magic link。決定用 OTP，因為我們的 email
  provider 有速率限制，magic link 的 retry 設計複雜度更高。
  這個決定要記住，下個 session 不要再討論。

關鍵不在格式，在那個 Phase Summary。每個完成的 phase，AI 要用一段話說：做了什麼決定、為什麼這樣決定、有什麼取捨。

這段話的作用不是給人讀的，是給下一個 session 的 AI 讀的。這個差別值得注意：給人讀的語言習慣加很多語境鋪墊，給 AI 讀的語言要決策密度高、歧義少。同樣一個決定，給人讀可能寫「考量效率後決定用 OTP」，給 AI 讀更好的形式是「用 OTP，不用 magic link，原因：provider rate limit + magic link retry 複雜度高，此決定封存，不再重新討論」。後者直接進入 AI 的 context，前者需要它自己推斷。

新對話開始，AI 讀了這份 Work Log，知道「OTP vs magic link 已經決定過了，不用再想」。它不會再建議你改用 magic link，因為那個決策已經被記錄並封存。

哪些東西值得記

不是所有事情都要寫進 Work Log。先說限制：一份幾百行的 Work Log，AI 讀完之後注意力也稀釋了。我們設定的上限是每個 Phase Summary 一段話，不超過五句。有這個上限，才值得想清楚什麼東西最值得占那個位置。

從我的觀察來看，排最前面的是這幾樣：

決策，尤其是否定的決策。 你決定不做某件事的原因，比你決定做某件事更容易被遺忘。「我們用 OTP 不用 magic link，因為 rate limit 問題」——如果沒記，下個 session 的 AI 大概又會建議 magic link。

當前 phase 的狀態。 做到哪一步、什麼東西是完成的、什麼還沒做。這讓新 session 可以從中間接續，不是從頭。

還有一類東西最容易被忽略：你在 implement phase 發現了一個沒有答案的問題。不要默默繼續，也不要讓 AI 自己想辦法繞過去。寫進去，下個 session 一開始就正面面對它。這類問題如果沒記，AI 下次遇到同一個岔路，十之八九走錯方向——不是因為它笨，是因為它不知道你已經知道那條路走不通。

這個方法的真實限制

說清楚它做不到的事，比說它能做什麼更重要。

Work Log 解決的是「把決策外化」的問題。AI 把決策寫在外部，下次讀回來，行為才一致。它沒有解決 AI 的狀態記憶問題，因為那個問題的解決需要模型架構層面的改變。Work Log 只是個繞路方案：既然 context 不能跨 session 存活，我們就把最重要的東西寫成文件，在 session 開始時重新注入。

這個繞路有個天花板。任務夠複雜的時候，Work Log 本身也會膨脹。你開始注意到你在寫一份文件，讓 AI 讀這份文件，再根據它繼續工作——而不是直接繼續工作。整個過程變得笨重。

還有一個現在可能不用擔心、但以後要注意的事：prompt cache 機制。Claude 和其他主流模型都有 prompt cache，在同一個 session 內重用相同 context 的成本很低（以 Claude 為例，cache TTL 大約是 5 分鐘到 1 小時）。如果你的任務可以在一個 session 裡完成，Work Log 的 ROI 其實有限——cache 幫你保住了 context，不用依賴外部記錄。Work Log 真正發揮的地方，是跨越多個 session 的任務，也就是 cache 早就失效的那種。

我們把 Agentic OS 的 Work Log 定位為一個夠用的暫時解，不是最終答案。AI 工具的 native memory 機制在快速發展，現在適合加 Work Log 的任務類型，一兩年後可能模型自己就能處理。這個觀察在整個系列的第一篇裡也說過：任何固化的解法都有保鮮期。

如果你想試試看

最輕量的開始：開一個任務，在任務開始前新增一個 markdown 檔案。三個區塊就夠：任務目標（一句話）、已決定的事（每次做了決定就加一行）、目前停在哪裡（每次結束 session 更新）。

不需要完整的 Work Log 格式。這三個區塊能擋掉大部分「下個 session 從頭來」的問題，原因很簡單——它逼你在 session 結束之前把當前狀態說清楚，而不是留給下一個 AI 自己猜。做複雜了再引入完整的 phase 結構，不用一開始就全部上。

Agentic OS 完整的 Work Log 模板在這裡，包含 phase 定義、gate evidence 格式和 handoff 結構。如果你每次開新對話的前 10 到 15 分鐘都在重新交代背景，而不是做事，那就是加 Work Log 的時機了。

這篇是 Agentic OS 系列的一部分。相關閱讀：只用 Prompt 和技能，也能做到基本治理說的是更輕量的做法，Work Log 是在那個基礎上加一層。AI 代理常見痛點與我們的嘗試是這個系列的入口。

Prior art: what distributed systems already knows

KbWen — Fri, 22 May 2026 16:00:00 +0800

TL;DR: The governance problems that make AI agents unpredictable (unverified completions, state loss between sessions, unconstrained scope) are structurally identical to problems distributed systems engineering solved with audit logs, delivery acknowledgment, state machines, and least-privilege access. The one genuine difference is non-determinism: an agent given the same open-ended task twice will do something different, which means governance needs to front-load constraints rather than just catch failures after. But the rest of the pattern library applies directly.

If you have built a message queue, you have hit a version of this bug: a worker picks up a job, does the work, then fails before sending the acknowledgment. The queue marks it undelivered. The job runs again. Now you have a duplicate record, a double email, or worse, depending on what “the job” was.

The fix is well-understood: require the worker to produce evidence of completion that the system can verify externally. Don’t trust the worker’s internal state. Trust the artifact.

When an AI agent says “done” and you have no artifact to check against, that’s the same design gap. The previous post in this series has a concrete example: the agent said the feature was done, I asked for the commit SHA, there wasn’t one, and two of the three modules it described implementing hadn’t changed. A capability failure looks like wrong reasoning. This was neither: the agent completed exactly what it was given, through its own completion criterion, because no external one existed. The fix is in the surrounding structure.

Distributed systems already solved the worker-reliability problem. The patterns map directly.

What agent execution looks like from the outside

Strip the language model out for a moment. What’s left?

A task arrives. A worker picks it up, performs operations, and signals completion. The orchestrator decides what to do next.

Standard async task pipeline. The governance questions are the same ones distributed systems have always asked: Did the work actually happen? What state is the system in now? What was the worker allowed to touch?

The answers (delivery acknowledgment, audit logs, state machines, capability sandboxing) aren’t novel. They exist because systems without them fail in predictable, documented ways. Agent deployments running without that structure encounter the same failure modes.

The pattern mapping

Distributed systems pattern	Agent governance equivalent
Delivery acknowledgment	Every task completion requires an external verifiable artifact: commit SHA, test output, file path
Idempotency key	Task dispatch is deduplicated: same task classified and scoped the same way, regardless of retry
Audit log / event sourcing	Work Log: decisions recorded at the time they happen, not reconstructed from memory later
State machine with explicit transitions	Phase gate: plan before implementing, review before shipping, with real entry/exit conditions
Least privilege / capability sandbox	Agent’s tool access scoped to what the specific task requires, not everything available
Resource quota	Task classification that routes work to an appropriately sized execution path before it begins

The Agentic OS framework is essentially this table implemented as a working system, not because it invented these patterns, but because building it kept arriving at the same structural answers distributed systems already had. The evidence requirement feels new until you recognize it as a CI gate. The work log feels novel until you recognize it as event sourcing. The insight isn’t original; it’s just overdue.

The one place the analogy breaks

Distributed systems assume deterministic workers. Same input, same output, retry is safe.

Agents aren’t deterministic, at least not for open-ended tasks. The same prompt, the same tools, the same context: execution goes somewhere different. Sometimes better. Often just different. For well-scoped sub-tasks (“run these tests and report failures,” “format this JSON to this schema”), retry still works fine. But for the tasks where governance matters most (feature implementation, refactoring decisions, scope-touching work), retry isn’t a recovery strategy; it’s another roll.

This is what Pete Hodgson’s analysis of AI coding tools points toward: when a problem has many valid solutions, the probability that an agent independently lands on the one you wanted approaches zero. The governance implication is that task decomposition is itself a governance act. Break work into pieces small enough that non-determinism is contained. Then front-load the constraints on the pieces that remain open-ended: define what “done” means, specify which files are in scope, classify the task before the first tool call.

The circuit breaker in distributed systems stops a cascade after failures accumulate. The agent equivalent is not letting the cascade start.

Where to instrument

Distributed systems tell you to instrument at the transition points: message intake, worker pickup, task completion, downstream dispatch. These are where state changes happen and where failures manifest.

The agent equivalent:

Task intake: Is this classified correctly? What phase path follows? What tools does it need, and only those?
Phase completion: What artifact exists to prove this phase is done? Is it external to the conversation?

The third transition point is worth more than a bullet. Session boundary is the agent-specific failure mode that has no clean distributed-systems equivalent: it’s closer to a stateless worker that loses its in-memory state and reprocesses from the queue head on restart. An IEEE Spectrum report on AI coding tools documented the pattern: in longer sessions, agents increasingly regenerated functions that already existed and ignored conventions established earlier. The fix is identical to the queue case: persistent state external to the worker. In agent terms: a work log that records decisions at the time they’re made, so the next session inherits context instead of reconstructing it.

Which gaps cost the most

The distributed systems frame doesn’t just explain why agent governance looks the way it does — it tells you which gaps cost the most.

Missing completion verification produces the cheapest failures: you find out fast. Missing scope constraints produce the expensive ones: the agent did three things you didn’t ask for, two of which were correct, and now you’re auditing which is which. Missing session state produces the hidden ones: the agent solved a problem you already solved, using a pattern you already decided against, because it had no way to know.

If you’re choosing where to add structure first: start with scope. The task intake gate is the circuit breaker — it constrains what the agent can reach before it runs. The work log is the audit trail you need after something goes wrong. The completion artifact is the acknowledgment the queue was never getting.

Add them in that order.

This post is part of a series on building real AI systems. The previous post, Why AI Agents Go Wrong: It’s Not the Model, covers the capability vs. governance failure taxonomy that motivates this framing. Next: No Evidence, No Completion takes the evidence requirement as a standalone principle and shows what it looks like in practice. The framework is open source at github.com/KbWen/agentic-os.

只用 Prompt 和技能，也能做到基本治理

KbWen — Fri, 22 May 2026 14:00:00 +0800

TL;DR： 在裝任何框架之前，有一層治理是免費的：在專案根目錄放一個 AGENTS.md 或 CLAUDE.md，養成開口要求 evidence 的習慣，開始任務前先說清楚什麼不能動。這三件事不能替代跨 session 的狀態管理，但能擋掉大部分常見問題。這篇說的就是怎麼做、做到什麼程度、在哪裡會失效。

有一段時間我的 Claude Code 工作流裡沒有任何框架，只有對話和一堆臨時 prompt。某天我做了兩個改變：把專案的架構決策寫進一個 CLAUDE.md，還有在每次 AI 說「好了」的時候問一句「commit SHA 是什麼？」

一類問題幾乎消失了：AI 在新 session 裡對著不存在的設計模式寫程式碼的情況，以及我接受了「完成」卻發現什麼都沒變的情況。不是所有問題都解決了。但那兩件事的性價比，讓我後來開始認真想「在裝框架之前，這個層面的治理到底能做多少」。

這篇是AI 代理常見痛點與我們的嘗試的延伸。那篇列了五個反覆出現的問題，這篇專門回答：只靠 prompt 習慣和 skill 選擇，能解決多少？

記憶檔案：解決跨 session 失憶的最低成本方案

AI 代理在每一個新對話都是空白狀態。它不記得上次的架構決策，不記得你說過不要用哪個 pattern，也不記得你已經有一個 utils/auth.ts，所以它再寫一個新的。這個問題在 IEEE Spectrum 的報導裡有量測數據：長 session 後期，AI 重複生成已存在函式、忽視早期建立的 coding convention 的頻率明顯上升。

三個工具在試圖解決同一個問題：

AGENTS.md 是 OpenAI Codex 最初設計的慣例，後來被 Cursor、GitHub Copilot 和 Google Antigravity 等主流工具廣泛採納。它的設計邏輯是：在任何工具讀取它之前，先告訴工具「這個專案是怎麼運作的、你可以做什麼、不可以做什麼」。

CLAUDE.md 是 Anthropic 針對 Claude Code 的版本。Claude Code 在每個新 session 開始時自動注入這個檔案的內容，所以你放在這裡的東西就等於是每次都在對話開頭重新說一遍。

.cursor/rules 是 Cursor 的對應物。原理相同。

這三個慣例同時存在，說明「怎麼讓 AI 記住專案規則」這個問題是通用的，不是某個工具特有的。選哪個取決於你主要用什麼工具，你不需要三個都放，放一個就有效果。

這類記憶檔案最有用的內容通常是三類：架構限制（「這個 repo 用 Repository pattern，不要把業務邏輯寫進 controller」）、命名規範（「service 命名用 XxxService，不要用 XxxManager」）、以及「不要碰」清單（「/database/migrations 只有在明確被要求的時候才能動」）。

一個重要的注意：這類檔案要短。研究觀察和實踐都指向同一個上限：200 行、2000 token 以內。超過這個長度，重要的規則會被稀釋。AI 技術上還是讀了整個檔案，但前面讀到的東西到後面已經注意力不足。寫 CLAUDE.md 的時候，如果你覺得需要加第六條規則，先問自己第一條能不能刪掉。

Skill 選擇：要求愈具體，干擾愈少

在 Claude Code 或 Cursor 的一次工作 session 裡，你可以載入很多 context：整個 codebase 的 README、過去的對話歷史、多個技能文件。但「載入愈多愈好」是個陷阱。

一個改一行 typo 的任務，不需要知道整套測試策略、部署規範和 API 設計原則。把這些全部塞進 context，不會讓 AI 更謹慎，只會讓它在「哪些規則現在適用」這件事上分配更少的注意力給真正重要的那個。

這不是 Agentic OS 特有的問題，是任何 Claude Code 或 Cursor session 都存在的情況。具體做法是：開始一個任務之前，先想清楚這個任務需要知道什麼，然後只提供那些。一個 tiny-fix 說「這是那行 code，幫我修」就夠了；一個涉及多個模組的功能開發才需要交代設計模式、測試策略和資料庫規範。

結果是給了 AI 密度更高的相關資訊，不是更少的資訊。

Evidence 習慣：不問則不說

這是成本最低的一個改變，也是讓我最驚訝的一個。

AI 說「完成了」的時候，它有可能真的完成了，也有可能完成了 90% 然後遇到小問題就繞過去了，也有可能整個理解方向就錯了。這三種情況在它的輸出裡，有時候看起來幾乎一樣。

養成一個習慣：在接受任何「完成」之前，要求一個具體的 artifact。

不是一個表單，也不是一套流程，就一句話：「commit SHA 是什麼？」「把 test 跑一遍，貼輸出給我」「你改了哪個檔案，第幾行？」

這個習慣有效的原因不只是讓你可以查。問這個問題本身會讓 AI 把它沒說清楚的地方說出來。 很多時候，我問「測試有過嗎」，它才會說「啊，那個測試我還沒跑，因為 X 的 setup 有問題」，而這個資訊如果我沒問，它可能就默默略過了。

誠實地說：這個習慣很累。問了十幾次之後你開始理解為什麼人們想要自動化這件事，框架裡的 evidence gate 就是把這個問答自動執行。但作為一個習慣，它能擋掉大概六七成的「接受了看起來完成的東西、後來發現沒有」的情況。

範圍宣告：先說不要碰什麼

開始一個複雜任務之前，明確告訴 AI 它應該不要碰什麼。

具體的說法比模糊的說法有效：「你在做 authentication module。除非我明確說，不要碰 /api/payments 和 /database/migrations 底下的任何東西」比「專注在 auth 就好」有用得多。

原因不完全是「AI 會遵守」，它不是每次都遵守。而是宣告了邊界之後，AI 在不確定的時候開始問問題而不是自己決定。我給了這樣的指令之後，在它原本會直接去改 payments module 的地方，它變成問我「這邊需要我更新 payment 的驗證邏輯嗎？」——這個轉變很有價值。

這個觀察跟 Pete Hodgson 對 AI coding assistant 失效模式的分析有直接的關係：當一個問題存在很多可能的解法，AI 選中你心目中那個的機率趨近於零。把解法空間縮小（也包括把「不能碰的部分」明確劃出來），大幅提高了它走向你要的方向的機率。這是流程問題，跟模型能力無關。

在從「下指令」到「蓋系統」裡，我說過「AI 只活在那一次的對話框裡」。宣告範圍是在這個限制之內，盡量讓它知道那個對話框的邊界在哪裡。

這個層面的治理做到什麼、做不到什麼

做得到的：讓 AI 在新 session 裡記得你的架構決策（記憶檔案）。讓它在不確定的時候問你而不是自己做決定（範圍宣告）。讓你在接受輸出之前有一個具體的查核點（evidence 習慣）。把技能文件控制在合理長度，避免注意力被稀釋（skill 選擇）。

做不到的是跨 session 的連貫狀態。記憶檔案解決的是「規則記得住」的問題，不是「上次做到哪裡」的問題。如果你的任務橫跨多個 session，每次開始你還是要手動交代背景——或者接受 AI 從頭重推一遍。Evidence 習慣的疲勞感也是真實的：問個五十次之後，你會想要自動化。這不是壞事——這是你已經知道在哪裡需要更正式的結構的訊號。範圍宣告在複雜任務下同樣會降解，涉及的模組愈多，「先說不要碰什麼」就愈難窮舉。

這個層面的治理是真實的，不是「沒有框架的窮人版」。但它有天花板。當你開始覺得每次的 context 交接很重複、evidence 問答讓你厭倦、範圍宣告的清單比任務本身還長——那就是你已經碰到這個層面的邊界了。

下一篇：Work Log：跨 session 的記憶機制

Agentic OS 是開源專案，記憶檔案的範本和設計說明都在這裡：github.com/KbWen/agentic-os

Why AI Agents Go Wrong: It's Not the Model

KbWen — Fri, 22 May 2026 12:00:00 +0800

TL;DR: “The agent did something wrong” usually gets diagnosed as a model problem. Most of the time it isn’t. Capability failures (wrong reasoning) and governance failures (no structure to catch wrong reasoning) look identical from the outside but need completely different fixes. This post is about telling them apart, and why most teams are currently solving the wrong one.

The agent said the feature was done. I asked for the commit SHA. There wasn’t one. When I checked the branch, two of the three modules it described implementing hadn’t changed.

The instinct in that moment is to reach for a better prompt, a smarter model, maybe a different tool call. That instinct is usually wrong.

What happened wasn’t a reasoning failure. The agent completed exactly the task it was given, interpreted through its own completion criterion, because no explicit one existed. There was no audit trail to check what it actually did. There was no scope boundary to constrain what “done” even meant. The model behaved correctly inside a system that gave it no structure to behave correctly toward.

That’s a governance failure, not a capability failure. And the fix is not a better model.

Two failure modes that look the same

When an agent produces bad output, the failure is almost always categorized as one thing: the AI got it wrong. Which leads to one solution category: better AI.

The problem is that “the AI got it wrong” conflates two distinct failure modes that have nothing to do with each other.

Capability failure: the model reasoned incorrectly. It missed a constraint, hallucinated a fact, drew a wrong inference. The fix lives in the model layer: better prompt, better retrieval, better fine-tuning, sometimes a more capable model.

Governance failure: the system had no invariant to catch or prevent what the agent did. The agent may have reasoned perfectly well and still produced a wrong outcome, because the surrounding structure gave it nothing to constrain against.

There’s a useful diagnostic test: would a smarter model have prevented this?

If yes, if the failure was clearly about incorrect reasoning or a factual miss, that’s a capability failure.

If no, if a brilliant expert given the same underspecified task would have made the same wrong choice, or a different wrong choice, because the task itself had no defined success condition. That’s a governance failure. Upgrading the model doesn’t help.

Most of the “unpredictable agent” complaints I’ve seen are governance failures. The problem gets framed as model unreliability because that’s what’s visible. The actual cause is invisible: the absence of structure.

The five structural gaps

These are the governance gaps that show up repeatedly, not as edge cases, but as the default state of most agent deployments. The zh-TW companion post AI 代理常見痛點與我們的嘗試 goes deeper on each one with narrative examples. Here I want to name the structural invariant that’s missing in each case.

Output not verifiable → no completion criterion or audit trail. The agent says “done.” You have no artifact to check against. The agent’s word that something happened is not evidence that it happened. The missing invariant: every task completion requires an attached evidence artifact: a file path, a commit SHA, a test result, something external to the conversation.

Steps skipped → no phase gate. Given a complex task, agents move toward output by the shortest path. Scope-setting, dependency mapping, impact analysis (anything that doesn’t look like “doing the thing”) gets skipped. The missing invariant: phases with entry and exit conditions that must be satisfied before proceeding. Pete Hodgson has written about this from an angle worth noting: when a problem has many valid solutions, the probability that an agent independently arrives at the one you actually wanted approaches zero. Pre-alignment isn’t overhead. It’s the phase gate that prevents redoing work.

Cross-session amnesia → no state handoff mechanism. Every new conversation is a blank slate. Decisions made in session one are unknown in session two. The agent rediscovers problems you’ve already solved, proposes patterns you’ve already rejected, rebuilds context you’ve already paid to build. An IEEE Spectrum report on AI coding tools documented this concretely: in longer sessions, agents increasingly regenerated functions that already existed and ignored conventions established earlier in the same session. The missing invariant: a structured work log that carries decisions forward across session boundaries. The mechanism we use is stupid-simple. It’s essentially forcing the agent to keep a diary. That description isn’t flattering, but cross-session amnesia is real enough that stupid-simple works.

Unbounded token cost → no resource scoping. An agent given a large task will read everything it can find, activate every relevant capability, and use as much context as the task allows it to justify. Without resource scoping, costs are unpredictable and you have no way to set expectations before a task starts. The missing invariant: task classification that routes to appropriately sized execution paths before the task begins.

Scope creep → no capability boundary. This is the quietest failure mode. The agent does what you asked, and also reorganizes a module you didn’t ask it to touch, and also “helpfully” updates a config file while it was in the neighborhood. Security researcher Johann Rehberger (Embrace the Red) made this failure mode concrete in April 2025 when he spent $500 testing Devin AI’s response to embedded instructions in GitHub issues, then reported the results to Cognition: 84–85% of attacks succeeded in getting the agent to execute actions outside the intended scope. That’s an extreme case, but the everyday version of this (the agent quietly expanding what “done” means) is the same structural gap. The missing invariant: explicit capability boundaries that define what the agent is allowed to do, not just what it’s been asked to do.

None of these gaps are model problems. A more capable model, given the same absent structure, makes the same category of errors, just more convincingly.

Engineering already solved these problems

These aren’t new problems with new solutions. They’re old problems that software engineering solved decades ago, applied to a different execution substrate.

Governance gap	Engineering equivalent
No completion criterion	CI gate: no merge without passing checks
No phase gate	PR review requirement: code doesn’t ship without sign-off
No state handoff	Audit log / ADR: decisions are recorded, not reconstructed
No resource scoping	Budget / SLA: bounded cost before work starts
No capability boundary	Principle of least privilege: access limited to what the task requires

The analogy isn’t decorative. These are the same structural mechanisms. Building the evidence requirement for Agentic OS, I kept writing things that felt novel until I realized I was describing CI gates and audit logs with different names. The insight wasn’t new. It was just late.

A CI gate doesn’t trust the developer’s word that the tests pass. It requires evidence. An audit log records decisions at the time they’re made, so they don’t need to be reconstructed from memory later. Least privilege limits what an agent can touch, not out of distrust, but to contain the blast radius when something goes wrong.

The AGENTS.md convention, now adopted across Claude Code, Cursor, and GitHub Copilot as a standard way for agents to load project context, is essentially a machine-readable project governance document. It’s the same idea as a team’s architecture decision record, but in a format the agent reads automatically. That’s not a coincidence. It’s the same structural need surfacing in a new context.

What’s missing in most agent deployments isn’t better AI. It’s the application of mechanisms that software engineering already knows work.

What governance actually costs

“Adding structure” sounds like adding overhead. It’s worth being concrete about the actual numbers.

We measured governance overhead across several task types in Agentic OS v1.1 (April 2026, using chars/4 as the token estimation formula; actual counts vary by ±10% depending on tokenizer). For a quick-win task (something like fixing a date format in a CSV export), the governance overhead came to 17,041 tokens. For a complex feature touching API design, authentication, and database schema, it came to 50,975 tokens.

Those numbers sound large until you compare them to the cost of an ungoverned failure. A governance failure typically means: an undetected wrong completion that gets discovered later, a context restart, redone work, and scope cleanup. None of those costs are bounded or predictable.

The governance overhead is bounded. It scales with task complexity in a predictable way: the lightest path costs roughly 17K tokens; the heaviest measured scenario costs under 62K. The cost of recovering from a scope error or a missed completion criterion is not bounded. It depends on when you find it.

This isn’t an argument for any particular framework. It’s an argument for the structure itself: known, upfront cost versus unbounded, discovery-time cost. That trade-off is the same one CI gates resolved for software deployment twenty years ago.

The question to ask before the task starts

None of this requires a framework. The diagnostic test at the task level is simpler than that.

Before your next agent task: what artifact would prove this is done?

Not “what would it mean to be done.” That’s vague enough that the agent will fill in the answer. What artifact, specifically, would you point to afterward and say: here is the evidence this completed correctly?

If you can answer that question before the task starts, you have a completion criterion. If you can’t, you don’t, and the agent will invent one. That invented criterion is almost never the one you wanted. The everyday version doesn’t look like a security incident. It’s an agent that quietly refactored a module you didn’t mention, or updated a config file it found nearby. Its completion criterion included those things. Yours didn’t.

That’s the smallest possible governance structure. A definition of done, stated before work begins, tied to something observable.

The rest of the gaps (phase gates, state handoffs, resource scoping, capability boundaries) are the same logic applied at increasing scope. But they all start from the same place: deciding what “done” means before asking the agent to find out.

These observations are from building and using Agentic OS v1.1 (April 2026). The field moves fast — if a model capability has improved or a pattern here no longer holds, I want to know. The framework is open source and the issues are open: github.com/KbWen/agentic-os.

This post is part of a series on building real AI systems. Related reading: What Makes an AI Skill Different from a Prompt? covers the capability abstraction layer that sits below agent orchestration. The zh-TW companion post AI 代理常見痛點與我們的嘗試 covers the same failure catalogue with more narrative depth. Both build on Beyond Prompt: From Instructions to Building Systems.

AI 代理常見痛點與我們的嘗試

KbWen — Fri, 22 May 2026 10:00:00 +0800

TL;DR： AI 代理失控通常不是模型的問題，而是缺少足夠的結構。這篇整理了我們在實踐中觀察到的幾個痛點，以及 Agentic OS 試著用哪些方向來應對——不保證這是最好的做法，AI 工具本身也還在快速演化。

如果你已經在用 Claude Code、Cursor 或 Copilot 一段時間，你大概知道那種感覺：有時候它快得讓你懷疑自己為什麼還要打字，但有時候你盯著它的輸出，心裡只有一個念頭——「等等，它在幹嘛？」

印象更深的往往是後者。我發現有幾類問題會反覆出現，跟你用哪個模型或哪個工具關係不大，比較像是讓 AI 代理參與真實開發這件事本身帶來的結構性挑戰。

如果你讀過從「下指令」到「蓋系統」，這篇可以看成那個思路的延伸——當你開始用 agent 做真實開發，「結構不夠」這件事的代價變得具體很多。

Agentic OS 是站在很多公開工作的肩膀上做出來的。AGENTS.md 這個慣例最初來自 OpenAI Codex 的設計，後來被 Cursor、GitHub Copilot 等主流 AI 工具廣泛採納；Anthropic 有自己的 CLAUDE.md；Cursor 有 .cursor/rules——各自代表不同工具對「怎麼讓 AI 記住專案規則」這個問題的嘗試。我們參考了這些設計，加上 Hacker News、Reddit 社群裡的實測討論，還有 Pete Hodgson、Addy Osmani、Thorsten Ball 等工程師整理的失效模式分析，試著把它們整合成一套對我們自己有用的東西。這個框架比較像是整合與實驗的產物，不是從零發明的。

幾個反覆出現的痛點

以下整理自我們自己踩過的坑，也有部分來自社群的集體觀察。不是嚴謹的研究，是實踐者的筆記。

輸出難以核查

AI 完成任務後，你拿到的往往是一段文字說「已完成」或「功能已實作」。問題是「完成」的依據是什麼？在單一短對話裡這不是大問題，但一旦任務橫跨多個 session，或者事後需要追溯某個決策的來源，你往往什麼都找不到——沒有 commit SHA、沒有測試輸出、沒有可以指著說「它在這裡」的東西。只有對話紀錄，而對話紀錄不算數。

這個問題後來直接影響了我們的框架設計。Agentic OS 裡有一條規則：就算是「重讀同一份文件」這個動作，也必須留下一筆收據。聽起來很囉嗦，但沒有這個，「我讀過了」和「我沒讀過」在紀錄裡是完全一樣的。

跳過中間步驟

給 AI 一個任務，它的自然傾向是直接往結果走。這在小任務上沒問題。但任務稍微複雜一點——比如需要同時異動前端、後端和資料庫——省掉的「先確認範圍」、「列出影響的模組」這些步驟，往往要在後面以更大的代價補回來。工程師 Pete Hodgson 在他的文章裡提到，當一個問題有很多不同的解法時，AI 選到你心目中那個的機率趨近於零——提前對齊方向，跟模型能力無關，是流程問題。

跨對話的連貫性

在那篇談 Prompt 局限的文章裡，我說過 AI「只活在那一次的對話框裡」。這個限制在用 agent 做持續開發的時候感受更強烈。每次開新對話，你得重新交代背景：這個專案的架構決策是什麼、上次決定用哪種設計模式、之前踩過什麼坑。這不只是麻煩，而是會讓同樣的問題被重新發現、同樣的決策被重新討論。IEEE Spectrum 的一篇報導裡提到，AI 在長 session 的後期，出現重複生成已存在函式、忽視早期建立的 coding convention 等情況的頻率明顯上升——本質上是 context 稀釋的問題。

資源使用的不確定性

AI 代理讀文件、呼叫工具、產生輸出，這些都有成本，而且差距可以很大。我們在 Agentic OS v1.1 的 benchmark 裡（2026 年 4 月量測）跑了幾個真實場景：quick-win 等級任務（例如修一個 CSV 格式問題）實際消耗約 17,041 token；涵蓋 API、認證、資料庫的複雜功能開發則約 51,000 token，相差接近三倍。這些數字來自特定的任務類型與工具組合——我們用的估算公式是 chars / 4，接近多數 OpenAI tokenizer，但不完全一致——不同模型、context 策略下的結果可能差距顯著。

更複雜的是，這個計算現在又多了一層變數。主流模型——包括 Claude 和 OpenAI 的系列——已經有 prompt cache 機制，在某些條件下可以大幅降低重讀相同 context 的成本。這讓我們原本關於「怎麼控制 context 讀取策略」的很多設計假設需要重新檢視。我們還在觀察這個演變，舊的建議不一定還適用。

範圍的模糊

這類問題比較難描述，因為它不一定會報錯——它只是靜靜地做了你沒有要求它做的事。安全研究員 Johann Rehberger（筆名 Embrace the Red）花 $500 測試了 Devin AI 的 prompt injection 抵抗力，並於 2025 年 4 月將結果通報給 Devin 的開發商 Cognition。測試結果顯示透過 GitHub issue 嵌入惡意指令，可以讓 Devin 執行預期範圍以外的操作，整體攻擊成功率達 84–85%。這是極端的例子，但「AI 自己決定任務邊界」這件事的普通版本，每天都在發生——它只是偷偷多改了一個 config 檔，或者順手重構了你沒說要動的模組。

我們試著做的事

Agentic OS 的出發點，是試著在這些問題上加一些結構。主要思路有幾個方向：

我們把核心原則叫做 “No Evidence = No Completion”——想法本身不新奇，軟體工程裡的 CI/CD gate 做的就是這件事，只是把它搬到了 AI 代理的工作流程裡。每個任務的交付都要附帶某種形式的 evidence，不一定很複雜，但要有東西可以查。同時，根據任務的規模，要走的流程也不一樣：單行改動走輕量路徑；功能開發走比較完整的流程，包含計劃、實作、審查幾個階段。這個分層設計部分參考了 Anthropic 和 Cursor 社群分享的做法，調整成對我們自己比較實用的版本。

用 Work Log 保持連貫性。 每個任務有一份對應的工作記錄，記關鍵決策和目前狀態，讓下一個 session 能接續而不是重來。這是個很笨的方法（基本上就是強迫 AI 寫日記），但在我們找到更好的方式之前，它目前還算有用。

至於資源分配，我們試著把不同分類的任務對應到不同的 skill 載入策略，不一次讀所有東西。不過如前面說的，model cache 機制的演進讓這部分的設計面臨一些調整，舊的策略不一定還有效。

一些誠實的話

這套框架有用，但不是沒有問題——有些設計現在回頭看也不一定是最好的決定，只是當時看起來合理。Addy Osmani 把這個現象稱為「70% 問題」：AI 能很快帶你到 70% 的完成度，但剩下的 30% 往往需要更多工程判斷力，不是更少。設計一套治理框架也一樣——結構能幫你避開很多坑，但它改變不了你還是需要做設計決策這件事。

AI 工具的演進速度，讓任何固化的解法都有保鮮期的問題。有些我們在設計時試圖解決的問題，現在模型本身可能已經部分處理了；反過來，也有我們沒預想到的新狀況冒出來。我們把 Agentic OS 定位為一個持續演進的實驗，不是一個收斂的答案。這個系列會把框架的各個機制拆開來談。如果你也在摸索怎麼讓 AI 代理在實際開發工作裡更可控、更可追溯，希望有些地方能對你有參考價值。

下一篇：只用 Prompt 和技能也能做好治理：實用技巧與範例

Agentic OS 是開源專案，歡迎看看我們怎麼實作，也歡迎指出你覺得不對的地方：github.com/KbWen/agentic-os

What Makes an AI Skill Different from a Prompt?

KbWen — Thu, 16 Apr 2026 00:00:00 +0800

TL;DR: A “Skill” in production AI is not a saved prompt — it’s a capability abstraction layer with a defined input schema, tool bindings, validation, and retry logic. This post explains why that distinction matters and how Skills fit between raw model calls and higher-level agent orchestration.

This post is part of a series on building real AI systems. If you haven’t read the previous piece on moving beyond prompts, that’s a good place to start.

Introduction

Most people treat AI skills as glorified prompt templates.

Honestly, that’s fair. A lot of products market “skills” as saved instructions with a nicer UI. You click a button, it loads some text into the system prompt, done.

But in a real production system, a Skill is something different.

It is a capability abstraction layer—sitting between raw model calls and the higher-level orchestration that actually does something useful. If prompts are like making direct API calls, Skills are the first layer of actual software architecture you build on top of them.

That distinction sounds abstract right now. It starts to matter a lot once your system gets complicated.

Prompt Templates Fall Apart Eventually

Prompt templates are fine to start with. Quick to write, easy to understand, and they work.

Until they don’t.

Say you’re building a basic content pipeline:

Research topic
→ Draft outline
→ Write article
→ Review tone
→ Check SEO
→ Generate metadata

Six steps. You write six prompts. Seems manageable.

A few months later, the pain starts showing up in small ways:

You copy-paste prompt logic into a new workflow and tweak it slightly. Then again. Now you have three versions that are almost identical but not quite.
Someone changes the output format for one prompt. Two other steps break silently.
You add a tool integration to one flow. You realize you need the same thing in three other flows and now you’re doing it manually each time.
Retry logic gets bolted onto individual prompts in different ways depending on who wrote them.

None of these problems are catastrophic on their own. Together, they quietly accumulate until you realize you’re maintaining a pile of interconnected prompt strings that nobody fully understands anymore.

That’s the moment when prompts stop being instructions and start being infrastructure—but without any of the structure that makes infrastructure manageable.

This is where Skills become necessary.

A Skill Is a Capability Contract

Here’s a useful way to think about it:

A Skill is a standardized interface around a model capability.

Not around a prompt. Around a capability.

In practice, a properly designed Skill specifies a lot more than the prompt text itself:

name: blog_writer

inputs:
  - topic
  - audience
  - tone

tools:
  - web_search
  - knowledge_base

constraints:
  - markdown_output
  - seo_friendly
  - no_ai_tone

validation:
  - min_word_count
  - heading_structure_check

retry_policy:
  max_attempts: 2

Once you’re writing something like this, you’re not really doing prompt engineering anymore. You’re doing interface design.

The prompt probably still exists somewhere inside the implementation. But from the outside, what you’ve defined is:

Input schema → Capability execution → Validated output

That’s much closer to an API or a microservice than a prompt. And it behaves like one—you can test it independently, version it, swap the implementation without changing the callers.

Why This Actually Matters for Workflows

The main reason Skills are useful is that workflows need stable interfaces to call into.

Without that abstraction, every workflow ends up owning its own prompting logic:

Workflow step 1:  raw prompt A
Workflow step 2:  raw prompt B
Workflow step 3:  raw prompt C

Which means your orchestration logic and your model interaction logic are completely tangled together. Changing one touches the other. Reusing anything requires copy-pasting.

With properly defined Skills, a workflow looks more like this:

research_skill()
summarize_skill()
outline_skill()
writer_skill()

Now the workflow doesn’t care how writer_skill works internally. It just knows what it takes and what it produces. You can improve the skill, swap the model, change the prompt—the workflow doesn’t need to know.

Some other things this unlocks: you can attach logging and metrics at skill boundaries instead of burying them in prompts. You can run evaluations on individual skills. Multiple workflows can call the same skill without duplicating logic. Teams can work on skills independently without stepping on each other’s orchestration code.

It’s modular design. The same reasons it works in software apply here.

You Can Get the Granularity Wrong

Skills aren’t free. Over-abstracting creates its own problems, and it happens a lot when teams first discover this pattern.

If you cut things too fine:

title_generator_skill
intro_writer_skill
paragraph_writer_skill
conclusion_writer_skill

Now you’ve created a pile of tiny skills that don’t do anything useful on their own and require elaborate orchestration to combine. You’ve traded one mess for a different mess.

Go too broad instead:

content_engine_skill

And you’ve got a black box that’s hard to test, hard to improve, and not actually reusable across contexts.

The rule is the same as software modularity has always been: a Skill should represent one reusable capability boundary. Not one prompt. Not an entire workflow. One coherent capability.

Getting that boundary right takes some iteration. That’s normal.

Where Skills Sit in the Stack

A practical architecture for a production AI system usually looks something like this:

Each layer has a distinct job:

Layer	Responsibility
Model	Generate / reason
Tool	External actions / data access
Skill	Encapsulate reusable capability
Workflow	Orchestrate deterministic processes
Agent	Handle dynamic decision-making
Application	User-facing product logic

The reason this layering matters is that a common mistake in AI product design is collapsing everything into the model layer—treating model behavior as the architecture itself. That works for demos. It falls apart as soon as you need to maintain, test, or scale anything.

The model is one layer. A useful one, obviously. But still just one layer.

The Shift from Prompting to Engineering

There’s a specific moment in building AI systems where you stop interacting with models and start designing machine capabilities. It’s not always obvious when it happens, but you usually notice it in retrospect.

Skills are often where that shift shows up concretely.

Before Skills, most of your effort goes into getting the model to do the right thing in a given context. After Skills, you’re thinking about interfaces, contracts, validation, reuse—the same concerns that show up in any software system.

The underlying craft changes. It’s still prompt engineering in places, but the bigger picture is system design.

Conclusion

Prompting is still useful. It’s the basic interaction primitive for working with AI models and it’s not going away.

But prompts alone don’t constitute architecture.

When you need your system to be maintainable, testable, and reusable—when multiple workflows need to share behavior, when you need consistent tooling and validation across the system—you need an abstraction layer above raw prompting.

That’s the Skill layer.

The short version:

Prompts help you use AI.
Skills help you build with AI.

Once you’re building with AI, you’re doing system design. The sooner you design the system intentionally, the less you’ll be untangling it later.

Skills are the capability layer. The layer above — orchestrating agents to do real work — is where governance starts to matter. If you’ve hit unpredictable agent behavior, Why AI Agents Go Wrong: It’s Not the Model covers why most of those failures are structural, not capability, problems.

FAQ

Q: What is an AI Skill?
An AI Skill is a reusable capability abstraction that wraps model behavior with defined input/output schemas, tool access, validation, and retry logic. Unlike a prompt, it exposes a stable interface that other system components can depend on.

Q: What’s the difference between a prompt and a skill?
A prompt is a raw instruction sent to a model. A Skill is a structured capability contract—it defines inputs, tooling, constraints, and validation around that interaction, making it reusable, testable, and composable within larger systems.

Q: When should I use Skills instead of prompts?
When you need the same behavior across multiple workflows, when you want to test or evaluate model behavior independently, or when you need consistent tooling and retry logic across your system. If you’re using a prompt in more than one place, it’s probably worth wrapping it into a Skill.

Q: How do Skills relate to Agents and Workflows?
Skills sit between the tool layer and the workflow layer. Workflows orchestrate sequences of Skill calls. Agents add dynamic decision-making on top of that—choosing which Skills to invoke based on context. Skills are the stable building blocks both depend on.

只會 Prompt 已經不夠了：從「下指令」到「蓋系統」的思維進化

KbWen — Wed, 01 Apr 2026 00:00:00 +0800

TL;DR： 只靠 Prompt 是在做手工藝，不是蓋系統。這篇文章拆解 Prompt → Skill → Workflow → Agent → System 五個層級，用「寫技術文章」的完整案例，說明每個層級解決什麼問題、為什麼上一層不夠用。

前言：別在 Prompt 的死胡同裡打轉

現在只要打開社群媒體，滿地都是「最強 Prompt 指令集」或「這 10 個指令讓 AI 變神級工具」。

剛接觸 AI 的時候，我也沉迷過這種「咒語」的力量。但實戰幾次後你會發現，如果你還在糾結如何微調 Prompt 的那幾個形容詞，那你其實還是在做「手工藝」。這種方式產出的結果不穩定、無法規模化，更重要的是，它非常耗神。

如果你有留意近一兩年的技術演進，你會發現真正的高手已經不再討論怎麼寫咒語了。大家在聊的是 Workflow（工作流）、Agent（代理人） 以及 System（系統）。

這篇文章我想從一個資深開發者與 PM 的視角，拿一個最簡單的任務——**「寫一篇高品質的技術文章」**做案例，帶你看這幾個層級的思維斷層在哪裡。

1. Prompt 層：一次性的「介面」溝通，也是體力活

這是最基礎的使用方式，你打開 ChatGPT，輸入一段話：

「請幫我寫一篇關於 AI Workflow 的技術文章，包含架構說明與範例。」

這就是 Prompt 層。雖然它很強大，但本質上它只是在「調用模型」。它的限制顯而易見：

抽盲盒效應： 成果好壞全看運氣。今天給你 80 分，明天可能只剩 60 分。
孤島式作業： AI 沒辦法讀取你電腦裡的其他資料，也不懂你的審美標準，它只活在那一次的對話框裡。
認知負荷高： 每次遇到新文章，你都要重新把需求、背景、限制條件再描述一遍。

老實說，Prompt 只是個「對話介面」，而非一個「系統」。 它適合處理小型、零碎、一次性的任務（像是修一個小 Bug 或翻譯短句）。但如果你想靠它穩定產出專業內容，那只是在用 AI 換另一種形式的「體力活」罷了。

2. Skill 層：把「手感」封裝成「能力模組」

當你對 AI 寫作有了點心得，你會發現有些要求是重疊的。這時候，你會開始定義一套固定的「寫作標準」。這在開發者眼中，就是所謂的封裝（Encapsulation）。

例如，你會建立一個「技術寫手」的能力：

預設框架： 必須包含摘要、關鍵技術細節、範例程式碼與結論。
統一風格： 語氣要像資深工程師在分享，不准用太過浮誇的形容詞。
工具聯動： 在動筆前，AI 必須先去搜尋網路上的最新趨勢，並讀取你指定的筆記文件。

當你把這些「咒語」固定下來，並賦予它特定的工具存取權，你就建立了一個 Skill（技能）。在這一層，你的角色從「打字員」變成了「教練」。你不再是給一段話，而是定義一套「標準」。

（Skill 跟 Prompt 的差別不只是「存起來的指令」——它其實是一層有 input schema、工具綁定和驗證邏輯的能力契約。這個區別我在 What Makes an AI Skill Different from a Prompt? 裡拆得更細。）

3. Workflow 層：串接能力的「自動化流水線」

但問題來了，寫出一篇能拿得出手的技術文章，真的只靠一個「寫手技能」就能搞定嗎？

答案是不行。一篇好文章需要：

Research： 廣泛挖掘資料，過濾雜訊。
Structuring： 設計邏輯嚴密的架構，而不是想到哪寫到哪。
Drafting： 基於研究資料與架構進行撰寫。
Editing： 檢查文法、修辭、是否有技術錯誤。
Polishing： 生成吸引人的配圖與 SEO 優化。

這是一條生產流水線（Pipeline）。在 Workflow 層級，你會設計一個流程，讓不同的 Skill 在對的時間點出場：

搜尋資料 → 摘要整理 → 產出大綱 → 分段撰寫 → 交叉校對 → 格式化

這就是我們在軟體開發中常說的任務調度（Orchestration）。當你開始思考 Workflow，你才算真正踏入 AI 自動化的門檻。

4. Agent 層：把「決策權」交給 AI 的代理人模式

Workflow 與 Agent 最大的斷層在於：誰來決定下一步該做什麼？

在 Workflow 中，步驟是人定死的（如果步驟 A 失敗就停下）。但在 Agent（代理人） 模式下，你給的是「目標（Goal）」而非「步驟（Step）」：

「目標：針對 AI Agent 這個主題，產出三篇深度評測，並確認內容與目前市場競爭者不重複。」

這時候，Agent 就會開始它的 Loop（循環）：它會先搜尋，發現資料不足後「自主決定」再去多找幾個來源；它寫完大綱後會自我審核，覺得不夠好就「退回重寫」。

這已經不是你在操作 AI，而是 AI 在操作工具與流程。它具備了一定的自省（Reflection）與決策（Reasoning）能力。

5. System 層：建立一個會自我成長的「數位大腦」

這是目前的終極階段。當你有許多個 Agent 在協作，且背後有統一的知識庫（RAG）、一致的標準與自動化基礎設施時，你就建立了一個 System。

在這個層級，你不再是為了寫一篇文章而努力。你是設計了一個「能自動跟進技術趨勢、自動產出內容、甚至能根據讀者反饋自我優化」的內容工廠。

層級	核心意義	你的真實角色
Prompt	單點互動	數位工具的使用者
Skill	能力模組化	規範標準的教練
Workflow	流程自動化	工廠流水線的設計師
Agent	目標導向決策	指揮若定的系統架構師
System	完整生態運行	數位資產的營運者

結語：停止競爭指令，開始設計系統

未來的差距，不會在於誰記住了更多的 Prompt 語法。隨著模型越來越聰明，對於指令的精確度要求反而會下降。

真正的差距在於：誰能把瑣碎的經驗「封裝」成技能，把混亂的流程「優化」成工作流，最後把這些組合成一個能自動運行的系統。

當 AI 從「聊天的工具」進化成「工作的系統」，我們作為人的價值，就不再是去補足 AI 指令的不足，而是去定義問題、設定目標，並設計出那一套幫我們達成目標的精妙系統。

當你真的開始用 AI 代理做實際開發，「結構不夠」的代價會變得很具體。延伸閱讀：AI 代理常見痛點與我們的嘗試整理了五個反覆出現的問題；只用 Prompt 和技能，也能做到基本治理則是不裝框架就能上手的最低成本做法。

Token 是什麼？LLM 為何只讀 Token？

KbWen — Mon, 01 Dec 2025 16:24:54 +0800

TL;DR： Token 是 LLM 的最小處理單位——不是完整的字，也不是單一字元，而是介於兩者之間的「子詞」。這篇文章解釋三種斷詞方法（字級、字元級、BPE）、為什麼 LLM 不直接讀整個詞，以及 token 數量如何影響計費和 context 長度。

前言

上篇講到LLM，這片就來說說裡面很常提到的字「Token」。Token 是語言模型可理解的最小單位，它像積木一樣把長句拆成小塊，讓模型逐一處理。這篇文章用更平易近人的方式解釋什麼是 token、為何 LLM 不直接處理完整的字詞，以及常見的斷詞方法，幫助你輕鬆掌握這個看似陌生卻無所不在的概念。

Token 是什麼？為何要用它？

LLM 是數學模型，必須把文字轉成向量才能運算。最簡單的做法是把每個單詞賦予一個向量，但這樣會遇到兩個問題：

無法處理新詞或拼錯字：如果訓練時沒有見過某個單字，模型就不知道如何表示它。
忽略語素結構：許多語言中，一個詞可以拆成詞根和詞綴，例如「running」「runner」都來自「run」。

為了兼顧彈性與效率，LLM 會先把輸入拆解成更小的 token。有人將 token 定義是「字、字元或包含標點的組合」。有些文中也強調，token 是模型用來處理文字的原子單位。透過 token，模型得以把複雜的語言拆成固定大小的向量，並對每個 token 指派唯一編號。

幾種常見的斷詞方法

不同 LLM 可能採用不同的分割策略。以下三種是最常見的斷詞方法：

字級（Word）：按空格切割。例如 “unbelievable performance” 被當作兩個 token。優點是數量少，但遇到新詞就無法處理。
字元級（Character）：每個字母和空白都是一個 token。它能處理任何輸入，但 token 數大幅增加，效率低下。
子詞級（Subword）：介於上述兩者之間，把常見詞根或片段視為 token，是現在主流 LLM 的做法。例如 “unbelievable performance” 可以拆成 ["un", "bel", "iev", "able", "per", "form", "ance"]。

圖中展示同一句話經過三種方法切分後的樣子：

把詞拆成小塊，看出不同斷詞方式產生的 token 數量差異。

簡易 Python 範例：手寫子詞切分

以下程式碼示範如何使用簡單的片段詞表（模擬 BPE 結果）把長詞拆成 token。一樣，雖然不是完整的演算法，但能幫你理解 tokenization 的動作。

# 定義一組常見片段
subwords = ["un", "bel", "iev", "able", "per", "form", "ance"]

# 簡易子詞切分函式
def tokenize_subwords(text, subwords):
    tokens = []
    i = 0
    while i < len(text):
        match = None
        for sw in sorted(subwords, key=len, reverse=True):
            if text[i:].startswith(sw):
                match = sw
                break
        if match:
            tokens.append(match)
            i += len(match)
        else:
            tokens.append(text[i])
            i += 1
    return tokens

# 輸入與輸出示範
print(tokenize_subwords("unbelievable performance".replace(" ", ""), subwords))
# 可能輸出: ['un', 'bel', 'iev', 'able', 'per', 'form', 'ance']

每次優先匹配片段詞表中最長的項目，若無匹配則輸出單個字母，呈現出子詞分割的概念。

注意事項

上下文長度有限：LLM 的輸入與輸出 token 有固定上限。如果採用字元分割，同樣一段話會產生更多 token，導致可用的輸出長度變短。
不同模型斷詞規則不同：GPT-3、GPT-4 可能用 BPE，其他模型可能用 WordPiece 或 SentencePiece；結果不同會影響 token 數量與成本。
計價與速率限制：許多雲端服務依 token 數量計費，並對每分鐘 token 數有上限。了解如何計算 token 有助於預估使用成本。（這個成本問題在 AI 代理上會放大——一個複雜任務的治理開銷可能達五萬個 token。實際量測數字見 AI 代理常見痛點與我們的嘗試。）

實際應用

在實際應用中，tokenization 是許多 NLP 任務的基礎。以下列出幾個典型場景：

文字理解與問答：斷詞讓 LLM 能夠理解句子中詞語的關係，支援資訊檢索與問答系統。
文本分類與分析：經過 tokenization 的文本可以餵給分類模型用於垃圾郵件偵測、情緒分析或主題分類。
機器翻譯：子詞斷詞能將罕見或複雜的單字拆分，協助模型在源語言與目標語言之間轉換。
命名實體辨識：將句子拆成 token 是辨識人名、日期、地點等實體的前置步驟。
摘要與聊天機器人：Tokenization 讓模型把長文件拆成可管理的片段進行摘要，也能在對話場景中快速理解使用者輸入。

結語

Token 是大語言模型能理解和輸出的基本單位。透過適當的 tokenization，模型能在見識有限的情況下理解新詞、處理不同語言並產生連貫的文字。熟悉斷詞方法的差異與限制，能幫助我們更有效率地與 LLM 互動並控制成本。希望這篇文章用更簡潔的方式讓你理解 token 的本質，也幫你在未來使用 AI 時更有概念。

《大語言模型 LLM：其實做的事情比你想像中更單純》

KbWen — Sun, 23 Nov 2025 20:54:42 +0800

TL;DR： LLM 只做一件事——預測下一個 token。這篇文章從這個核心概念出發，解析 Transformer 自注意力機制、四步驟訓練流程，以及為什麼「這麼簡單的事」能演變成看起來像魔法的語言能力。

前言 Introduction

如果你最近有用過 ChatGPT、Claude、Gemini，你已經在跟 LLM（Large Language Model）聊天了。這些模型看起來像懂很多、會推理、甚至比朋友還健談，但它們的核心動作其實無比樸實：預測下一個字。聽起來太簡單？沒錯，但模型規模一大、資料一多、演算法一調整，這個「下一字遊戲」就能演變成看起來像魔法的語言能力。這篇文章會用工程師看得順、初學者不會暈的方式，把 LLM 的概念、原理與常見應用一次講清楚。

LLM 是什麼？

LLM 的任務比你想像的還簡單

從理論上看，LLM 是一種深度學習模型，被訓練去完成一件事情：

在語境下，挑選「最可能出現的下一個 token」。

token 可以是中文字、英文單字的一部分、符號、甚至數字。當模型知道怎麼選下一個 token，然後不停重複這件事，就能組出一整段看起來像人寫的句子。

為什麼它看起來「懂很多」？

因為它被餵了大量內容：百科、文章、科技文、論壇討論…… 在海量語料裡找模式後，它自然會「講得像很懂」。我們的感官上就感覺它懂很多、很能理解。

圖 1：LLM 下一字預測核心概念示意圖

LLM 是怎麼「學會」語言的？

LLM 的學習流程大致分成四個步驟，其實蠻務實的：

1. 收集大量文本（資料越多，模型越穩）

來源包含書籍、文章、程式碼、論壇、維基百科等。資料不是越亂越好，但越多越有機會讀懂語言中的隱性規律。

2. 分詞（Tokenization）

模型不直接處理字，而是處理 token。你可以把它想像成：「把一個蛋糕切成很多比較好吞的碎片」。

3. 預測下一個 token（核心任務）

模型會計算所有候選 token 的機率：

哪個最可能？
哪個跟前文最適合？
哪個不太會讓模型出糗？

機率最高者 → 輸出。

4. 誤差反向調整（Backpropagation）

預測錯了？ → 重新調參 → 再預測 → 再調 → 重複幾十億次

這就是 LLM 的訓練人生。

圖 2：Transformer 自注意力（Self-Attention）概念圖

為什麼 LLM 比舊模型聰明？（Transformer 的魔法）

重點只有一句：

Transformer 讓模型能「一次理解整段內容」，而不是從左讀到右。

這也是它比 RNN/LSTM 更成功的原因。 Attention 讓模型可以決定「這句話裡誰比較重要」。例如：

「我昨天在百貨公司遇到他的媽媽。」

模型要知道「他的」指誰？ Attention 就是用來處理這種依存關係。

LLM 的常見應用

1. 對話生成

你跟 ChatGPT 打字，它回你話，就是 LLM。

2. 自動摘要

幾千字的文章，壓成幾句話。

3. 程式碼生成

模型讀你的描述 → 補出 Python、JS、C++ 等程式碼。

4. 文案／腳本創作

IG 內容、企劃書、劇本、商業信……全都能做。

5. 搜尋強化（Retrieval）

結合資料庫讓模型能「查」不是「猜」。

小示範：一個極簡的「下一字預測」模型（Python）

這不是 LLM，但有助理解它在做的事情。

程式碼用途

示範「根據前文，以最常見模式預測下一個詞」。

import random

# 超簡化版「下一詞預測」
def predict_next_word(text):
    dataset = {
        "我想": ["吃飯", "睡覺", "休息一下"],
        "今天": ["天氣", "工作", "進度"],
        "AI": ["模型", "應用", "技術"]
    }
    key = text[-2:]  # 抓最後兩字
    return random.choice(dataset.get(key, ["（不知道下一個字）"]))

print(predict_next_word("我想"))

輸出可能結果

吃飯
睡覺
休息一下

行為解釋

當然，這個模型沒有「理解」。它只是根據常見搭配「猜」下一個字。 LLM 就是把這件事做到極致。

圖 3：LLM 訓練流程 4 步驟示意圖

LLM 的限制

它不懂世界，只懂資料裡的統計規律。
有時會亂講（Hallucination）。
語氣、觀點會受到訓練資料影響。
對日期、最新事件通常不可靠。
無法真正推理，只是推測看起來像推理的下一段內容。

結語

LLM 看起來強大，但核心原理其實異常單純：預測下一個可能的 token。借助 Transformer、海量資料與超大算力，小小的「下一字遊戲」才能長成今天這個會聊天、會寫程式、會創作的模型。理解這個本質後，你會更理性地評估它的能力，也更知道如何善用它——而不是被它的「看起來很像懂很多」給騙了。理解了「LLM 本質上是機率預測」之後，下一個問題是：當你把這種非決定性的模型放進實際開發流程，會發生什麼？這正是 AI 代理常見痛點與我們的嘗試想回答的。

AI Systems on KbWen Blog

MCP 資安危機：問題不在協定，而在治理

先講清楚：MCP 為什麼會贏

然後資安研究員開始拆它

關鍵爭議：「設計如此」算不算卸責

我的看法：這不是協定的 bug，是治理的缺口

那正在補的東西，其實就是治理

給實作者的幾條原則

最後

延伸閱讀

MCP Security Isn't a Protocol Bug. It's a Governance Problem.

First, give MCP its due

Then the security researchers took it apart

The real argument: is “by design” a cop-out?

My take: this is a governance gap, not a protocol bug

The patches landing now are just governance

A few rules if you’re shipping on MCP

The takeaway

Read next

Skill 邊界設計:從能力到合約

能力清單 vs 合約

合約先行:先想清楚它不該做什麼

邊界鬆掉,其實是一次沒講的破壞性變更

我把一個 skill 從能力改成合約的前後

還在摸索

延伸閱讀

Skill Design as Interface Design

A skill is a contract, not a capability list

The principles that transfer

Scope drift is a breaking change

The boundary has a token price too

Token 成本的真相:分級,但別分太細

Token 成本是開始任務前就決定的

粗放的成本沒有上限

我原本以為切細一點就省,結果不是

我實際上怎麼分級

還沒定論

延伸閱讀

Token Economics of AI Agent Governance

Token cost is a design variable, not an afterthought

Bounded cost versus unbounded cost

Granularity has a cost too: caching changes the math

The SLA/SLO parallel

No evidence, no completion

What “evidence” means here

Why one rule covers so much

Before the task, not after

What good evidence looks like

The cost of specifying evidence

The question to ask before your next task

Read next

Work Log：跨 session 的記憶機制

兩層記憶，兩個問題

Work Log 長什麼樣子

哪些東西值得記

這個方法的真實限制

如果你想試試看

Prior art: what distributed systems already knows

What agent execution looks like from the outside

The pattern mapping

The one place the analogy breaks

Where to instrument

Which gaps cost the most

只用 Prompt 和技能，也能做到基本治理

記憶檔案：解決跨 session 失憶的最低成本方案

Skill 選擇：要求愈具體，干擾愈少

Evidence 習慣：不問則不說

範圍宣告：先說不要碰什麼

這個層面的治理做到什麼、做不到什麼

延伸閱讀

Why AI Agents Go Wrong: It's Not the Model

Two failure modes that look the same

The five structural gaps

Engineering already solved these problems

What governance actually costs

The question to ask before the task starts

Read next

AI 代理常見痛點與我們的嘗試

幾個反覆出現的痛點

輸出難以核查