Claude Code on KbWen Blog

No evidence, no completion

KbWen — Fri, 22 May 2026 20:00:00 +0800

TL;DR: “No evidence, no completion” is a single structural principle: a task isn’t done until the agent produces an artifact that exists outside the conversation and can be checked independently. It sounds trivial. In practice it closes most of the common agent failure modes in one rule, because the act of specifying what evidence looks like, before the task runs, forces you to define what “done” actually means.

In the previous post in this series I described an agent that said a feature was done (commit SHA requested, none existed, two of three modules unchanged). The failure had a name: no external completion criterion existed, so the agent supplied its own. That gap has a one-rule fix.

What “evidence” means here

Evidence is any artifact that exists outside the conversation and can be verified independently of what the agent said.

A commit SHA is evidence. A test output is evidence. A file path with a checksum is evidence. A screenshot of a passing CI run is evidence.

“I implemented it” is not evidence. “The feature is working” is not evidence. A description of what the agent did is not evidence: it’s the agent’s own assessment of its work, which is exactly what you’re trying to verify.

The distinction matters because conversation text is not auditable. It exists only within the session, can’t be pointed to by anyone who wasn’t there, and doesn’t prove the underlying state of the system. An artifact external to the conversation can be checked at any time, by anyone, against the actual state.

Why one rule covers so much

The first post in this series catalogued five structural gaps: no completion criterion, no phase gate, no state handoff, no resource scoping, no capability boundary. The evidence principle doesn’t replace all of them, but it forces the most important one: you cannot specify what evidence looks like without first deciding what “done” means.

If the evidence for a feature task is “passing tests + commit SHA on the feature branch,” you’ve implicitly defined the completion criterion, the scope boundary (the feature branch, not the main codebase), and a checkpoint for the phase gate. The evidence requirement is the handle that pulls the rest of the structure into place.

This is why the distributed systems framing maps so cleanly: delivery acknowledgment in a message queue is exactly this pattern. The queue doesn’t trust the worker’s internal state; it requires an external signal that the job completed. Decades of production systems run on that principle because systems without it fail in the same predictable way.

Before the task, not after

The principle works when it’s applied before the task starts, not as a review step after.

“What would prove this is done?” asked before the work begins forces a design decision. It’s not a check on the agent — it’s a check on the task specification. If you can’t answer it, the task isn’t specified well enough to run. If you can answer it but the answer is vague (“the feature works”), the vagueness is in your specification, not in the agent’s execution.

This is the mechanism Pete Hodgson’s analysis of AI coding tools points toward: when a problem has many valid solutions, the agent will pick one. That one will probably be valid. It probably won’t be the one you wanted. Specifying evidence before the task runs is a way of narrowing the solution space — the agent’s output has to satisfy the evidence criterion, which eliminates the paths that don’t.

In practice: “implement email verification” with no evidence criterion produces one kind of output. “Implement email verification — done when: (1) tests pass for OTP generation and expiry, (2) commit SHA on feat/email-verification” produces a different one. Same model. Different structure around it.

What good evidence looks like

Evidence should be:

External to the conversation. It can be retrieved or verified by someone who wasn’t in the session. A commit SHA can be looked up. A test output can be reproduced. A URL can be visited.

Specific enough to be falsifiable. “Tests pass” is weaker than “running npm test returns exit 0 with 47 tests passing.” The second can be false in a way that “tests pass” can’t — which is the point. If the evidence criterion can’t be falsified, it’s not doing the work.

Proportional to the task. A one-line bug fix doesn’t need a full audit trail. The evidence for a tiny fix is the commit SHA and a grep confirming the old string is gone. The evidence for a feature touching auth, API, and database schema is more involved: test output, migration SHA, API contract diff. The Agentic OS framework classifies tasks before they run partly to route to the appropriate evidence format: a quick-win task and an architecture-change task need different levels of proof.

The cost of specifying evidence

Specifying evidence costs something up front. It takes maybe two minutes to think through “what would prove this is done” before a task starts. That’s real overhead.

The comparison is with recovery cost. A governance failure (completing a task that didn’t actually complete, or completing it the wrong way) typically costs: discovering the error, rebuilding context, rerunning the work, and auditing scope. None of those costs are bounded. The two minutes up front is.

The Agentic OS v1.1 benchmark (April 2026, using chars/4 as the token estimation formula, ±10%) measured governance overhead for a quick-win task at roughly 17,000 tokens: the cost of the full structured lifecycle, evidence requirement included. For a complex feature spanning API design, auth, and database schema, it’s around 51,000 tokens. Those numbers are real costs. They’re also the ceiling. The cost of an undetected wrong completion has no ceiling — it depends on when you find it and how much work built on top of it.

The question to ask before your next task

Before you give an agent its next task: what artifact would prove this is done?

Not “what would it mean to be done” — that’s vague enough for the agent to fill in. What specific artifact, external to the conversation, would you point to afterward and say: here is the evidence this completed correctly.

If you have an answer, you have a completion criterion. If you don’t, you’re delegating the definition of “done” to the agent. It will define one. It almost never matches yours.

This post is part of a series on building real AI systems. The previous posts cover the two-failure taxonomy and the distributed systems prior art that motivates the evidence requirement. The framework is open source at github.com/KbWen/agentic-os.

Work Log：跨 session 的記憶機制

KbWen — Fri, 22 May 2026 18:00:00 +0800

TL;DR： Work Log 是一個很無聊的東西：一份 markdown 檔案，記錄這個任務做到哪裡、做了哪些決定、下個 session 要從哪裡繼續。它沒有解決 AI 的記憶問題，只是繞過它。但在我們找到更好的方法之前，它有效。

在那篇談治理基礎的文章裡，我說過 AI「只活在那一次的對話框裡」。這個說法的代價，在你真正開始用 AI 代理做持續開發的時候才會變得具體。

你跟 AI 花了一個小時討論這個 feature 要走哪個設計模式、為什麼不用另一個方案、資料庫的 schema 要怎麼調整。全部討論清楚了，開始實作。隔天開新對話，繼續做。AI 從頭來：哪個設計模式？資料庫？我不知道你說的是什麼。

這不是 Claude 的問題，也不是任何特定模型的問題。它就是這樣運作的。上一篇說到記憶檔案（CLAUDE.md、AGENTS.md）可以幫助 AI 記住專案的架構規則。但那解決的是「規則要記住」的問題，不是「這個任務做到哪裡」的問題。Work Log 是後者。

兩層記憶，兩個問題

先說清楚兩個東西的差別，因為我發現自己最初混在一起想。

專案記憶：這個專案的架構是什麼、用了哪些 ADR、活躍的任務清單在哪裡、哪些 skill 可以用。這是全域的、靜態的，跟任何一個具體任務無關。你不常去動它，但每次開新 session，AI 需要讀它才知道自己在什麼脈絡裡。

任務記憶（Work Log）：這個任務做到哪個 phase、做了哪些決定、下一個 session 要從哪裡繼續。它是動態的、per-task 的。一個任務一個檔案，在 Agentic OS 裡放在 .agentcortex/context/work/.md（完整結構見 repo）。

混在一起的後果是：要麼全域狀態被塞滿具體任務細節（之後沒人看得懂），要麼任務進度沒地方記（每次都從頭）。分開之後，兩個問題各有各的解。

Work Log 長什麼樣子

下面是一個簡化版的實際樣子（來自 github.com/KbWen/agentic-os）：

# Work Log: feat/email-verification

## Header
- Branch: feat/email-verification
- Classification: feature
- Current Phase: implement
- Checkpoint SHA: a3f9c12

## Task Description
新增 email OTP 驗證流程。使用者第一次登入後需完成驗證，
未驗證帳號只能讀取，不能寫入。

## Phase Sequence
| Phase     | Status      | Notes                    |
|-----------|-------------|--------------------------|
| bootstrap | completed   | 分類為 feature           |
| plan      | completed   | 確認走 OTP 不走 magic link |
| implement | in-progress | auth module 完成，email 發送待測 |
| review    | pending     |                          |

## Gate Evidence
- Gate: plan | Verdict: pass | At: 2026-05-10T14:00Z
- Gate: implement | Verdict: FAIL | Reason: email sending untested, scope not complete | At: 2026-05-11T09:00Z

## Phase Summary
- plan: 討論了 OTP vs magic link。決定用 OTP，因為我們的 email
  provider 有速率限制，magic link 的 retry 設計複雜度更高。
  這個決定要記住，下個 session 不要再討論。

關鍵不在格式，在那個 Phase Summary。每個完成的 phase，AI 要用一段話說：做了什麼決定、為什麼這樣決定、有什麼取捨。

這段話的作用不是給人讀的，是給下一個 session 的 AI 讀的。這個差別值得注意：給人讀的語言習慣加很多語境鋪墊，給 AI 讀的語言要決策密度高、歧義少。同樣一個決定，給人讀可能寫「考量效率後決定用 OTP」，給 AI 讀更好的形式是「用 OTP，不用 magic link，原因：provider rate limit + magic link retry 複雜度高，此決定封存，不再重新討論」。後者直接進入 AI 的 context，前者需要它自己推斷。

新對話開始，AI 讀了這份 Work Log，知道「OTP vs magic link 已經決定過了，不用再想」。它不會再建議你改用 magic link，因為那個決策已經被記錄並封存。

哪些東西值得記

不是所有事情都要寫進 Work Log。先說限制：一份幾百行的 Work Log，AI 讀完之後注意力也稀釋了。我們設定的上限是每個 Phase Summary 一段話，不超過五句。有這個上限，才值得想清楚什麼東西最值得占那個位置。

從我的觀察來看，排最前面的是這幾樣：

決策，尤其是否定的決策。 你決定不做某件事的原因，比你決定做某件事更容易被遺忘。「我們用 OTP 不用 magic link，因為 rate limit 問題」——如果沒記，下個 session 的 AI 大概又會建議 magic link。

當前 phase 的狀態。 做到哪一步、什麼東西是完成的、什麼還沒做。這讓新 session 可以從中間接續，不是從頭。

還有一類東西最容易被忽略：你在 implement phase 發現了一個沒有答案的問題。不要默默繼續，也不要讓 AI 自己想辦法繞過去。寫進去，下個 session 一開始就正面面對它。這類問題如果沒記，AI 下次遇到同一個岔路，十之八九走錯方向——不是因為它笨，是因為它不知道你已經知道那條路走不通。

這個方法的真實限制

說清楚它做不到的事，比說它能做什麼更重要。

Work Log 解決的是「把決策外化」的問題。AI 把決策寫在外部，下次讀回來，行為才一致。它沒有解決 AI 的狀態記憶問題，因為那個問題的解決需要模型架構層面的改變。Work Log 只是個繞路方案：既然 context 不能跨 session 存活，我們就把最重要的東西寫成文件，在 session 開始時重新注入。

這個繞路有個天花板。任務夠複雜的時候，Work Log 本身也會膨脹。你開始注意到你在寫一份文件，讓 AI 讀這份文件，再根據它繼續工作——而不是直接繼續工作。整個過程變得笨重。

還有一個現在可能不用擔心、但以後要注意的事：prompt cache 機制。Claude 和其他主流模型都有 prompt cache，在同一個 session 內重用相同 context 的成本很低（以 Claude 為例，cache TTL 大約是 5 分鐘到 1 小時）。如果你的任務可以在一個 session 裡完成，Work Log 的 ROI 其實有限——cache 幫你保住了 context，不用依賴外部記錄。Work Log 真正發揮的地方，是跨越多個 session 的任務，也就是 cache 早就失效的那種。

我們把 Agentic OS 的 Work Log 定位為一個夠用的暫時解，不是最終答案。AI 工具的 native memory 機制在快速發展，現在適合加 Work Log 的任務類型，一兩年後可能模型自己就能處理。這個觀察在整個系列的第一篇裡也說過：任何固化的解法都有保鮮期。

如果你想試試看

最輕量的開始：開一個任務，在任務開始前新增一個 markdown 檔案。三個區塊就夠：任務目標（一句話）、已決定的事（每次做了決定就加一行）、目前停在哪裡（每次結束 session 更新）。

不需要完整的 Work Log 格式。這三個區塊能擋掉大部分「下個 session 從頭來」的問題，原因很簡單——它逼你在 session 結束之前把當前狀態說清楚，而不是留給下一個 AI 自己猜。做複雜了再引入完整的 phase 結構，不用一開始就全部上。

Agentic OS 完整的 Work Log 模板在這裡，包含 phase 定義、gate evidence 格式和 handoff 結構。如果你每次開新對話的前 10 到 15 分鐘都在重新交代背景，而不是做事，那就是加 Work Log 的時機了。

這篇是 Agentic OS 系列的一部分。相關閱讀：只用 Prompt 和技能，也能做到基本治理說的是更輕量的做法，Work Log 是在那個基礎上加一層。AI 代理常見痛點與我們的嘗試是這個系列的入口。

只用 Prompt 和技能，也能做到基本治理

KbWen — Fri, 22 May 2026 14:00:00 +0800

TL;DR： 在裝任何框架之前，有一層治理是免費的：在專案根目錄放一個 AGENTS.md 或 CLAUDE.md，養成開口要求 evidence 的習慣，開始任務前先說清楚什麼不能動。這三件事不能替代跨 session 的狀態管理，但能擋掉大部分常見問題。這篇說的就是怎麼做、做到什麼程度、在哪裡會失效。

有一段時間我的 Claude Code 工作流裡沒有任何框架，只有對話和一堆臨時 prompt。某天我做了兩個改變：把專案的架構決策寫進一個 CLAUDE.md，還有在每次 AI 說「好了」的時候問一句「commit SHA 是什麼？」

一類問題幾乎消失了：AI 在新 session 裡對著不存在的設計模式寫程式碼的情況，以及我接受了「完成」卻發現什麼都沒變的情況。不是所有問題都解決了。但那兩件事的性價比，讓我後來開始認真想「在裝框架之前，這個層面的治理到底能做多少」。

這篇是AI 代理常見痛點與我們的嘗試的延伸。那篇列了五個反覆出現的問題，這篇專門回答：只靠 prompt 習慣和 skill 選擇，能解決多少？

記憶檔案：解決跨 session 失憶的最低成本方案

AI 代理在每一個新對話都是空白狀態。它不記得上次的架構決策，不記得你說過不要用哪個 pattern，也不記得你已經有一個 utils/auth.ts，所以它再寫一個新的。這個問題在 IEEE Spectrum 的報導裡有量測數據：長 session 後期，AI 重複生成已存在函式、忽視早期建立的 coding convention 的頻率明顯上升。

三個工具在試圖解決同一個問題：

AGENTS.md 是 OpenAI Codex 最初設計的慣例，後來被 Cursor、GitHub Copilot 和 Google Antigravity 等主流工具廣泛採納。它的設計邏輯是：在任何工具讀取它之前，先告訴工具「這個專案是怎麼運作的、你可以做什麼、不可以做什麼」。

CLAUDE.md 是 Anthropic 針對 Claude Code 的版本。Claude Code 在每個新 session 開始時自動注入這個檔案的內容，所以你放在這裡的東西就等於是每次都在對話開頭重新說一遍。

.cursor/rules 是 Cursor 的對應物。原理相同。

這三個慣例同時存在，說明「怎麼讓 AI 記住專案規則」這個問題是通用的，不是某個工具特有的。選哪個取決於你主要用什麼工具，你不需要三個都放，放一個就有效果。

這類記憶檔案最有用的內容通常是三類：架構限制（「這個 repo 用 Repository pattern，不要把業務邏輯寫進 controller」）、命名規範（「service 命名用 XxxService，不要用 XxxManager」）、以及「不要碰」清單（「/database/migrations 只有在明確被要求的時候才能動」）。

一個重要的注意：這類檔案要短。研究觀察和實踐都指向同一個上限：200 行、2000 token 以內。超過這個長度，重要的規則會被稀釋。AI 技術上還是讀了整個檔案，但前面讀到的東西到後面已經注意力不足。寫 CLAUDE.md 的時候，如果你覺得需要加第六條規則，先問自己第一條能不能刪掉。

Skill 選擇：要求愈具體，干擾愈少

在 Claude Code 或 Cursor 的一次工作 session 裡，你可以載入很多 context：整個 codebase 的 README、過去的對話歷史、多個技能文件。但「載入愈多愈好」是個陷阱。

一個改一行 typo 的任務，不需要知道整套測試策略、部署規範和 API 設計原則。把這些全部塞進 context，不會讓 AI 更謹慎，只會讓它在「哪些規則現在適用」這件事上分配更少的注意力給真正重要的那個。

這不是 Agentic OS 特有的問題，是任何 Claude Code 或 Cursor session 都存在的情況。具體做法是：開始一個任務之前，先想清楚這個任務需要知道什麼，然後只提供那些。一個 tiny-fix 說「這是那行 code，幫我修」就夠了；一個涉及多個模組的功能開發才需要交代設計模式、測試策略和資料庫規範。

結果是給了 AI 密度更高的相關資訊，不是更少的資訊。

Evidence 習慣：不問則不說

這是成本最低的一個改變，也是讓我最驚訝的一個。

AI 說「完成了」的時候，它有可能真的完成了，也有可能完成了 90% 然後遇到小問題就繞過去了，也有可能整個理解方向就錯了。這三種情況在它的輸出裡，有時候看起來幾乎一樣。

養成一個習慣：在接受任何「完成」之前，要求一個具體的 artifact。

不是一個表單，也不是一套流程，就一句話：「commit SHA 是什麼？」「把 test 跑一遍，貼輸出給我」「你改了哪個檔案，第幾行？」

這個習慣有效的原因不只是讓你可以查。問這個問題本身會讓 AI 把它沒說清楚的地方說出來。 很多時候，我問「測試有過嗎」，它才會說「啊，那個測試我還沒跑，因為 X 的 setup 有問題」，而這個資訊如果我沒問，它可能就默默略過了。

誠實地說：這個習慣很累。問了十幾次之後你開始理解為什麼人們想要自動化這件事，框架裡的 evidence gate 就是把這個問答自動執行。但作為一個習慣，它能擋掉大概六七成的「接受了看起來完成的東西、後來發現沒有」的情況。

範圍宣告：先說不要碰什麼

開始一個複雜任務之前，明確告訴 AI 它應該不要碰什麼。

具體的說法比模糊的說法有效：「你在做 authentication module。除非我明確說，不要碰 /api/payments 和 /database/migrations 底下的任何東西」比「專注在 auth 就好」有用得多。

原因不完全是「AI 會遵守」，它不是每次都遵守。而是宣告了邊界之後，AI 在不確定的時候開始問問題而不是自己決定。我給了這樣的指令之後，在它原本會直接去改 payments module 的地方，它變成問我「這邊需要我更新 payment 的驗證邏輯嗎？」——這個轉變很有價值。

這個觀察跟 Pete Hodgson 對 AI coding assistant 失效模式的分析有直接的關係：當一個問題存在很多可能的解法，AI 選中你心目中那個的機率趨近於零。把解法空間縮小（也包括把「不能碰的部分」明確劃出來），大幅提高了它走向你要的方向的機率。這是流程問題，跟模型能力無關。

在從「下指令」到「蓋系統」裡，我說過「AI 只活在那一次的對話框裡」。宣告範圍是在這個限制之內，盡量讓它知道那個對話框的邊界在哪裡。

這個層面的治理做到什麼、做不到什麼

做得到的：讓 AI 在新 session 裡記得你的架構決策（記憶檔案）。讓它在不確定的時候問你而不是自己做決定（範圍宣告）。讓你在接受輸出之前有一個具體的查核點（evidence 習慣）。把技能文件控制在合理長度，避免注意力被稀釋（skill 選擇）。

做不到的是跨 session 的連貫狀態。記憶檔案解決的是「規則記得住」的問題，不是「上次做到哪裡」的問題。如果你的任務橫跨多個 session，每次開始你還是要手動交代背景——或者接受 AI 從頭重推一遍。Evidence 習慣的疲勞感也是真實的：問個五十次之後，你會想要自動化。這不是壞事——這是你已經知道在哪裡需要更正式的結構的訊號。範圍宣告在複雜任務下同樣會降解，涉及的模組愈多，「先說不要碰什麼」就愈難窮舉。

這個層面的治理是真實的，不是「沒有框架的窮人版」。但它有天花板。當你開始覺得每次的 context 交接很重複、evidence 問答讓你厭倦、範圍宣告的清單比任務本身還長——那就是你已經碰到這個層面的邊界了。

下一篇：Work Log：跨 session 的記憶機制

Agentic OS 是開源專案，記憶檔案的範本和設計說明都在這裡：github.com/KbWen/agentic-os

Why AI Agents Go Wrong: It's Not the Model

KbWen — Fri, 22 May 2026 12:00:00 +0800

TL;DR: “The agent did something wrong” usually gets diagnosed as a model problem. Most of the time it isn’t. Capability failures (wrong reasoning) and governance failures (no structure to catch wrong reasoning) look identical from the outside but need completely different fixes. This post is about telling them apart, and why most teams are currently solving the wrong one.

The agent said the feature was done. I asked for the commit SHA. There wasn’t one. When I checked the branch, two of the three modules it described implementing hadn’t changed.

The instinct in that moment is to reach for a better prompt, a smarter model, maybe a different tool call. That instinct is usually wrong.

What happened wasn’t a reasoning failure. The agent completed exactly the task it was given, interpreted through its own completion criterion, because no explicit one existed. There was no audit trail to check what it actually did. There was no scope boundary to constrain what “done” even meant. The model behaved correctly inside a system that gave it no structure to behave correctly toward.

That’s a governance failure, not a capability failure. And the fix is not a better model.

Two failure modes that look the same

When an agent produces bad output, the failure is almost always categorized as one thing: the AI got it wrong. Which leads to one solution category: better AI.

The problem is that “the AI got it wrong” conflates two distinct failure modes that have nothing to do with each other.

Capability failure: the model reasoned incorrectly. It missed a constraint, hallucinated a fact, drew a wrong inference. The fix lives in the model layer: better prompt, better retrieval, better fine-tuning, sometimes a more capable model.

Governance failure: the system had no invariant to catch or prevent what the agent did. The agent may have reasoned perfectly well and still produced a wrong outcome, because the surrounding structure gave it nothing to constrain against.

There’s a useful diagnostic test: would a smarter model have prevented this?

If yes, if the failure was clearly about incorrect reasoning or a factual miss, that’s a capability failure.

If no, if a brilliant expert given the same underspecified task would have made the same wrong choice, or a different wrong choice, because the task itself had no defined success condition. That’s a governance failure. Upgrading the model doesn’t help.

Most of the “unpredictable agent” complaints I’ve seen are governance failures. The problem gets framed as model unreliability because that’s what’s visible. The actual cause is invisible: the absence of structure.

The five structural gaps

These are the governance gaps that show up repeatedly, not as edge cases, but as the default state of most agent deployments. The zh-TW companion post AI 代理常見痛點與我們的嘗試 goes deeper on each one with narrative examples. Here I want to name the structural invariant that’s missing in each case.

Output not verifiable → no completion criterion or audit trail. The agent says “done.” You have no artifact to check against. The agent’s word that something happened is not evidence that it happened. The missing invariant: every task completion requires an attached evidence artifact: a file path, a commit SHA, a test result, something external to the conversation.

Steps skipped → no phase gate. Given a complex task, agents move toward output by the shortest path. Scope-setting, dependency mapping, impact analysis (anything that doesn’t look like “doing the thing”) gets skipped. The missing invariant: phases with entry and exit conditions that must be satisfied before proceeding. Pete Hodgson has written about this from an angle worth noting: when a problem has many valid solutions, the probability that an agent independently arrives at the one you actually wanted approaches zero. Pre-alignment isn’t overhead. It’s the phase gate that prevents redoing work.

Cross-session amnesia → no state handoff mechanism. Every new conversation is a blank slate. Decisions made in session one are unknown in session two. The agent rediscovers problems you’ve already solved, proposes patterns you’ve already rejected, rebuilds context you’ve already paid to build. An IEEE Spectrum report on AI coding tools documented this concretely: in longer sessions, agents increasingly regenerated functions that already existed and ignored conventions established earlier in the same session. The missing invariant: a structured work log that carries decisions forward across session boundaries. The mechanism we use is stupid-simple. It’s essentially forcing the agent to keep a diary. That description isn’t flattering, but cross-session amnesia is real enough that stupid-simple works.

Unbounded token cost → no resource scoping. An agent given a large task will read everything it can find, activate every relevant capability, and use as much context as the task allows it to justify. Without resource scoping, costs are unpredictable and you have no way to set expectations before a task starts. The missing invariant: task classification that routes to appropriately sized execution paths before the task begins.

Scope creep → no capability boundary. This is the quietest failure mode. The agent does what you asked, and also reorganizes a module you didn’t ask it to touch, and also “helpfully” updates a config file while it was in the neighborhood. Security researcher Johann Rehberger (Embrace the Red) made this failure mode concrete in April 2025 when he spent $500 testing Devin AI’s response to embedded instructions in GitHub issues, then reported the results to Cognition: 84–85% of attacks succeeded in getting the agent to execute actions outside the intended scope. That’s an extreme case, but the everyday version of this (the agent quietly expanding what “done” means) is the same structural gap. The missing invariant: explicit capability boundaries that define what the agent is allowed to do, not just what it’s been asked to do.

None of these gaps are model problems. A more capable model, given the same absent structure, makes the same category of errors, just more convincingly.

Engineering already solved these problems

These aren’t new problems with new solutions. They’re old problems that software engineering solved decades ago, applied to a different execution substrate.

Governance gap	Engineering equivalent
No completion criterion	CI gate: no merge without passing checks
No phase gate	PR review requirement: code doesn’t ship without sign-off
No state handoff	Audit log / ADR: decisions are recorded, not reconstructed
No resource scoping	Budget / SLA: bounded cost before work starts
No capability boundary	Principle of least privilege: access limited to what the task requires

The analogy isn’t decorative. These are the same structural mechanisms. Building the evidence requirement for Agentic OS, I kept writing things that felt novel until I realized I was describing CI gates and audit logs with different names. The insight wasn’t new. It was just late.

A CI gate doesn’t trust the developer’s word that the tests pass. It requires evidence. An audit log records decisions at the time they’re made, so they don’t need to be reconstructed from memory later. Least privilege limits what an agent can touch, not out of distrust, but to contain the blast radius when something goes wrong.

The AGENTS.md convention, now adopted across Claude Code, Cursor, and GitHub Copilot as a standard way for agents to load project context, is essentially a machine-readable project governance document. It’s the same idea as a team’s architecture decision record, but in a format the agent reads automatically. That’s not a coincidence. It’s the same structural need surfacing in a new context.

What’s missing in most agent deployments isn’t better AI. It’s the application of mechanisms that software engineering already knows work.

What governance actually costs

“Adding structure” sounds like adding overhead. It’s worth being concrete about the actual numbers.

We measured governance overhead across several task types in Agentic OS v1.1 (April 2026, using chars/4 as the token estimation formula; actual counts vary by ±10% depending on tokenizer). For a quick-win task (something like fixing a date format in a CSV export), the governance overhead came to 17,041 tokens. For a complex feature touching API design, authentication, and database schema, it came to 50,975 tokens.

Those numbers sound large until you compare them to the cost of an ungoverned failure. A governance failure typically means: an undetected wrong completion that gets discovered later, a context restart, redone work, and scope cleanup. None of those costs are bounded or predictable.

The governance overhead is bounded. It scales with task complexity in a predictable way: the lightest path costs roughly 17K tokens; the heaviest measured scenario costs under 62K. The cost of recovering from a scope error or a missed completion criterion is not bounded. It depends on when you find it.

This isn’t an argument for any particular framework. It’s an argument for the structure itself: known, upfront cost versus unbounded, discovery-time cost. That trade-off is the same one CI gates resolved for software deployment twenty years ago.

The question to ask before the task starts

None of this requires a framework. The diagnostic test at the task level is simpler than that.

Before your next agent task: what artifact would prove this is done?

Not “what would it mean to be done.” That’s vague enough that the agent will fill in the answer. What artifact, specifically, would you point to afterward and say: here is the evidence this completed correctly?

If you can answer that question before the task starts, you have a completion criterion. If you can’t, you don’t, and the agent will invent one. That invented criterion is almost never the one you wanted. The everyday version doesn’t look like a security incident. It’s an agent that quietly refactored a module you didn’t mention, or updated a config file it found nearby. Its completion criterion included those things. Yours didn’t.

That’s the smallest possible governance structure. A definition of done, stated before work begins, tied to something observable.

The rest of the gaps (phase gates, state handoffs, resource scoping, capability boundaries) are the same logic applied at increasing scope. But they all start from the same place: deciding what “done” means before asking the agent to find out.

These observations are from building and using Agentic OS v1.1 (April 2026). The field moves fast — if a model capability has improved or a pattern here no longer holds, I want to know. The framework is open source and the issues are open: github.com/KbWen/agentic-os.

This post is part of a series on building real AI systems. Related reading: What Makes an AI Skill Different from a Prompt? covers the capability abstraction layer that sits below agent orchestration. The zh-TW companion post AI 代理常見痛點與我們的嘗試 covers the same failure catalogue with more narrative depth. Both build on Beyond Prompt: From Instructions to Building Systems.

AI 代理常見痛點與我們的嘗試

KbWen — Fri, 22 May 2026 10:00:00 +0800

TL;DR： AI 代理失控通常不是模型的問題，而是缺少足夠的結構。這篇整理了我們在實踐中觀察到的幾個痛點，以及 Agentic OS 試著用哪些方向來應對——不保證這是最好的做法，AI 工具本身也還在快速演化。

如果你已經在用 Claude Code、Cursor 或 Copilot 一段時間，你大概知道那種感覺：有時候它快得讓你懷疑自己為什麼還要打字，但有時候你盯著它的輸出，心裡只有一個念頭——「等等，它在幹嘛？」

印象更深的往往是後者。我發現有幾類問題會反覆出現，跟你用哪個模型或哪個工具關係不大，比較像是讓 AI 代理參與真實開發這件事本身帶來的結構性挑戰。

如果你讀過從「下指令」到「蓋系統」，這篇可以看成那個思路的延伸——當你開始用 agent 做真實開發，「結構不夠」這件事的代價變得具體很多。

Agentic OS 是站在很多公開工作的肩膀上做出來的。AGENTS.md 這個慣例最初來自 OpenAI Codex 的設計，後來被 Cursor、GitHub Copilot 等主流 AI 工具廣泛採納；Anthropic 有自己的 CLAUDE.md；Cursor 有 .cursor/rules——各自代表不同工具對「怎麼讓 AI 記住專案規則」這個問題的嘗試。我們參考了這些設計，加上 Hacker News、Reddit 社群裡的實測討論，還有 Pete Hodgson、Addy Osmani、Thorsten Ball 等工程師整理的失效模式分析，試著把它們整合成一套對我們自己有用的東西。這個框架比較像是整合與實驗的產物，不是從零發明的。

幾個反覆出現的痛點

以下整理自我們自己踩過的坑，也有部分來自社群的集體觀察。不是嚴謹的研究，是實踐者的筆記。

輸出難以核查

AI 完成任務後，你拿到的往往是一段文字說「已完成」或「功能已實作」。問題是「完成」的依據是什麼？在單一短對話裡這不是大問題，但一旦任務橫跨多個 session，或者事後需要追溯某個決策的來源，你往往什麼都找不到——沒有 commit SHA、沒有測試輸出、沒有可以指著說「它在這裡」的東西。只有對話紀錄，而對話紀錄不算數。

這個問題後來直接影響了我們的框架設計。Agentic OS 裡有一條規則：就算是「重讀同一份文件」這個動作，也必須留下一筆收據。聽起來很囉嗦，但沒有這個，「我讀過了」和「我沒讀過」在紀錄裡是完全一樣的。

跳過中間步驟

給 AI 一個任務，它的自然傾向是直接往結果走。這在小任務上沒問題。但任務稍微複雜一點——比如需要同時異動前端、後端和資料庫——省掉的「先確認範圍」、「列出影響的模組」這些步驟，往往要在後面以更大的代價補回來。工程師 Pete Hodgson 在他的文章裡提到，當一個問題有很多不同的解法時，AI 選到你心目中那個的機率趨近於零——提前對齊方向，跟模型能力無關，是流程問題。

跨對話的連貫性

在那篇談 Prompt 局限的文章裡，我說過 AI「只活在那一次的對話框裡」。這個限制在用 agent 做持續開發的時候感受更強烈。每次開新對話，你得重新交代背景：這個專案的架構決策是什麼、上次決定用哪種設計模式、之前踩過什麼坑。這不只是麻煩，而是會讓同樣的問題被重新發現、同樣的決策被重新討論。IEEE Spectrum 的一篇報導裡提到，AI 在長 session 的後期，出現重複生成已存在函式、忽視早期建立的 coding convention 等情況的頻率明顯上升——本質上是 context 稀釋的問題。

資源使用的不確定性

AI 代理讀文件、呼叫工具、產生輸出，這些都有成本，而且差距可以很大。我們在 Agentic OS v1.1 的 benchmark 裡（2026 年 4 月量測）跑了幾個真實場景：quick-win 等級任務（例如修一個 CSV 格式問題）實際消耗約 17,041 token；涵蓋 API、認證、資料庫的複雜功能開發則約 51,000 token，相差接近三倍。這些數字來自特定的任務類型與工具組合——我們用的估算公式是 chars / 4，接近多數 OpenAI tokenizer，但不完全一致——不同模型、context 策略下的結果可能差距顯著。

更複雜的是，這個計算現在又多了一層變數。主流模型——包括 Claude 和 OpenAI 的系列——已經有 prompt cache 機制，在某些條件下可以大幅降低重讀相同 context 的成本。這讓我們原本關於「怎麼控制 context 讀取策略」的很多設計假設需要重新檢視。我們還在觀察這個演變，舊的建議不一定還適用。

範圍的模糊

這類問題比較難描述，因為它不一定會報錯——它只是靜靜地做了你沒有要求它做的事。安全研究員 Johann Rehberger（筆名 Embrace the Red）花 $500 測試了 Devin AI 的 prompt injection 抵抗力，並於 2025 年 4 月將結果通報給 Devin 的開發商 Cognition。測試結果顯示透過 GitHub issue 嵌入惡意指令，可以讓 Devin 執行預期範圍以外的操作，整體攻擊成功率達 84–85%。這是極端的例子，但「AI 自己決定任務邊界」這件事的普通版本，每天都在發生——它只是偷偷多改了一個 config 檔，或者順手重構了你沒說要動的模組。

我們試著做的事

Agentic OS 的出發點，是試著在這些問題上加一些結構。主要思路有幾個方向：

我們把核心原則叫做 “No Evidence = No Completion”——想法本身不新奇，軟體工程裡的 CI/CD gate 做的就是這件事，只是把它搬到了 AI 代理的工作流程裡。每個任務的交付都要附帶某種形式的 evidence，不一定很複雜，但要有東西可以查。同時，根據任務的規模，要走的流程也不一樣：單行改動走輕量路徑；功能開發走比較完整的流程，包含計劃、實作、審查幾個階段。這個分層設計部分參考了 Anthropic 和 Cursor 社群分享的做法，調整成對我們自己比較實用的版本。

用 Work Log 保持連貫性。 每個任務有一份對應的工作記錄，記關鍵決策和目前狀態，讓下一個 session 能接續而不是重來。這是個很笨的方法（基本上就是強迫 AI 寫日記），但在我們找到更好的方式之前，它目前還算有用。

至於資源分配，我們試著把不同分類的任務對應到不同的 skill 載入策略，不一次讀所有東西。不過如前面說的，model cache 機制的演進讓這部分的設計面臨一些調整，舊的策略不一定還有效。

一些誠實的話

這套框架有用，但不是沒有問題——有些設計現在回頭看也不一定是最好的決定，只是當時看起來合理。Addy Osmani 把這個現象稱為「70% 問題」：AI 能很快帶你到 70% 的完成度，但剩下的 30% 往往需要更多工程判斷力，不是更少。設計一套治理框架也一樣——結構能幫你避開很多坑，但它改變不了你還是需要做設計決策這件事。

AI 工具的演進速度，讓任何固化的解法都有保鮮期的問題。有些我們在設計時試圖解決的問題，現在模型本身可能已經部分處理了；反過來，也有我們沒預想到的新狀況冒出來。我們把 Agentic OS 定位為一個持續演進的實驗，不是一個收斂的答案。這個系列會把框架的各個機制拆開來談。如果你也在摸索怎麼讓 AI 代理在實際開發工作裡更可控、更可追溯，希望有些地方能對你有參考價值。

下一篇：只用 Prompt 和技能也能做好治理：實用技巧與範例

Agentic OS 是開源專案，歡迎看看我們怎麼實作，也歡迎指出你覺得不對的地方：github.com/KbWen/agentic-os