Governance on KbWen Blog

No evidence, no completion

KbWen — Fri, 22 May 2026 20:00:00 +0800

TL;DR: “No evidence, no completion” is a single structural principle: a task isn’t done until the agent produces an artifact that exists outside the conversation and can be checked independently. It sounds trivial. In practice it closes most of the common agent failure modes in one rule, because the act of specifying what evidence looks like, before the task runs, forces you to define what “done” actually means.

In the previous post in this series I described an agent that said a feature was done (commit SHA requested, none existed, two of three modules unchanged). The failure had a name: no external completion criterion existed, so the agent supplied its own. That gap has a one-rule fix.

What “evidence” means here

Evidence is any artifact that exists outside the conversation and can be verified independently of what the agent said.

A commit SHA is evidence. A test output is evidence. A file path with a checksum is evidence. A screenshot of a passing CI run is evidence.

“I implemented it” is not evidence. “The feature is working” is not evidence. A description of what the agent did is not evidence: it’s the agent’s own assessment of its work, which is exactly what you’re trying to verify.

The distinction matters because conversation text is not auditable. It exists only within the session, can’t be pointed to by anyone who wasn’t there, and doesn’t prove the underlying state of the system. An artifact external to the conversation can be checked at any time, by anyone, against the actual state.
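One way to make “a file path with a checksum” concrete is a small verifier that ignores whatever the agent said and checks the file on disk. This is a minimal sketch, not part of any framework: the function name `verify_file_evidence` and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def verify_file_evidence(path: str, expected_sha256: str) -> bool:
    """Return True only if the file exists and its SHA-256 matches.

    Hypothetical helper: the check reads the file itself and never
    consults the conversation.
    """
    file_path = Path(path)
    if not file_path.is_file():
        return False
    actual = hashlib.sha256(file_path.read_bytes()).hexdigest()
    return actual == expected_sha256.lower()
```

Anyone who has the path and the expected digest can rerun this later, which is the property conversation text lacks.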

Why one rule covers so much

The first post in this series catalogued five structural gaps: no completion criterion, no phase gate, no state handoff, no resource scoping, no capability boundary. The evidence principle doesn’t replace all of them, but it forces the most important one: you cannot specify what evidence looks like without first deciding what “done” means.

If the evidence for a feature task is “passing tests + commit SHA on the feature branch,” you’ve implicitly defined the completion criterion, the scope boundary (the feature branch, not the main codebase), and a checkpoint for the phase gate. The evidence requirement is the handle that pulls the rest of the structure into place.

This is why the distributed systems framing maps so cleanly: delivery acknowledgment in a message queue is exactly this pattern. The queue doesn’t trust the worker’s internal state; it requires an external signal that the job completed. Decades of production systems run on that principle because systems without it fail in the same predictable way.

Before the task, not after

The principle works when it’s applied before the task starts, not as a review step after.

“What would prove this is done?” asked before the work begins forces a design decision. It’s not a check on the agent — it’s a check on the task specification. If you can’t answer it, the task isn’t specified well enough to run. If you can answer it but the answer is vague (“the feature works”), the vagueness is in your specification, not in the agent’s execution.

This is the mechanism Pete Hodgson’s analysis of AI coding tools points toward: when a problem has many valid solutions, the agent will pick one. That one will probably be valid. It probably won’t be the one you wanted. Specifying evidence before the task runs is a way of narrowing the solution space — the agent’s output has to satisfy the evidence criterion, which eliminates the paths that don’t.

In practice: “implement email verification” with no evidence criterion produces one kind of output. “Implement email verification — done when: (1) tests pass for OTP generation and expiry, (2) commit SHA on feat/email-verification” produces a different one. Same model. Different structure around it.
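The difference between those two task statements can be sketched as data. The `TaskSpec` class and its `done_when` field below are hypothetical names, assumed only for this example; the point is that a task with no evidence criterion gets rejected before it runs.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """A task plus the evidence that must exist before it counts as done."""

    description: str
    done_when: list[str] = field(default_factory=list)

    def is_runnable(self) -> bool:
        # A task with no evidence criterion is not specified well enough to run.
        return any(criterion.strip() for criterion in self.done_when)


vague = TaskSpec("Implement email verification")
specific = TaskSpec(
    "Implement email verification",
    done_when=[
        "tests pass for OTP generation and expiry",
        "commit SHA on feat/email-verification",
    ],
)
```

Here `vague.is_runnable()` is false and `specific.is_runnable()` is true. The check says nothing about the agent; it only tests whether the specification is complete.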

What good evidence looks like

Evidence should be:

External to the conversation. It can be retrieved or verified by someone who wasn’t in the session. A commit SHA can be looked up. A test output can be reproduced. A URL can be visited.

Specific enough to be falsifiable. “Tests pass” is weaker than “running npm test returns exit 0 with 47 tests passing.” The second can be false in a way that “tests pass” can’t — which is the point. If the evidence criterion can’t be falsified, it’s not doing the work.

Proportional to the task. A one-line bug fix doesn’t need a full audit trail. The evidence for a tiny fix is the commit SHA and a grep confirming the old string is gone. The evidence for a feature touching auth, API, and database schema is more involved: test output, migration SHA, API contract diff. The Agentic OS framework classifies tasks before they run partly to route to the appropriate evidence format: a quick-win task and an architecture-change task need different levels of proof.
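The “specific enough to be falsifiable” criterion can be sketched as a check that runs the test command itself and compares both the exit code and the pass count. This is a rough sketch under assumptions: `check_test_evidence` is a made-up name, and the default pattern assumes the runner prints a summary such as “47 passing”, which depends on the test runner in use.

```python
import re
import subprocess


def check_test_evidence(
    command: list[str],
    expected_passed: int,
    pattern: str = r"(\d+) passing",
) -> bool:
    """Run the command and require exit code 0 plus an exact pass count.

    The default pattern is an assumption about the runner's summary line;
    adjust it to whatever your test runner actually prints.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        return False
    match = re.search(pattern, result.stdout)
    return match is not None and int(match.group(1)) == expected_passed
```

For the example above, the call would look like `check_test_evidence(["npm", "test"], expected_passed=47)`. A run with 46 passing tests, or a non-zero exit, fails the check, which is exactly the way “tests pass” cannot fail.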

The cost of specifying evidence

Specifying evidence costs something up front. It takes maybe two minutes to think through “what would prove this is done” before a task starts. That’s real overhead.

The comparison is with recovery cost. A governance failure (completing a task that didn’t actually complete, or completing it the wrong way) typically costs: discovering the error, rebuilding context, rerunning the work, and auditing scope. None of those costs are bounded. The two minutes up front is.

The Agentic OS v1.1 benchmark (April 2026, using chars/4 as the token estimation formula, ±10%) measured governance overhead for a quick-win task at roughly 17,000 tokens: the cost of the full structured lifecycle, evidence requirement included. For a complex feature spanning API design, auth, and database schema, it’s around 51,000 tokens. Those numbers are real costs. They’re also bounded and known before the task starts. The cost of an undetected wrong completion has no ceiling: it depends on when you find it and how much work was built on top of it.

The question to ask before your next task

Before you give an agent its next task: what artifact would prove this is done?

Not “what would it mean to be done” — that’s vague enough for the agent to fill in. What specific artifact, external to the conversation, would you point to afterward and say: here is the evidence this completed correctly.

If you have an answer, you have a completion criterion. If you don’t, you’re delegating the definition of “done” to the agent. It will define one. It almost never matches yours.

This post is part of a series on building real AI systems. The previous posts cover the two-failure taxonomy and the distributed systems prior art that motivates the evidence requirement. The framework is open source at github.com/KbWen/agentic-os.

Prior art: what distributed systems already knows

KbWen — Fri, 22 May 2026 16:00:00 +0800

TL;DR: The governance problems that make AI agents unpredictable (unverified completions, state loss between sessions, unconstrained scope) are structurally identical to problems distributed systems engineering solved with audit logs, delivery acknowledgment, state machines, and least-privilege access. The one genuine difference is non-determinism: an agent given the same open-ended task twice will do something different, which means governance needs to front-load constraints rather than just catch failures after. But the rest of the pattern library applies directly.

If you have built a message queue, you have hit a version of this bug: a worker picks up a job, does the work, then fails before sending the acknowledgment. The queue marks it undelivered. The job runs again. Now you have a duplicate record, a double email, or worse, depending on what “the job” was.

The fix is well-understood: require the worker to produce evidence of completion that the system can verify externally. Don’t trust the worker’s internal state. Trust the artifact.
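As a toy illustration of “trust the artifact,” here is an in-memory queue that releases a job only when the worker’s artifact passes a check the queue controls. It is a sketch under assumptions, not how any particular message broker works: `AckQueue` and the `verify` callback are hypothetical names.

```python
from collections import deque
from typing import Callable, Optional


class AckQueue:
    """In-memory sketch: a job leaves the queue only after its artifact verifies."""

    def __init__(self, verify: Callable[[str, str], bool]) -> None:
        self._pending: deque[str] = deque()
        self._in_flight: set[str] = set()
        self._verify = verify  # external check: (job_id, artifact) -> bool

    def put(self, job_id: str) -> None:
        self._pending.append(job_id)

    def take(self) -> Optional[str]:
        """Hand the next job to a worker and mark it in flight."""
        if not self._pending:
            return None
        job_id = self._pending.popleft()
        self._in_flight.add(job_id)
        return job_id

    def ack(self, job_id: str, artifact: str) -> bool:
        """Accept completion only if the artifact passes external verification."""
        if job_id not in self._in_flight:
            return False
        self._in_flight.discard(job_id)
        if self._verify(job_id, artifact):
            return True
        self._pending.append(job_id)  # unverified claim: redeliver the job
        return False
```

A worker that acknowledges with “done, trust me” gets the job back; one that acknowledges with an artifact the verifier recognizes does not.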

When an AI agent says “done” and you have no artifact to check against, that’s the same design gap. The previous post in this series has a concrete example: the agent said the feature was done, I asked for the commit SHA, there wasn’t one, and two of the three modules it described implementing hadn’t changed. A capability failure looks like wrong reasoning. This wasn’t one: the agent completed exactly what it was given, judged by its own completion criterion, because no external one existed. The fix is in the surrounding structure.

Distributed systems already solved the worker-reliability problem. The patterns map directly.

What agent execution looks like from the outside

Strip the language model out for a moment. What’s left?

A task arrives. A worker picks it up, performs operations, and signals completion. The orchestrator decides what to do next.

Standard async task pipeline. The governance questions are the same ones distributed systems have always asked: Did the work actually happen? What state is the system in now? What was the worker allowed to touch?

The answers (delivery acknowledgment, audit logs, state machines, capability sandboxing) aren’t novel. They exist because systems without them fail in predictable, documented ways. Agent deployments running without that structure encounter the same failure modes.

The pattern mapping

Distributed systems pattern	Agent governance equivalent
Delivery acknowledgment	Every task completion requires an external verifiable artifact: commit SHA, test output, file path
Idempotency key	Task dispatch is deduplicated: same task classified and scoped the same way, regardless of retry
Audit log / event sourcing	Work Log: decisions recorded at the time they happen, not reconstructed from memory later
State machine with explicit transitions	Phase gate: plan before implementing, review before shipping, with real entry/exit conditions
Least privilege / capability sandbox	Agent’s tool access scoped to what the specific task requires, not everything available
Resource quota	Task classification that routes work to an appropriately sized execution path before it begins

The Agentic OS framework is essentially this table implemented as a working system, not because it invented these patterns, but because building it kept arriving at the same structural answers distributed systems already had. The evidence requirement feels new until you recognize it as a CI gate. The work log feels novel until you recognize it as event sourcing. The insight isn’t original; it’s just overdue.
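The “state machine with explicit transitions” row can be sketched in a few lines. This is an illustrative assumption, not the Agentic OS implementation: the phase names and the `PhaseGate` class are invented here, and the only rule enforced is that a phase cannot be left without recording exit evidence.

```python
PHASES = ["plan", "implement", "review", "ship"]


class PhaseGate:
    """Sketch of a linear phase gate: advance only with exit evidence."""

    def __init__(self) -> None:
        self.index = 0
        self.evidence: dict[str, str] = {}

    @property
    def phase(self) -> str:
        return PHASES[self.index]

    def advance(self, exit_evidence: str) -> str:
        """Leave the current phase, recording the artifact that justifies it."""
        if not exit_evidence.strip():
            raise ValueError(f"no evidence, cannot leave phase {self.phase!r}")
        if self.index == len(PHASES) - 1:
            raise ValueError("already in the final phase")
        self.evidence[self.phase] = exit_evidence
        self.index += 1
        return self.phase
```

Calling `advance("")` from the plan phase raises instead of moving on, so implementing before planning has produced an artifact is not a reachable state.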

The one place the analogy breaks

Distributed systems patterns like retry assume deterministic workers: same input, same output, so rerunning an idempotent job is safe.

Agents aren’t deterministic, at least not for open-ended tasks. The same prompt, the same tools, the same context: execution goes somewhere different. Sometimes better. Often just different. For well-scoped sub-tasks (“run these tests and report failures,” “format this JSON to this schema”), retry still works fine. But for the tasks where governance matters most (feature implementation, refactoring decisions, scope-touching work), retry isn’t a recovery strategy; it’s another roll.

This is what Pete Hodgson’s analysis of AI coding tools points toward: when a problem has many valid solutions, the probability that an agent independently lands on the one you wanted approaches zero. The governance implication is that task decomposition is itself a governance act. Break work into pieces small enough that non-determinism is contained. Then front-load the constraints on the pieces that remain open-ended: define what “done” means, specify which files are in scope, classify the task before the first tool call.

The circuit breaker in distributed systems stops a cascade after failures accumulate. The agent equivalent is not letting the cascade start.

Where to instrument

Distributed systems tell you to instrument at the transition points: message intake, worker pickup, task completion, downstream dispatch. These are where state changes happen and where failures manifest.

The agent equivalent:

Task intake: Is this classified correctly? What phase path follows? What tools does it need, and only those?
Phase completion: What artifact exists to prove this phase is done? Is it external to the conversation?

The third transition point is worth more than a bullet. Session boundary is the agent-specific failure mode that has no clean distributed-systems equivalent: it’s closer to a stateless worker that loses its in-memory state and reprocesses from the queue head on restart. An IEEE Spectrum report on AI coding tools documented the pattern: in longer sessions, agents increasingly regenerated functions that already existed and ignored conventions established earlier. The fix is identical to the queue case: persistent state external to the worker. In agent terms: a work log that records decisions at the time they’re made, so the next session inherits context instead of reconstructing it.
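A work log of this kind can be as simple as an append-only file. The sketch below assumes a JSON Lines file and two hypothetical helpers, `record_decision` and `load_decisions`; the actual Agentic OS format may differ.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_decision(log_path: Path, decision: str, reason: str) -> None:
    """Append one decision to the log at the time it is made."""
    entry = {
        "at": datetime.now(timezone.utc).isoformat(),
        "decision": decision,
        "reason": reason,
    }
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")


def load_decisions(log_path: Path) -> list[dict]:
    """Read every recorded decision so a new session can inherit them."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as log:
        return [json.loads(line) for line in log if line.strip()]
```

The design choice that matters is append-only writes at decision time: the next session reads what was decided and why, rather than asking the worker to remember.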

Which gaps cost the most

The distributed systems frame doesn’t just explain why agent governance looks the way it does — it tells you which gaps cost the most.

Missing completion verification produces the cheapest failures: you find out fast. Missing scope constraints produce the expensive ones: the agent did three things you didn’t ask for, two of which were correct, and now you’re auditing which is which. Missing session state produces the hidden ones: the agent solved a problem you already solved, using a pattern you already decided against, because it had no way to know.

If you’re choosing where to add structure first: start with scope. The task intake gate is the circuit breaker — it constrains what the agent can reach before it runs. The work log is the audit trail you need after something goes wrong. The completion artifact is the acknowledgment the queue was never getting.

Add them in that order.

This post is part of a series on building real AI systems. The previous post, Why AI Agents Go Wrong: It’s Not the Model, covers the capability vs. governance failure taxonomy that motivates this framing. Next: No Evidence, No Completion takes the evidence requirement as a standalone principle and shows what it looks like in practice. The framework is open source at github.com/KbWen/agentic-os.

Basic governance with just prompts and skills

KbWen — Fri, 22 May 2026 14:00:00 +0800

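TL;DR: Before you install any framework, there is one layer of governance that’s free: put an AGENTS.md or CLAUDE.md in the project root, make a habit of asking for evidence, and say clearly what must not be touched before a task starts. These three things don’t replace cross-session state management, but they block most of the common problems. This post covers how to do it, how far it gets you, and where it breaks down.

For a while my Claude Code workflow had no framework at all, just conversations and a pile of ad hoc prompts. One day I made two changes: I wrote the project’s architecture decisions into a CLAUDE.md, and every time the AI said “done,” I asked, “What’s the commit SHA?”

One class of problems almost disappeared: the AI writing code against design patterns that didn’t exist in a new session, and me accepting a “done” only to find that nothing had changed. Not every problem was solved. But the payoff from those two changes made me start seriously asking how much governance is possible at this level, before installing a framework.

This post extends Common AI agent pain points and what we’ve tried (AI 代理常見痛點與我們的嘗試). That post listed five recurring problems; this one answers a narrower question: how many of them can prompt habits and skill selection alone solve?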

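Memory files: the cheapest fix for cross-session amnesia

An AI agent starts every new conversation as a blank slate. It doesn’t remember the last architecture decision, doesn’t remember which pattern you told it not to use, and doesn’t remember that you already have a utils/auth.ts, so it writes a new one. An IEEE Spectrum report includes measurements of this problem: late in long sessions, the AI noticeably more often regenerates functions that already exist and ignores coding conventions established earlier.

Three tools are trying to solve the same problem:

AGENTS.md is a convention originally designed for OpenAI Codex and later widely adopted by mainstream tools such as Cursor, GitHub Copilot, and Google Antigravity. Its design logic: before a tool starts working on the project, tell it “how this project works, what you may do, and what you may not do.”

CLAUDE.md is Anthropic’s version for Claude Code. Claude Code automatically injects this file’s contents at the start of every new session, so whatever you put here is effectively repeated at the top of every conversation.

.cursor/rules is Cursor’s counterpart. Same principle.

That these three conventions coexist shows that “how do I make the AI remember project rules” is a general problem, not one specific to a single tool. Which to choose depends on the tool you mainly use. You don’t need all three; one is enough to see an effect.

The most useful content in these memory files usually falls into three categories: architecture constraints (“this repo uses the Repository pattern; don’t put business logic in controllers”), naming conventions (“name services XxxService, not XxxManager”), and a do-not-touch list (“/database/migrations may only be changed when explicitly requested”).

One important caveat: keep these files short. Research observations and practice both point to the same upper limit: within 200 lines and 2,000 tokens. Beyond that, important rules get diluted. Technically the AI still reads the whole file, but by the end it has too little attention left for what it read at the beginning. When writing a CLAUDE.md, if you feel you need a sixth rule, first ask whether the first one can be deleted.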
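The 200-line and 2,000-token limits are easy to check mechanically. The sketch below is an assumption-laden illustration: `memory_file_report` is a hypothetical helper, and it borrows the chars/4 token estimate this series uses for its benchmark, which is only a rough approximation (and rougher still for non-English text).

```python
from pathlib import Path

MAX_LINES = 200
MAX_TOKENS = 2000


def memory_file_report(path: Path) -> dict:
    """Check a memory file against the line and token limits.

    Token count uses the rough chars/4 estimate; real tokenizers differ.
    """
    text = path.read_text(encoding="utf-8")
    lines = len(text.splitlines())
    tokens = len(text) // 4
    return {
        "lines": lines,
        "estimated_tokens": tokens,
        "within_budget": lines <= MAX_LINES and tokens <= MAX_TOKENS,
    }
```

Running it as `memory_file_report(Path("CLAUDE.md"))` before committing a new rule gives a quick signal for when something older should be deleted first.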

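Skill selection: the more specific the request, the less interference

In a Claude Code or Cursor working session, you can load a lot of context: the whole codebase’s README, past conversation history, multiple skill documents. But “the more you load, the better” is a trap.

A task that fixes a one-line typo doesn’t need to know the full testing strategy, deployment rules, and API design principles. Stuffing all of that into context doesn’t make the AI more careful; it only leaves less attention for the one rule that actually matters when the AI works out which rules apply right now.

This isn’t specific to Agentic OS; it holds for any Claude Code or Cursor session. In practice: before starting a task, work out what the task needs to know, then provide only that. For a tiny fix, “here’s the line of code, fix it” is enough; only a feature spanning several modules needs the design patterns, testing strategy, and database conventions spelled out.

The result is denser relevant information for the AI, not less information.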

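The evidence habit: if you don’t ask, it won’t tell

This is the cheapest change of all, and the one that surprised me most.

When the AI says “done,” it may really be done, it may have finished 90% and then worked around a small problem, or it may have misunderstood the direction entirely. In its output, these three cases sometimes look almost identical.

Build a habit: before accepting any “done,” ask for a concrete artifact.

Not a form, not a process, just one sentence: “What’s the commit SHA?” “Run the tests and paste me the output.” “Which file did you change, and which lines?”

The habit works not only because it gives you something to check. Asking the question itself makes the AI spell out what it left unsaid. Often it’s only when I ask “did the tests pass?” that it says “ah, I haven’t run that test yet, because the setup for X has a problem,” and had I not asked, it might have silently skipped that information.

Honestly: this habit is tiring. After asking a dozen or so times, you start to understand why people want to automate it; the evidence gate in a framework is this question-and-answer run automatically. But as a habit, it blocks roughly 60–70% of the cases where you accept something that looks done and later find it wasn’t.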

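Scope declaration: say what not to touch first

Before starting a complex task, tell the AI explicitly what it should not touch.

Specific wording beats vague wording: “You’re working on the authentication module. Unless I explicitly say so, don’t touch anything under /api/payments or /database/migrations” is far more useful than “just focus on auth.”

The reason isn’t exactly that “the AI will comply”; it doesn’t always. It’s that once the boundary is declared, the AI starts asking questions when it’s unsure instead of deciding for itself. After I gave an instruction like this, at a point where it would previously have gone straight in and changed the payments module, it asked me, “Do you need me to update the payment validation logic here?” That shift is valuable.

This observation ties directly to Pete Hodgson’s analysis of AI coding assistant failure modes: when a problem has many possible solutions, the probability that the AI picks the one you had in mind approaches zero. Narrowing the solution space (which includes explicitly marking off the parts that can’t be touched) greatly raises the odds that it heads in the direction you want. This is a process problem, unrelated to model capability.

In Beyond Prompt: From Instructions to Building Systems, I said that the AI “only lives inside that one conversation window.” Declaring scope works within that limit: it lets the AI know, as far as possible, where that window’s boundaries are.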
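Because the AI doesn’t always comply, a declared boundary is worth checking after the fact. The sketch below compares a list of changed files against the do-not-touch paths from the example above. `out_of_scope` is a hypothetical helper; the list of changed files is assumed to come from something like `git diff --name-only`.

```python
from pathlib import PurePosixPath

# Paths named in the scope declaration; anything under them is off limits.
DO_NOT_TOUCH = ("api/payments", "database/migrations")


def out_of_scope(
    changed_files: list[str],
    forbidden: tuple[str, ...] = DO_NOT_TOUCH,
) -> list[str]:
    """Return the changed files that fall under a forbidden directory."""
    violations = []
    for name in changed_files:
        path = PurePosixPath(name.lstrip("/"))
        for root in forbidden:
            root_path = PurePosixPath(root)
            # Match the directory itself or anything nested beneath it.
            if path == root_path or root_path in path.parents:
                violations.append(name)
                break
    return violations
```

Matching on path components rather than string prefixes means a sibling directory such as api/payments_v2 is not flagged by accident.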

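What this level of governance can and can’t do

What it can do: make the AI remember your architecture decisions in a new session (memory files). Make it ask you when unsure rather than decide for itself (scope declaration). Give you a concrete checkpoint before accepting output (the evidence habit). Keep skill documents to a reasonable length so attention isn’t diluted (skill selection).

What it can’t do is coherent cross-session state. Memory files solve “the rules are remembered,” not “where did we leave off last time.” If your task spans multiple sessions, you still have to hand over the background manually each time, or accept the AI re-deriving everything from scratch. The fatigue of the evidence habit is real too: after asking fifty times, you’ll want to automate. That’s not a bad thing; it’s a signal that you now know where you need more formal structure. Scope declarations also degrade on complex tasks: the more modules involved, the harder it is to enumerate what not to touch.

This level of governance is real, not a “poor man’s version without a framework.” But it has a ceiling. When the context handoff starts to feel repetitive every time, the evidence Q&A wears you out, and the scope declaration list is longer than the task itself, you’ve hit the boundary of this level.

Next: Work Log: a memory mechanism across sessions

Agentic OS is open source; the memory file templates and design notes are all here: github.com/KbWen/agentic-os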

Why AI Agents Go Wrong: It's Not the Model

KbWen — Fri, 22 May 2026 12:00:00 +0800

TL;DR: “The agent did something wrong” usually gets diagnosed as a model problem. Most of the time it isn’t. Capability failures (wrong reasoning) and governance failures (no structure to catch wrong reasoning) look identical from the outside but need completely different fixes. This post is about telling them apart, and why most teams are currently solving the wrong one.

The agent said the feature was done. I asked for the commit SHA. There wasn’t one. When I checked the branch, two of the three modules it described implementing hadn’t changed.

The instinct in that moment is to reach for a better prompt, a smarter model, maybe a different tool call. That instinct is usually wrong.

What happened wasn’t a reasoning failure. The agent completed exactly the task it was given, interpreted through its own completion criterion, because no explicit one existed. There was no audit trail to check what it actually did. There was no scope boundary to constrain what “done” even meant. The model behaved correctly inside a system that gave it no structure to behave correctly toward.

That’s a governance failure, not a capability failure. And the fix is not a better model.

Two failure modes that look the same

When an agent produces bad output, the failure is almost always categorized as one thing: the AI got it wrong. Which leads to one solution category: better AI.

The problem is that “the AI got it wrong” conflates two distinct failure modes that have nothing to do with each other.

Capability failure: the model reasoned incorrectly. It missed a constraint, hallucinated a fact, drew a wrong inference. The fix lives in the model layer: better prompt, better retrieval, better fine-tuning, sometimes a more capable model.

Governance failure: the system had no invariant to catch or prevent what the agent did. The agent may have reasoned perfectly well and still produced a wrong outcome, because the surrounding structure gave it nothing to constrain against.

There’s a useful diagnostic test: would a smarter model have prevented this?

If yes, if the failure was clearly about incorrect reasoning or a factual miss, that’s a capability failure.

If no, if a brilliant expert given the same underspecified task would have made the same wrong choice, or a different wrong choice, because the task itself had no defined success condition, that’s a governance failure. Upgrading the model doesn’t help.

Most of the “unpredictable agent” complaints I’ve seen are governance failures. The problem gets framed as model unreliability because that’s what’s visible. The actual cause is invisible: the absence of structure.

The five structural gaps

These are the governance gaps that show up repeatedly, not as edge cases, but as the default state of most agent deployments. The zh-TW companion post AI 代理常見痛點與我們的嘗試 goes deeper on each one with narrative examples. Here I want to name the structural invariant that’s missing in each case.

Output not verifiable → no completion criterion or audit trail. The agent says “done.” You have no artifact to check against. The agent’s word that something happened is not evidence that it happened. The missing invariant: every task completion requires an attached evidence artifact: a file path, a commit SHA, a test result, something external to the conversation.

Steps skipped → no phase gate. Given a complex task, agents move toward output by the shortest path. Scope-setting, dependency mapping, impact analysis (anything that doesn’t look like “doing the thing”) gets skipped. The missing invariant: phases with entry and exit conditions that must be satisfied before proceeding. Pete Hodgson has written about this from an angle worth noting: when a problem has many valid solutions, the probability that an agent independently arrives at the one you actually wanted approaches zero. Pre-alignment isn’t overhead. It’s the phase gate that prevents redoing work.

Cross-session amnesia → no state handoff mechanism. Every new conversation is a blank slate. Decisions made in session one are unknown in session two. The agent rediscovers problems you’ve already solved, proposes patterns you’ve already rejected, rebuilds context you’ve already paid to build. An IEEE Spectrum report on AI coding tools documented this concretely: in longer sessions, agents increasingly regenerated functions that already existed and ignored conventions established earlier in the same session. The missing invariant: a structured work log that carries decisions forward across session boundaries. The mechanism we use is stupid-simple. It’s essentially forcing the agent to keep a diary. That description isn’t flattering, but cross-session amnesia is real enough that stupid-simple works.

Unbounded token cost → no resource scoping. An agent given a large task will read everything it can find, activate every relevant capability, and use as much context as the task allows it to justify. Without resource scoping, costs are unpredictable and you have no way to set expectations before a task starts. The missing invariant: task classification that routes to appropriately sized execution paths before the task begins.

Scope creep → no capability boundary. This is the quietest failure mode. The agent does what you asked, and also reorganizes a module you didn’t ask it to touch, and also “helpfully” updates a config file while it was in the neighborhood. Security researcher Johann Rehberger (Embrace the Red) made this failure mode concrete in April 2025 when he spent $500 testing Devin AI’s response to embedded instructions in GitHub issues, then reported the results to Cognition: 84–85% of attacks succeeded in getting the agent to execute actions outside the intended scope. That’s an extreme case, but the everyday version of this (the agent quietly expanding what “done” means) is the same structural gap. The missing invariant: explicit capability boundaries that define what the agent is allowed to do, not just what it’s been asked to do.

None of these gaps are model problems. A more capable model, given the same absent structure, makes the same category of errors, just more convincingly.
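The last invariant, an explicit capability boundary, can be sketched as a wrapper that exposes only the tools a task was granted. The `ScopedTools` class and the tool names in the usage note are hypothetical; real agent runtimes expose tools differently, so treat this as the shape of the idea rather than an integration.

```python
from typing import Callable


class ScopedTools:
    """Expose only the tools a specific task was granted."""

    def __init__(
        self,
        tools: dict[str, Callable[..., object]],
        allowed: set[str],
    ) -> None:
        self._tools = tools
        self._allowed = allowed

    def call(self, name: str, *args: object) -> object:
        """Run a tool if it is in scope; refuse loudly if it is not."""
        if name not in self._allowed:
            raise PermissionError(f"tool {name!r} is outside this task's scope")
        return self._tools[name](*args)
```

A read-only review task might be given `allowed={"read_file"}`; a call to a write tool then raises `PermissionError` instead of quietly succeeding, whatever the agent thought “done” included.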

Engineering already solved these problems

These aren’t new problems with new solutions. They’re old problems that software engineering solved decades ago, applied to a different execution substrate.

Governance gap	Engineering equivalent
No completion criterion	CI gate: no merge without passing checks
No phase gate	PR review requirement: code doesn’t ship without sign-off
No state handoff	Audit log / ADR: decisions are recorded, not reconstructed
No resource scoping	Budget / SLA: bounded cost before work starts
No capability boundary	Principle of least privilege: access limited to what the task requires

The analogy isn’t decorative. These are the same structural mechanisms. Building the evidence requirement for Agentic OS, I kept writing things that felt novel until I realized I was describing CI gates and audit logs with different names. The insight wasn’t new. It was just late.

A CI gate doesn’t trust the developer’s word that the tests pass. It requires evidence. An audit log records decisions at the time they’re made, so they don’t need to be reconstructed from memory later. Least privilege limits what an agent can touch, not out of distrust, but to contain the blast radius when something goes wrong.

The AGENTS.md convention, which originated with OpenAI Codex and is now adopted across tools such as Cursor and GitHub Copilot as a standard way for agents to load project context (Claude Code reads the equivalent CLAUDE.md), is essentially a machine-readable project governance document. It’s the same idea as a team’s architecture decision record, but in a format the agent reads automatically. That’s not a coincidence. It’s the same structural need surfacing in a new context.

What’s missing in most agent deployments isn’t better AI. It’s the application of mechanisms that software engineering already knows work.

What governance actually costs

“Adding structure” sounds like adding overhead. It’s worth being concrete about the actual numbers.

We measured governance overhead across several task types in Agentic OS v1.1 (April 2026, using chars/4 as the token estimation formula; actual counts vary by ±10% depending on tokenizer). For a quick-win task (something like fixing a date format in a CSV export), the governance overhead came to 17,041 tokens. For a complex feature touching API design, authentication, and database schema, it came to 50,975 tokens.
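For readers who want to reproduce the arithmetic, the estimate and its error band fit in a few lines. The function names are assumptions for this sketch, and integer division is my choice; the source only states the chars/4 rule and the ±10% variance.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the chars/4 rule."""
    return len(text) // 4


def token_range(estimate: int, tolerance: float = 0.10) -> tuple[int, int]:
    """Lower and upper bounds implied by a +/-10% tokenizer variance."""
    return (
        round(estimate * (1 - tolerance)),
        round(estimate * (1 + tolerance)),
    )


QUICK_WIN = 17_041        # measured overhead, quick-win task
COMPLEX_FEATURE = 50_975  # measured overhead, complex feature

# The complex path costs just under three times the quick-win path.
ratio = COMPLEX_FEATURE / QUICK_WIN
```

With the stated tolerance, the quick-win figure lands somewhere between about 15,300 and 18,700 tokens, so the headline numbers are best read as magnitudes rather than exact counts.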

Those numbers sound large until you compare them to the cost of an ungoverned failure. A governance failure typically means: an undetected wrong completion that gets discovered later, a context restart, redone work, and scope cleanup. None of those costs are bounded or predictable.

The governance overhead is bounded. It scales with task complexity in a predictable way: the lightest path costs roughly 17K tokens; the heaviest measured scenario costs under 62K. The cost of recovering from a scope error or a missed completion criterion is not bounded. It depends on when you find it.

This isn’t an argument for any particular framework. It’s an argument for the structure itself: known, upfront cost versus unbounded, discovery-time cost. That trade-off is the same one CI gates resolved for software deployment twenty years ago.

The question to ask before the task starts

None of this requires a framework. The diagnostic test at the task level is simpler than that.

Before your next agent task: what artifact would prove this is done?

Not “what would it mean to be done.” That’s vague enough that the agent will fill in the answer. What artifact, specifically, would you point to afterward and say: here is the evidence this completed correctly?

If you can answer that question before the task starts, you have a completion criterion. If you can’t, you don’t, and the agent will invent one. That invented criterion is almost never the one you wanted. The everyday version doesn’t look like a security incident. It’s an agent that quietly refactored a module you didn’t mention, or updated a config file it found nearby. Its completion criterion included those things. Yours didn’t.

That’s the smallest possible governance structure. A definition of done, stated before work begins, tied to something observable.

The rest of the gaps (phase gates, state handoffs, resource scoping, capability boundaries) are the same logic applied at increasing scope. But they all start from the same place: deciding what “done” means before asking the agent to find out.

These observations are from building and using Agentic OS v1.1 (April 2026). The field moves fast — if a model capability has improved or a pattern here no longer holds, I want to know. The framework is open source and the issues are open: github.com/KbWen/agentic-os.

This post is part of a series on building real AI systems. Related reading: What Makes an AI Skill Different from a Prompt? covers the capability abstraction layer that sits below agent orchestration. The zh-TW companion post AI 代理常見痛點與我們的嘗試 covers the same failure catalogue with more narrative depth. Both build on Beyond Prompt: From Instructions to Building Systems.

Common AI agent pain points and what we’ve tried (AI 代理常見痛點與我們的嘗試)

KbWen — Fri, 22 May 2026 10:00:00 +0800

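TL;DR: AI agents going off the rails is usually not a model problem but a lack of sufficient structure. This post collects several pain points we’ve observed in practice and the directions Agentic OS tries in response. There’s no guarantee that this is the best approach, and the AI tools themselves are still evolving fast.

If you’ve been using Claude Code, Cursor, or Copilot for a while, you probably know the feeling: sometimes it’s so fast you wonder why you still bother typing, but other times you stare at its output with a single thought: “Wait, what is it doing?”

The latter tends to leave the deeper impression. I’ve found that a few classes of problems keep recurring, with little relation to which model or tool you use; they look more like structural challenges that come from having AI agents take part in real development at all.

If you’ve read Beyond Prompt: From Instructions to Building Systems, you can treat this post as an extension of that line of thinking: once you start using agents for real development, the cost of “not enough structure” becomes much more concrete.

Agentic OS stands on the shoulders of a lot of public work. The AGENTS.md convention originally came from the design of OpenAI Codex and was later widely adopted by mainstream AI tools such as Cursor and GitHub Copilot; Anthropic has its own CLAUDE.md; Cursor has .cursor/rules. Each represents a different tool’s attempt at the question “how do I make the AI remember project rules.” We drew on these designs, plus hands-on discussions in the Hacker News and Reddit communities and the failure-mode analyses compiled by engineers such as Pete Hodgson, Addy Osmani, and Thorsten Ball, and tried to integrate them into something useful for ourselves. The framework is more a product of integration and experiment than something invented from scratch.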

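A few recurring pain points

The following comes from pits we’ve fallen into ourselves, with some drawn from the community’s collective observations. It isn’t rigorous research; it’s practitioners’ notes.

Output is hard to verify

After the AI finishes a task, what you get is often a piece of text saying “completed” or “feature implemented.” The question is what “completed” is based on. In a single short conversation this isn’t a big problem, but once a task spans multiple sessions, or you later need to trace where a decision came from, you often find nothing: no commit SHA, no test output, nothing you can point to and say “here it is.” There’s only the conversation record, and the conversation record doesn’t count.

This problem later shaped our framework design directly. Agentic OS has a rule: even the action of “re-reading the same document” must leave a receipt. It sounds fussy, but without it, “I read it” and “I didn’t read it” look exactly the same in the record.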

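Skipping intermediate steps

Give the AI a task and its natural tendency is to head straight for the result. On small tasks that’s fine. But when the task is slightly more complex (say it changes the frontend, backend, and database at once), the skipped steps such as “confirm the scope first” and “list the affected modules” often have to be made up later at greater cost. Engineer Pete Hodgson notes in his writing that when a problem has many different solutions, the probability that the AI picks the one you had in mind approaches zero. Aligning on direction up front has nothing to do with model capability; it’s a process problem.

Continuity across conversations

In that post on the limits of prompts, I said the AI “only lives inside that one conversation window.” This limit is felt even more strongly when using agents for ongoing development. Every time you open a new conversation, you have to re-explain the background: what this project’s architecture decisions are, which design pattern was chosen last time, which pitfalls you’ve already hit. That’s not just a hassle; it means the same problems get rediscovered and the same decisions get re-debated. An IEEE Spectrum report notes that late in long sessions, the AI noticeably more often regenerates functions that already exist and ignores coding conventions established earlier. At root, it’s a context dilution problem.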

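Uncertainty in resource use

AI agents read documents, call tools, and produce output; all of that has a cost, and the spread can be large. In the Agentic OS v1.1 benchmark (measured April 2026) we ran several real scenarios: a quick-win task (for example, fixing a CSV format problem) consumed about 17,041 tokens; a complex feature covering API, authentication, and database took about 51,000 tokens, nearly three times as much. These numbers come from specific task types and tool combinations. The estimation formula we used is chars / 4, which is close to most OpenAI tokenizers but not identical, and results may differ significantly under other models and context strategies.

It gets more complicated, because this calculation now has another variable. Mainstream models, including the Claude and OpenAI families, already have prompt caching, which under certain conditions can greatly reduce the cost of re-reading the same context. That means many of our original design assumptions about “how to control context-reading strategy” need to be re-examined. We’re still watching this evolve, and old advice may no longer apply.

Blurry scope

This class of problem is harder to describe, because it doesn’t necessarily raise an error; it just quietly does things you didn’t ask for. Security researcher Johann Rehberger (who writes as Embrace the Red) spent $500 testing Devin AI’s resistance to prompt injection and reported the results to Cognition, Devin’s developer, in April 2025. The tests showed that malicious instructions embedded in GitHub issues could get Devin to perform operations outside the intended scope, with an overall attack success rate of 84–85%. That’s an extreme example, but the ordinary version of “the AI decides the task boundary itself” happens every day: it just quietly changes one more config file, or casually refactors a module you never said to touch.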

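What we’re trying to do

The starting point of Agentic OS is to try adding some structure around these problems. The main ideas go in a few directions:

We call the core principle “No Evidence = No Completion.” The idea itself isn’t novel; CI/CD gates in software engineering do exactly this, and we’ve just moved it into the AI agent workflow. Every task delivery must come with some form of evidence, not necessarily complex, but something that can be checked. Meanwhile, the process a task goes through depends on its size: a one-line change takes a lightweight path; feature development takes a fuller process with planning, implementation, and review phases. This tiered design partly draws on practices shared by the Anthropic and Cursor communities, adjusted into a version that’s more practical for us.

Use a Work Log to maintain continuity. Each task has a corresponding work record that captures key decisions and current state, so the next session can pick up rather than start over. It’s a dumb method (basically forcing the AI to keep a diary), but until we find a better way, it’s still reasonably useful.

As for resource allocation, we try to map different task classes to different skill-loading strategies instead of reading everything at once. But as mentioned above, the evolution of model caching means this part of the design faces some adjustment, and old strategies may no longer be effective.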

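Some honest words

This framework is useful, but not without problems; some designs, looking back, weren’t necessarily the best decisions and only seemed reasonable at the time. Addy Osmani calls this phenomenon the “70% problem”: AI can get you to 70% completion quickly, but the remaining 30% often needs more engineering judgment, not less. Designing a governance framework is the same: structure helps you avoid many pitfalls, but it doesn’t change the fact that you still have to make design decisions.

The pace at which AI tools evolve gives any fixed solution a shelf life. Some problems we tried to solve at design time may now be partly handled by the models themselves; conversely, new situations we didn’t anticipate keep appearing. We position Agentic OS as a continuously evolving experiment, not a settled answer. This series will take the framework’s mechanisms apart one by one. If you’re also exploring how to make AI agents more controllable and more traceable in real development work, I hope some of this is useful to you.

Next: Good Governance with Only Prompts and Skills: Practical Tips and Examples

Agentic OS is open source. You’re welcome to look at how we implement it, and to point out anything you think is wrong: github.com/KbWen/agentic-os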