Agentic OS on KbWen Blog

No evidence, no completion

KbWen — Fri, 22 May 2026 20:00:00 +0800

TL;DR: “No evidence, no completion” is a single structural principle: a task isn’t done until the agent produces an artifact that exists outside the conversation and can be checked independently. It sounds trivial. In practice it closes most of the common agent failure modes in one rule, because the act of specifying what evidence looks like, before the task runs, forces you to define what “done” actually means.

In the first post in this series I described an agent that said a feature was done: I asked for the commit SHA, none existed, and two of the three modules it claimed to have implemented were unchanged. The failure had a name: no external completion criterion existed, so the agent supplied its own. That gap has a one-rule fix.

What “evidence” means here

Evidence is any artifact that exists outside the conversation and can be verified independently of what the agent said.

A commit SHA is evidence. A test output is evidence. A file path with a checksum is evidence. A screenshot of a passing CI run is evidence.

“I implemented it” is not evidence. “The feature is working” is not evidence. A description of what the agent did is not evidence: it’s the agent’s own assessment of its work, which is exactly what you’re trying to verify.

The distinction matters because conversation text is not auditable. It exists only within the session, can’t be pointed to by anyone who wasn’t there, and doesn’t prove the underlying state of the system. An artifact external to the conversation can be checked at any time, by anyone, against the actual state.
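As a minimal sketch of what “verified independently” can look like, the check below takes a claimed file path and SHA-256 checksum and confirms both against the filesystem rather than against anything the agent said. The function name `verify_file_evidence` and the shape of the claim are illustrative assumptions, not part of Agentic OS.

```python
import hashlib
from pathlib import Path


def verify_file_evidence(path: str, expected_sha256: str) -> bool:
    """Return True only if the file exists and its SHA-256 matches the claim.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    file_path = Path(path)
    if not file_path.is_file():
        return False  # the agent named an artifact that is not there
    actual = hashlib.sha256(file_path.read_bytes()).hexdigest()
    return actual == expected_sha256.lower()
```

Anyone with access to the file can rerun this check later and get the same answer, which is the property conversation text lacks.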

Why one rule covers so much

The first post in this series catalogued five structural gaps: no completion criterion, no phase gate, no state handoff, no resource scoping, no capability boundary. The evidence principle doesn’t replace all of them, but it forces the most important one: you cannot specify what evidence looks like without first deciding what “done” means.

If the evidence for a feature task is “passing tests + commit SHA on the feature branch,” you’ve implicitly defined the completion criterion, the scope boundary (the feature branch, not the main codebase), and a checkpoint for the phase gate. The evidence requirement is the handle that pulls the rest of the structure into place.

This is why the distributed systems framing maps so cleanly: delivery acknowledgment in a message queue is exactly this pattern. The queue doesn’t trust the worker’s internal state; it requires an external signal that the job completed. Decades of production systems run on that principle because systems without it fail in the same predictable way.

Before the task, not after

The principle works when it’s applied before the task starts, not as a review step after.

“What would prove this is done?” asked before the work begins forces a design decision. It’s not a check on the agent — it’s a check on the task specification. If you can’t answer it, the task isn’t specified well enough to run. If you can answer it but the answer is vague (“the feature works”), the vagueness is in your specification, not in the agent’s execution.

This is the mechanism Pete Hodgson’s analysis of AI coding tools points toward: when a problem has many valid solutions, the agent will pick one. That one will probably be valid. It probably won’t be the one you wanted. Specifying evidence before the task runs is a way of narrowing the solution space — the agent’s output has to satisfy the evidence criterion, which eliminates the paths that don’t.

In practice: “implement email verification” with no evidence criterion produces one kind of output. “Implement email verification — done when: (1) tests pass for OTP generation and expiry, (2) commit SHA on feat/email-verification” produces a different one. Same model. Different structure around it.
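One way to make the “done when” clause hard to skip is to treat it as a required field of the task itself. The sketch below is an assumption about how that could look, not the framework’s actual format: `TaskSpec`, `spec_problems`, and the list of vague phrases are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical phrases that usually signal an unfalsifiable criterion.
VAGUE_PHRASES = ("works", "is done", "looks good")


@dataclass(frozen=True)
class TaskSpec:
    description: str
    done_when: tuple[str, ...] = ()


def spec_problems(spec: TaskSpec) -> list[str]:
    """List the reasons a spec is not ready to hand to an agent."""
    if not spec.done_when:
        return ["no evidence criterion: the agent will supply its own"]
    problems = []
    for criterion in spec.done_when:
        if any(phrase in criterion.lower() for phrase in VAGUE_PHRASES):
            problems.append(f"vague criterion: {criterion!r}")
    return problems
```

An empty result means the specification at least names its evidence. It says nothing about whether the evidence is the right evidence, which remains a design decision.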

What good evidence looks like

Evidence should be:

External to the conversation. It can be retrieved or verified by someone who wasn’t in the session. A commit SHA can be looked up. A test output can be reproduced. A URL can be visited.

Specific enough to be falsifiable. “Tests pass” is weaker than “running npm test returns exit 0 with 47 tests passing.” The second can be false in a way that “tests pass” can’t — which is the point. If the evidence criterion can’t be falsified, it’s not doing the work.

Proportional to the task. A one-line bug fix doesn’t need a full audit trail. The evidence for a tiny fix is the commit SHA and a grep confirming the old string is gone. The evidence for a feature touching auth, API, and database schema is more involved: test output, migration SHA, API contract diff. The Agentic OS framework classifies tasks before they run partly to route to the appropriate evidence format: a quick-win task and an architecture-change task need different levels of proof.
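The falsifiable version of “tests pass” can be sketched as a check on two recorded facts: the exit code and the number of passing tests. Everything named here is hypothetical, including the assumed `<N> passing` summary format, and `fake_runner` is only a stand-in for a real command such as `npm test` so that the example runs anywhere.

```python
import re
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class RunEvidence:
    exit_code: int
    passed: int


def collect(cmd: list[str]) -> RunEvidence:
    """Run the command and record facts that can later be shown false."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    # Assumed summary format "<N> passing"; adjust the pattern to your runner.
    match = re.search(r"(\d+)\s+passing", proc.stdout)
    return RunEvidence(proc.returncode, int(match.group(1)) if match else 0)


def satisfies(evidence: RunEvidence, expected_passed: int) -> bool:
    """True only for exit code 0 and exactly the expected passing count."""
    return evidence.exit_code == 0 and evidence.passed == expected_passed


# Stand-in for a real test runner so the sketch is self-contained.
fake_runner = [sys.executable, "-c", "print('47 passing')"]
```

The criterion “exit 0 with 47 passing” fails if a test is deleted, skipped, or broken, which is exactly what makes it useful.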

The cost of specifying evidence

Specifying evidence costs something up front. It takes maybe two minutes to think through “what would prove this is done” before a task starts. That’s real overhead.

The comparison is with recovery cost. A governance failure (completing a task that didn’t actually complete, or completing it the wrong way) typically costs: discovering the error, rebuilding context, rerunning the work, and auditing scope. None of those costs are bounded. The two minutes up front is.

The Agentic OS v1.1 benchmark (April 2026, using chars/4 as the token estimation formula, ±10%) measured governance overhead for a quick-win task at roughly 17,000 tokens: the cost of the full structured lifecycle, evidence requirement included. For a complex feature spanning API design, auth, and database schema, it’s around 51,000 tokens. Those numbers are real costs. They are also bounded and known before the work starts. The cost of an undetected wrong completion has no ceiling: it depends on when you find it and how much work was built on top of it.

The question to ask before your next task

Before you give an agent its next task: what artifact would prove this is done?

Not “what would it mean to be done” — that’s vague enough for the agent to fill in. What specific artifact, external to the conversation, would you point to afterward and say: here is the evidence this completed correctly.

If you have an answer, you have a completion criterion. If you don’t, you’re delegating the definition of “done” to the agent. It will define one. It almost never matches yours.

This post is part of a series on building real AI systems. The previous posts cover the two-failure taxonomy and the distributed systems prior art that motivates the evidence requirement. The framework is open source at github.com/KbWen/agentic-os.

Work Log: a memory mechanism across sessions

KbWen — Fri, 22 May 2026 18:00:00 +0800

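TL;DR: A Work Log is a boring thing: a markdown file that records how far a task has gotten, which decisions were made, and where the next session should pick up. It doesn’t solve the AI memory problem; it routes around it. But until we find something better, it works.

In the post on governance basics, I said that an AI “lives only inside that one chat window.” The cost of that only becomes concrete once you actually start using AI agents for ongoing development.

You spend an hour with the AI discussing which design pattern this feature should use, why not the alternative, and how the database schema needs to change. Everything is settled, and implementation starts. The next day you open a new conversation to continue. The AI starts from zero: Which design pattern? What database? I don’t know what you’re referring to.

This isn’t a Claude problem, or a problem with any particular model. It’s simply how they work. The previous post explained that memory files (CLAUDE.md, AGENTS.md) help the AI remember a project’s architectural rules. But that solves “the rules need to be remembered,” not “where is this task right now.” The Work Log is for the latter.

Two layers of memory, two problems

First, the difference between the two, because I initially conflated them myself.

Project memory: what this project’s architecture is, which ADRs apply, where the active task list lives, which skills are available. It is global and static, unrelated to any specific task. You rarely touch it, but in every new session the AI has to read it to know what context it is in.

Task memory (Work Log): which phase this task is in, which decisions were made, where the next session should continue. It is dynamic and per-task. One task, one file; in Agentic OS it lives at .agentcortex/context/work/.md (see the repo for the full structure).

The consequence of mixing them: either the global state fills up with task-specific detail (which nobody can make sense of later), or task progress has nowhere to live (and every session starts over). Once separated, each problem gets its own solution.

What a Work Log looks like

Here is a simplified version of a real one (from github.com/KbWen/agentic-os):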

# Work Log: feat/email-verification

## Header
- Branch: feat/email-verification
- Classification: feature
- Current Phase: implement
- Checkpoint SHA: a3f9c12

## Task Description
Add an email OTP verification flow. Users must complete verification after first login;
unverified accounts are read-only and cannot write.

## Phase Sequence
| Phase     | Status      | Notes                                    |
|-----------|-------------|------------------------------------------|
| bootstrap | completed   | classified as feature                    |
| plan      | completed   | confirmed OTP, not magic link            |
| implement | in-progress | auth module done, email sending untested |
| review    | pending     |                                          |

## Gate Evidence
- Gate: plan | Verdict: pass | At: 2026-05-10T14:00Z
- Gate: implement | Verdict: FAIL | Reason: email sending untested, scope not complete | At: 2026-05-11T09:00Z

## Phase Summary
- plan: Discussed OTP vs magic link. Chose OTP because our email
  provider has rate limits and the retry design for magic link is more complex.
  Keep this decision; do not reopen it next session.

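The format isn’t the point; the Phase Summary is. For every completed phase, the AI has to say in one paragraph what was decided, why, and what the trade-offs were.

That paragraph isn’t written for people; it’s written for the AI in the next session. The difference is worth noting: language written for people tends to add a lot of contextual padding, while language written for an AI should be decision-dense and low on ambiguity. For the same decision, a human-facing note might say “chose OTP for efficiency reasons,” whereas the better form for an AI is “Use OTP, not magic link. Reason: provider rate limit + magic link retry complexity. This decision is sealed; do not reopen.” The latter goes straight into the AI’s context; the former leaves it to infer the rest.

A new conversation starts, the AI reads the Work Log, and it knows “OTP vs magic link has already been decided, no need to revisit.” It won’t suggest switching to magic link again, because that decision has been recorded and sealed.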
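How a new session might recover “where am I” from such a file can be sketched in a few lines. The parsing rules below are an assumption based only on the simplified sample above (a `## Header` section of `- Key: value` lines); `read_header` is a hypothetical helper, not the framework’s loader.

```python
def read_header(work_log: str) -> dict[str, str]:
    """Collect the '- Key: value' lines under the '## Header' section."""
    header: dict[str, str] = {}
    in_header = False
    for line in work_log.splitlines():
        if line.startswith("## "):
            # Entering any other section ends the header.
            in_header = line.strip() == "## Header"
            continue
        if in_header and line.startswith("- ") and ":" in line:
            key, value = line[2:].split(":", 1)
            header[key.strip()] = value.strip()
    return header


sample = """# Work Log: feat/email-verification

## Header
- Branch: feat/email-verification
- Current Phase: implement
- Checkpoint SHA: a3f9c12

## Gate Evidence
- Gate: plan | Verdict: pass
"""
```

With the sample above, `read_header(sample)["Current Phase"]` gives the phase to resume from, and the checkpoint SHA gives something external to verify before continuing.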

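What is worth recording

Not everything belongs in the Work Log. Start with the constraint: with a Work Log several hundred lines long, the AI’s attention is diluted by the time it finishes reading. Our cap is one paragraph per Phase Summary, no more than five sentences. The cap is what makes it worth thinking hard about what deserves the space.

From what I’ve observed, these come first:

Decisions, especially negative ones. The reason you decided not to do something is forgotten more easily than the reason you decided to do something. “We use OTP, not magic link, because of the rate limit problem”: if that isn’t recorded, the next session’s AI will probably suggest magic link again.

The state of the current phase. How far along it is, what is finished, what isn’t. This lets a new session pick up in the middle instead of from the start.

One more category is the easiest to overlook: an open question you ran into during the implement phase. Don’t silently carry on, and don’t let the AI find its own way around it. Write it down, and face it head-on at the start of the next session. If such a question isn’t recorded, the next time the AI hits the same fork it will take the wrong branch nine times out of ten, not because it’s dumb, but because it doesn’t know that you already know that path is a dead end.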

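The real limits of this approach

Being clear about what it can’t do matters more than saying what it can.

The Work Log solves the problem of externalizing decisions. The AI writes decisions somewhere external and reads them back next time, and only then is its behavior consistent. It does not solve the AI’s state-memory problem, because solving that requires changes at the model-architecture level. The Work Log is only a workaround: since context can’t survive across sessions, we write the most important things into a document and re-inject it when the session starts.

This workaround has a ceiling. When a task is complex enough, the Work Log itself bloats. You start to notice that you are writing a document, having the AI read the document, and then continuing from it, instead of just continuing the work. The whole process gets heavy.

One more thing that may not matter now but will later: prompt caching. Claude and other mainstream models have a prompt cache, so reusing the same context within one session is cheap (for Claude, the cache TTL is roughly 5 minutes to 1 hour). If your task can be finished in a single session, the Work Log’s ROI is actually limited: the cache preserves the context for you, without relying on external records. Where the Work Log really pays off is tasks that span multiple sessions, the kind where the cache expired long ago.

We position the Agentic OS Work Log as a good-enough interim solution, not a final answer. Native memory mechanisms in AI tools are evolving quickly, and the kinds of tasks that benefit from a Work Log today may be handled by the models themselves in a year or two. The first post in this series made the same observation: any solidified solution has a shelf life.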

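If you want to try it

The lightest way to start: pick a task, and before it begins, create a markdown file. Three sections are enough: the task goal (one sentence), decisions made (add a line each time you decide something), and where things currently stand (update at the end of every session).

You don’t need the full Work Log format. These three sections block most of the “next session starts from scratch” problem, for a simple reason: they force you to state the current state clearly before the session ends, rather than leaving the next AI to guess. Introduce the full phase structure once things get complex; you don’t need all of it up front.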
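The three-section file is small enough to maintain by hand, but a sketch shows how little machinery it needs. The file layout, section titles, and function names below are illustrative assumptions, not the Agentic OS template.

```python
from pathlib import Path

# Hypothetical layout: goal in the title, then two sections.
TEMPLATE = "# Task: {goal}\n\n## Decisions\n\n## Current state\n(not started)\n"
STATE_HEADING = "## Current state"


def start_log(path: Path, goal: str) -> None:
    """Create the log before the task begins."""
    path.write_text(TEMPLATE.format(goal=goal), encoding="utf-8")


def add_decision(path: Path, decision: str) -> None:
    """Append one line to the Decisions section."""
    text = path.read_text(encoding="utf-8")
    marker = "\n" + STATE_HEADING
    head, tail = text.split(marker, 1)
    path.write_text(f"{head.rstrip()}\n- {decision}\n{marker}{tail}", encoding="utf-8")


def set_state(path: Path, state: str) -> None:
    """Overwrite the Current state section at the end of a session."""
    text = path.read_text(encoding="utf-8")
    head = text.split(STATE_HEADING, 1)[0]
    path.write_text(f"{head}{STATE_HEADING}\n{state}\n", encoding="utf-8")
```

Decisions only ever accumulate, while the current state is overwritten, which mirrors the difference between “what we settled” and “where we stopped.”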

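The full Agentic OS Work Log template, including phase definitions, the gate evidence format, and the handoff structure, is in the repo (github.com/KbWen/agentic-os). If the first 10 to 15 minutes of every new conversation go to re-explaining background instead of doing work, that is the time to add a Work Log.

This post is part of the Agentic OS series. Related reading: 只用 Prompt 和技能，也能做到基本治理 covers a lighter-weight approach, and the Work Log adds a layer on top of that foundation. AI 代理常見痛點與我們的嘗試 is the entry point to the series.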

Prior art: what distributed systems already knows

KbWen — Fri, 22 May 2026 16:00:00 +0800

TL;DR: The governance problems that make AI agents unpredictable (unverified completions, state loss between sessions, unconstrained scope) are structurally identical to problems distributed systems engineering solved with audit logs, delivery acknowledgment, state machines, and least-privilege access. The one genuine difference is non-determinism: an agent given the same open-ended task twice will do something different, which means governance needs to front-load constraints rather than just catch failures after. But the rest of the pattern library applies directly.

If you have built a message queue, you have hit a version of this bug: a worker picks up a job, does the work, then fails before sending the acknowledgment. The queue marks it undelivered. The job runs again. Now you have a duplicate record, a double email, or worse, depending on what “the job” was.

The fix is well-understood: require the worker to produce evidence of completion that the system can verify externally. Don’t trust the worker’s internal state. Trust the artifact.
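The duplicate-delivery bug and one common guard against it can be reproduced in a few lines. This is a toy in-memory sketch under stated assumptions: `AckQueue`, `send_email_once`, and the idempotency record are hypothetical names, and a real broker would handle redelivery through its own visibility-timeout mechanism.

```python
from collections import deque


class AckQueue:
    """In-memory queue that redelivers any job that was never acknowledged."""

    def __init__(self) -> None:
        self._pending: deque[tuple[str, str]] = deque()
        self._in_flight: dict[str, str] = {}

    def put(self, job_id: str, payload: str) -> None:
        self._pending.append((job_id, payload))

    def take(self) -> tuple[str, str]:
        job_id, payload = self._pending.popleft()
        self._in_flight[job_id] = payload
        return job_id, payload

    def ack(self, job_id: str) -> None:
        del self._in_flight[job_id]

    def requeue_unacked(self) -> None:
        """Simulate a visibility timeout: unacknowledged jobs go back."""
        for job_id, payload in self._in_flight.items():
            self._pending.append((job_id, payload))
        self._in_flight.clear()


sent: list[str] = []          # the externally visible side effect
done_keys: set[str] = set()   # idempotency record, kept outside the worker


def send_email_once(job_id: str, payload: str) -> None:
    """Perform the side effect at most once per job id."""
    if job_id in done_keys:
        return  # duplicate delivery: the effect already happened
    sent.append(payload)
    done_keys.add(job_id)
```

If the worker crashes after `send_email_once` but before `ack`, the job comes back, and the external record is what prevents the second email. The worker’s own memory of having done the work plays no part.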

When an AI agent says “done” and you have no artifact to check against, that’s the same design gap. The previous post in this series has a concrete example: the agent said the feature was done, I asked for the commit SHA, there wasn’t one, and two of the three modules it described implementing hadn’t changed. A capability failure looks like wrong reasoning. This wasn’t one: the agent completed exactly what it was given, through its own completion criterion, because no external one existed. The fix is in the surrounding structure.

Distributed systems already solved the worker-reliability problem. The patterns map directly.

What agent execution looks like from the outside

Strip the language model out for a moment. What’s left?

A task arrives. A worker picks it up, performs operations, and signals completion. The orchestrator decides what to do next.

Standard async task pipeline. The governance questions are the same ones distributed systems have always asked: Did the work actually happen? What state is the system in now? What was the worker allowed to touch?

The answers (delivery acknowledgment, audit logs, state machines, capability sandboxing) aren’t novel. They exist because systems without them fail in predictable, documented ways. Agent deployments running without that structure encounter the same failure modes.

The pattern mapping

| Distributed systems pattern | Agent governance equivalent |
|---|---|
| Delivery acknowledgment | Every task completion requires an external verifiable artifact: commit SHA, test output, file path |
| Idempotency key | Task dispatch is deduplicated: same task classified and scoped the same way, regardless of retry |
| Audit log / event sourcing | Work Log: decisions recorded at the time they happen, not reconstructed from memory later |
| State machine with explicit transitions | Phase gate: plan before implementing, review before shipping, with real entry/exit conditions |
| Least privilege / capability sandbox | Agent’s tool access scoped to what the specific task requires, not everything available |
| Resource quota | Task classification that routes work to an appropriately sized execution path before it begins |

The Agentic OS framework is essentially this table implemented as a working system, not because it invented these patterns, but because building it kept arriving at the same structural answers distributed systems already had. The evidence requirement feels new until you recognize it as a CI gate. The work log feels novel until you recognize it as event sourcing. The insight isn’t original; it’s just overdue.
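The “state machine with explicit transitions” row is the easiest to show in code. The sketch below is an assumption about the shape of such a gate, not the framework’s implementation: `PhaseGate` and `GateError` are hypothetical, and the phase names are borrowed from the Work Log sample elsewhere in this series.

```python
from dataclasses import dataclass, field

PHASES = ("bootstrap", "plan", "implement", "review")


class GateError(Exception):
    """Raised when a phase transition is not allowed."""


@dataclass
class PhaseGate:
    current: str = PHASES[0]
    evidence: dict[str, str] = field(default_factory=dict)

    def complete(self, phase: str, artifact: str) -> str:
        """Close the current phase with an artifact and return the next phase."""
        if phase != self.current:
            raise GateError(f"cannot close {phase!r} while in {self.current!r}")
        if not artifact.strip():
            raise GateError("no evidence, no completion")
        self.evidence[phase] = artifact
        index = PHASES.index(phase)
        # After the last phase the task is "done" and accepts no more closes.
        self.current = PHASES[index + 1] if index + 1 < len(PHASES) else "done"
        return self.current
```

Skipping ahead and closing a phase with an empty artifact both fail the same way: the transition simply does not happen.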

The one place the analogy breaks

Distributed systems assume deterministic workers. Same input, same output, retry is safe.

Agents aren’t deterministic, at least not for open-ended tasks. The same prompt, the same tools, the same context: execution goes somewhere different. Sometimes better. Often just different. For well-scoped sub-tasks (“run these tests and report failures,” “format this JSON to this schema”), retry still works fine. But for the tasks where governance matters most (feature implementation, refactoring decisions, scope-touching work), retry isn’t a recovery strategy; it’s another roll.

This is what Pete Hodgson’s analysis of AI coding tools points toward: when a problem has many valid solutions, the probability that an agent independently lands on the one you wanted approaches zero. The governance implication is that task decomposition is itself a governance act. Break work into pieces small enough that non-determinism is contained. Then front-load the constraints on the pieces that remain open-ended: define what “done” means, specify which files are in scope, classify the task before the first tool call.

The circuit breaker in distributed systems stops a cascade after failures accumulate. The agent equivalent is not letting the cascade start.
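“Specify which files are in scope” can be checked mechanically once the task declares its patterns up front. A minimal sketch, assuming the changed-file list comes from something like `git diff --name-only` and that scope is expressed as glob patterns; `out_of_scope` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files: list[str], allowed_patterns: list[str]) -> list[str]:
    """Return changed paths that match none of the allowed glob patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]
```

Note that in `fnmatch`-style patterns `*` also matches path separators, so `src/auth/*` covers nested directories as well. A non-empty result is a reason to stop and ask, not proof that the change was wrong.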

Where to instrument

Distributed systems tell you to instrument at the transition points: message intake, worker pickup, task completion, downstream dispatch. These are where state changes happen and where failures manifest.

The agent equivalent:

Task intake: Is this classified correctly? What phase path follows? What tools does it need, and only those?
Phase completion: What artifact exists to prove this phase is done? Is it external to the conversation?

The third transition point is worth more than a bullet. Session boundary is the agent-specific failure mode that has no clean distributed-systems equivalent: it’s closer to a stateless worker that loses its in-memory state and reprocesses from the queue head on restart. An IEEE Spectrum report on AI coding tools documented the pattern: in longer sessions, agents increasingly regenerated functions that already existed and ignored conventions established earlier. The fix is identical to the queue case: persistent state external to the worker. In agent terms: a work log that records decisions at the time they’re made, so the next session inherits context instead of reconstructing it.

Which gaps cost the most

The distributed systems frame doesn’t just explain why agent governance looks the way it does — it tells you which gaps cost the most.

Missing completion verification produces the cheapest failures: you find out fast. Missing scope constraints produce the expensive ones: the agent did three things you didn’t ask for, two of which were correct, and now you’re auditing which is which. Missing session state produces the hidden ones: the agent solved a problem you already solved, using a pattern you already decided against, because it had no way to know.

If you’re choosing where to add structure first: start with scope. The task intake gate is the circuit breaker — it constrains what the agent can reach before it runs. The work log is the audit trail you need after something goes wrong. The completion artifact is the acknowledgment the queue was never getting.

Add them in that order.

This post is part of a series on building real AI systems. The previous post, Why AI Agents Go Wrong: It’s Not the Model, covers the capability vs. governance failure taxonomy that motivates this framing. Next: No Evidence, No Completion takes the evidence requirement as a standalone principle and shows what it looks like in practice. The framework is open source at github.com/KbWen/agentic-os.

Why AI Agents Go Wrong: It's Not the Model

KbWen — Fri, 22 May 2026 12:00:00 +0800

TL;DR: “The agent did something wrong” usually gets diagnosed as a model problem. Most of the time it isn’t. Capability failures (wrong reasoning) and governance failures (no structure to catch wrong reasoning) look identical from the outside but need completely different fixes. This post is about telling them apart, and why most teams are currently solving the wrong one.

The agent said the feature was done. I asked for the commit SHA. There wasn’t one. When I checked the branch, two of the three modules it described implementing hadn’t changed.

The instinct in that moment is to reach for a better prompt, a smarter model, maybe a different tool call. That instinct is usually wrong.

What happened wasn’t a reasoning failure. The agent completed exactly the task it was given, interpreted through its own completion criterion, because no explicit one existed. There was no audit trail to check what it actually did. There was no scope boundary to constrain what “done” even meant. The model behaved correctly inside a system that gave it no structure to behave correctly toward.

That’s a governance failure, not a capability failure. And the fix is not a better model.

Two failure modes that look the same

When an agent produces bad output, the failure is almost always categorized as one thing: the AI got it wrong. Which leads to one solution category: better AI.

The problem is that “the AI got it wrong” conflates two distinct failure modes that have nothing to do with each other.

Capability failure: the model reasoned incorrectly. It missed a constraint, hallucinated a fact, drew a wrong inference. The fix lives in the model layer: better prompt, better retrieval, better fine-tuning, sometimes a more capable model.

Governance failure: the system had no invariant to catch or prevent what the agent did. The agent may have reasoned perfectly well and still produced a wrong outcome, because the surrounding structure gave it nothing to constrain against.

There’s a useful diagnostic test: would a smarter model have prevented this?

If yes, if the failure was clearly about incorrect reasoning or a factual miss, that’s a capability failure.

If no, if a brilliant expert given the same underspecified task would have made the same wrong choice, or a different wrong choice, because the task itself had no defined success condition, that’s a governance failure. Upgrading the model doesn’t help.

Most of the “unpredictable agent” complaints I’ve seen are governance failures. The problem gets framed as model unreliability because that’s what’s visible. The actual cause is invisible: the absence of structure.

The five structural gaps

These are the governance gaps that show up repeatedly, not as edge cases, but as the default state of most agent deployments. The zh-TW companion post AI 代理常見痛點與我們的嘗試 goes deeper on each one with narrative examples. Here I want to name the structural invariant that’s missing in each case.

Output not verifiable → no completion criterion or audit trail. The agent says “done.” You have no artifact to check against. The agent’s word that something happened is not evidence that it happened. The missing invariant: every task completion requires an attached evidence artifact: a file path, a commit SHA, a test result, something external to the conversation.

Steps skipped → no phase gate. Given a complex task, agents move toward output by the shortest path. Scope-setting, dependency mapping, impact analysis (anything that doesn’t look like “doing the thing”) gets skipped. The missing invariant: phases with entry and exit conditions that must be satisfied before proceeding. Pete Hodgson has written about this from an angle worth noting: when a problem has many valid solutions, the probability that an agent independently arrives at the one you actually wanted approaches zero. Pre-alignment isn’t overhead. It’s the phase gate that prevents redoing work.

Cross-session amnesia → no state handoff mechanism. Every new conversation is a blank slate. Decisions made in session one are unknown in session two. The agent rediscovers problems you’ve already solved, proposes patterns you’ve already rejected, rebuilds context you’ve already paid to build. An IEEE Spectrum report on AI coding tools documented this concretely: in longer sessions, agents increasingly regenerated functions that already existed and ignored conventions established earlier in the same session. The missing invariant: a structured work log that carries decisions forward across session boundaries. The mechanism we use is stupid-simple. It’s essentially forcing the agent to keep a diary. That description isn’t flattering, but cross-session amnesia is real enough that stupid-simple works.

Unbounded token cost → no resource scoping. An agent given a large task will read everything it can find, activate every relevant capability, and use as much context as the task allows it to justify. Without resource scoping, costs are unpredictable and you have no way to set expectations before a task starts. The missing invariant: task classification that routes to appropriately sized execution paths before the task begins.

Scope creep → no capability boundary. This is the quietest failure mode. The agent does what you asked, and also reorganizes a module you didn’t ask it to touch, and also “helpfully” updates a config file while it was in the neighborhood. Security researcher Johann Rehberger (Embrace the Red) made this failure mode concrete in April 2025 when he spent $500 testing Devin AI’s response to embedded instructions in GitHub issues, then reported the results to Cognition: 84–85% of attacks succeeded in getting the agent to execute actions outside the intended scope. That’s an extreme case, but the everyday version of this (the agent quietly expanding what “done” means) is the same structural gap. The missing invariant: explicit capability boundaries that define what the agent is allowed to do, not just what it’s been asked to do.

None of these gaps are model problems. A more capable model, given the same absent structure, makes the same category of errors, just more convincingly.
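The last two invariants, resource scoping and capability boundaries, can be sketched as a single routing table that is consulted before any tool call. The classifications, budgets, and tool names below are illustrative assumptions, not values from Agentic OS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionPath:
    token_budget: int
    allowed_tools: frozenset[str]


# Hypothetical routing table; budgets and tool names are illustrative only.
PATHS = {
    "quick-win": ExecutionPath(20_000, frozenset({"read_file", "edit_file"})),
    "feature": ExecutionPath(60_000, frozenset({"read_file", "edit_file", "run_tests"})),
}


def authorize(classification: str, tool: str, tokens_used: int) -> bool:
    """Allow a tool call only inside the path's tool set and token budget."""
    path = PATHS.get(classification)
    if path is None:
        return False  # unclassified work does not run
    return tool in path.allowed_tools and tokens_used <= path.token_budget
```

The point of the sketch is the order of operations: classification happens first, and everything the agent may touch or spend follows from it.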

Engineering already solved these problems

These aren’t new problems with new solutions. They’re old problems that software engineering solved decades ago, applied to a different execution substrate.

| Governance gap | Engineering equivalent |
|---|---|
| No completion criterion | CI gate: no merge without passing checks |
| No phase gate | PR review requirement: code doesn’t ship without sign-off |
| No state handoff | Audit log / ADR: decisions are recorded, not reconstructed |
| No resource scoping | Budget / SLA: bounded cost before work starts |
| No capability boundary | Principle of least privilege: access limited to what the task requires |

The analogy isn’t decorative. These are the same structural mechanisms. Building the evidence requirement for Agentic OS, I kept writing things that felt novel until I realized I was describing CI gates and audit logs with different names. The insight wasn’t new. It was just late.

A CI gate doesn’t trust the developer’s word that the tests pass. It requires evidence. An audit log records decisions at the time they’re made, so they don’t need to be reconstructed from memory later. Least privilege limits what an agent can touch, not out of distrust, but to contain the blast radius when something goes wrong.

The AGENTS.md convention, now adopted by tools such as Cursor and GitHub Copilot as a standard way for agents to load project context (Claude Code uses its own CLAUDE.md for the same purpose), is essentially a machine-readable project governance document. It’s the same idea as a team’s architecture decision record, but in a format the agent reads automatically. That’s not a coincidence. It’s the same structural need surfacing in a new context.

What’s missing in most agent deployments isn’t better AI. It’s the application of mechanisms that software engineering already knows work.

What governance actually costs

“Adding structure” sounds like adding overhead. It’s worth being concrete about the actual numbers.

We measured governance overhead across several task types in Agentic OS v1.1 (April 2026, using chars/4 as the token estimation formula; actual counts vary by ±10% depending on tokenizer). For a quick-win task (something like fixing a date format in a CSV export), the governance overhead came to 17,041 tokens. For a complex feature touching API design, authentication, and database schema, it came to 50,975 tokens.

Those numbers sound large until you compare them to the cost of an ungoverned failure. A governance failure typically means: an undetected wrong completion that gets discovered later, a context restart, redone work, and scope cleanup. None of those costs are bounded or predictable.

The governance overhead is bounded. It scales with task complexity in a predictable way: the lightest path costs roughly 17K tokens; the heaviest measured scenario costs under 62K. The cost of recovering from a scope error or a missed completion criterion is not bounded. It depends on when you find it.

This isn’t an argument for any particular framework. It’s an argument for the structure itself: known, upfront cost versus unbounded, discovery-time cost. That trade-off is the same one CI gates resolved for software deployment twenty years ago.
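The chars/4 estimate behind the figures above is simple enough to write down. A minimal sketch: the function names are hypothetical, and the choice of integer division and rounding is an assumption, since the benchmark note states only the formula and the ±10% variation.

```python
def estimate_tokens(text: str) -> int:
    """chars/4 token estimate (integer division is an assumption)."""
    return len(text) // 4


def estimate_range(text: str, tolerance: float = 0.10) -> tuple[int, int]:
    """Lower and upper bounds for the stated +/-10% tokenizer variation."""
    estimate = estimate_tokens(text)
    return round(estimate * (1 - tolerance)), round(estimate * (1 + tolerance))
```

For a 4,000-character context this gives an estimate of 1,000 tokens with a range of 900 to 1,100. It is an estimate of size, not a measurement from any particular tokenizer.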

The question to ask before the task starts

None of this requires a framework. The diagnostic test at the task level is simpler than that.

Before your next agent task: what artifact would prove this is done?

Not “what would it mean to be done.” That’s vague enough that the agent will fill in the answer. What artifact, specifically, would you point to afterward and say: here is the evidence this completed correctly?

If you can answer that question before the task starts, you have a completion criterion. If you can’t, you don’t, and the agent will invent one. That invented criterion is almost never the one you wanted. The everyday version doesn’t look like a security incident. It’s an agent that quietly refactored a module you didn’t mention, or updated a config file it found nearby. Its completion criterion included those things. Yours didn’t.

That’s the smallest possible governance structure. A definition of done, stated before work begins, tied to something observable.

The rest of the gaps (phase gates, state handoffs, resource scoping, capability boundaries) are the same logic applied at increasing scope. But they all start from the same place: deciding what “done” means before asking the agent to find out.

These observations are from building and using Agentic OS v1.1 (April 2026). The field moves fast — if a model capability has improved or a pattern here no longer holds, I want to know. The framework is open source and the issues are open: github.com/KbWen/agentic-os.

This post is part of a series on building real AI systems. Related reading: What Makes an AI Skill Different from a Prompt? covers the capability abstraction layer that sits below agent orchestration. The zh-TW companion post AI 代理常見痛點與我們的嘗試 covers the same failure catalogue with more narrative depth. Both build on Beyond Prompt: From Instructions to Building Systems.

Common AI agent pain points and what we tried (AI 代理常見痛點與我們的嘗試)

KbWen — Fri, 22 May 2026 10:00:00 +0800

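TL;DR: When AI agents go off the rails, the cause is usually not the model but a lack of sufficient structure. This post collects the pain points we’ve observed in practice and the directions Agentic OS takes to address them. We don’t claim this is the best approach, and the AI tools themselves are still evolving quickly.

If you’ve been using Claude Code, Cursor, or Copilot for a while, you probably know the feeling: sometimes it’s so fast you wonder why you still bother typing, and sometimes you stare at its output with a single thought: “Wait, what is it doing?”

The latter tends to leave the deeper impression. I’ve found that a few kinds of problems keep recurring, and they have little to do with which model or tool you use. They look more like structural challenges that come with letting AI agents take part in real development at all.

If you’ve read 從「下指令」到「蓋系統」, this post can be seen as an extension of that line of thinking: once you start using agents for real development, the cost of “not enough structure” becomes much more concrete.

Agentic OS stands on the shoulders of a lot of public work. The AGENTS.md convention originally came from the design of OpenAI Codex and was later widely adopted by mainstream AI tools such as Cursor and GitHub Copilot; Anthropic has its own CLAUDE.md; Cursor has .cursor/rules. Each represents a different tool’s attempt at the question of how to make an AI remember project rules. We drew on these designs, plus hands-on discussions from the Hacker News and Reddit communities and failure-mode analyses by engineers such as Pete Hodgson, Addy Osmani, and Thorsten Ball, and tried to integrate them into something useful for ourselves. The framework is more a product of integration and experimentation than something invented from scratch.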

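Pain points that keep recurring

The following comes from pitfalls we hit ourselves, along with some collective observations from the community. It isn’t rigorous research; it’s a practitioner’s notes.

Output is hard to verify

After the AI finishes a task, what you usually get is a piece of text saying “done” or “feature implemented.” The question is what “done” is based on. In a single short conversation this isn’t a big problem, but once a task spans multiple sessions, or you later need to trace where a decision came from, you often find nothing: no commit SHA, no test output, nothing you can point at and say “it’s right here.” There is only the conversation log, and the conversation log doesn’t count.

This problem later shaped our framework design directly. Agentic OS has a rule: even the act of re-reading the same document must leave a receipt. It sounds fussy, but without it, “I read it” and “I didn’t read it” look exactly the same in the record.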
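What a read receipt might contain can be sketched as one JSON line per read, appended to a log. The receipt fields and the `record_read` helper are assumptions made for illustration; the actual receipt format in Agentic OS may differ.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def record_read(receipts: Path, document: Path) -> dict:
    """Append one JSON line recording which version of a document was read."""
    receipt = {
        "path": str(document),
        # The content hash ties the receipt to a specific version of the file.
        "sha256": hashlib.sha256(document.read_bytes()).hexdigest(),
        "read_at": datetime.now(timezone.utc).isoformat(),
    }
    with receipts.open("a", encoding="utf-8") as log:
        log.write(json.dumps(receipt) + "\n")
    return receipt
```

Reading the same file twice leaves two lines, so “read once,” “read twice,” and “never read” are distinguishable after the fact.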

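Skipping intermediate steps

Give an AI a task and its natural tendency is to head straight for the result. That’s fine for small tasks. But when a task gets a bit more complex, say it touches frontend, backend, and database at once, the skipped steps (“confirm scope first,” “list the affected modules”) usually have to be paid back later at a higher price. Engineer Pete Hodgson notes in his writing that when a problem has many different solutions, the probability that the AI picks the one you had in mind approaches zero. Aligning on direction up front has nothing to do with model capability; it’s a process problem.

Continuity across conversations

In the post on the limits of prompts, I said that an AI “lives only inside that one chat window.” That limitation is felt more sharply when using agents for ongoing development. In every new conversation you have to re-explain the background: what the project’s architectural decisions are, which design pattern was chosen last time, which pitfalls were already hit. This isn’t just a nuisance; it causes the same problems to be rediscovered and the same decisions to be re-debated. An IEEE Spectrum report noted that late in long sessions, AI noticeably more often regenerates functions that already exist and ignores coding conventions established earlier, which is essentially a context-dilution problem.

Uncertain resource usage

AI agents read files, call tools, and produce output. All of these have a cost, and the spread can be large. In the Agentic OS v1.1 benchmark (measured April 2026) we ran several real scenarios: a quick-win task (for example, fixing a CSV format issue) consumed an estimated 17,041 tokens; a complex feature covering API, authentication, and database came to about 51,000 tokens, nearly three times as much. These numbers come from specific task types and tool combinations. The estimation formula we used is chars / 4, which is close to most OpenAI tokenizers but not identical, so results under different models and context strategies may differ significantly.

To complicate things, this calculation now has another variable. Mainstream models, including Claude and OpenAI’s lineup, already have prompt caching, which under some conditions can substantially cut the cost of re-reading the same context. That means many of our original design assumptions about how to control context-reading strategy need to be re-examined. We’re still watching how this evolves, and older advice may no longer apply.

Blurry scope

This kind of problem is harder to describe because it doesn’t necessarily raise an error; it just quietly does things you didn’t ask for. Security researcher Johann Rehberger (who writes as Embrace the Red) spent $500 testing Devin AI’s resistance to prompt injection and reported the results to Cognition, Devin’s developer, in April 2025. The tests showed that malicious instructions embedded in GitHub issues could get Devin to perform operations outside its intended scope, with an overall attack success rate of 84–85%. That’s an extreme example, but the ordinary version of “the AI decides the task boundary itself” happens every day: it quietly edits one extra config file, or refactors a module you never said to touch while it’s in the area.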

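What we’re trying

The starting point of Agentic OS is an attempt to add some structure around these problems. The main ideas run in a few directions:

We call the core principle “No Evidence = No Completion.” The idea itself isn’t new: CI/CD gates in software engineering do exactly this, and we’ve just moved it into the AI agent workflow. Every task delivery must come with some form of evidence; it doesn’t have to be elaborate, but there has to be something to check. The process also varies with the size of the task: a one-line change takes a lightweight path; feature development takes a fuller process with plan, implement, and review phases. This tiered design partly draws on practices shared by the Anthropic and Cursor communities, adjusted into a version that’s more practical for us.

Use a Work Log to maintain continuity. Each task has a corresponding work record that captures key decisions and current state, so the next session can pick up rather than start over. It’s a crude method (basically forcing the AI to keep a diary), but until we find a better way, it’s still reasonably useful.

As for resource allocation, we try to map different task classifications to different skill-loading strategies instead of reading everything at once. But as mentioned earlier, the evolution of model caching is forcing some adjustments to this part of the design, and older strategies may no longer be effective.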

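Some honest words

The framework is useful, but not without problems. Some designs, in hindsight, weren’t necessarily the best decisions; they just looked reasonable at the time. Addy Osmani calls this the “70% problem”: AI can quickly get you to 70% completion, but the remaining 30% tends to need more engineering judgment, not less. Designing a governance framework is the same: structure helps you avoid a lot of pitfalls, but it doesn’t change the fact that you still have to make design decisions.

The pace at which AI tools evolve gives any solidified solution a shelf-life problem. Some problems we tried to solve at design time may now be partly handled by the models themselves; conversely, new situations we didn’t anticipate keep emerging. We position Agentic OS as a continuously evolving experiment, not a converged answer. This series will take the framework’s mechanisms apart one by one. If you’re also figuring out how to make AI agents more controllable and traceable in real development work, I hope some of this is a useful reference.

Next: 只用 Prompt 和技能也能做好治理：實用技巧與範例

Agentic OS is an open-source project. You’re welcome to look at how we implemented it and to point out anything you think is wrong: github.com/KbWen/agentic-os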