TL;DR: “No evidence, no completion” is a single structural principle: a task isn’t done until the agent produces an artifact that exists outside the conversation and can be checked independently. It sounds trivial. In practice it closes most of the common agent failure modes in one rule, because the act of specifying what evidence looks like, before the task runs, forces you to define what “done” actually means.
In the previous post in this series I described an agent that reported a feature as done. When a commit SHA was requested, none existed, and two of the three modules were unchanged. The failure had a name: no external completion criterion existed, so the agent supplied its own. That gap has a one-rule fix.
What “evidence” means here
Evidence is any artifact that exists outside the conversation and can be verified independently of what the agent said.
A commit SHA is evidence. A test output is evidence. A file path with a checksum is evidence. A screenshot of a passing CI run is evidence.
“I implemented it” is not evidence. “The feature is working” is not evidence. A description of what the agent did is not evidence: it’s the agent’s own assessment of its work, which is exactly what you’re trying to verify.
The distinction matters because conversation text is not auditable. It exists only within the session, can’t be pointed to by anyone who wasn’t there, and doesn’t prove the underlying state of the system. An artifact external to the conversation can be checked at any time, by anyone, against the actual state.
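To make the difference concrete, here is a minimal Python sketch of checking one of the artifacts listed above, a file path with a checksum. The function name `verify_file_evidence` and the choice of SHA-256 are assumptions made for illustration, not something the framework prescribes.

```python
import hashlib
from pathlib import Path


def verify_file_evidence(path: str, expected_sha256: str) -> bool:
    """Return True only if the file exists and its SHA-256 matches the claim.

    Hypothetical helper: the name and the hash algorithm are illustrative.
    """
    file = Path(path)
    if not file.is_file():
        # The claimed artifact does not exist, so the claim fails outright.
        return False
    actual = hashlib.sha256(file.read_bytes()).hexdigest()
    # hexdigest() is lowercase, so normalise the claimed value before comparing.
    return actual == expected_sha256.lower()
```

The check never reads the agent's message. It only reads the file system, which is what makes the result independent of what the agent said.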
Why one rule covers so much
The first post in this series catalogued five structural gaps: no completion criterion, no phase gate, no state handoff, no resource scoping, no capability boundary. The evidence principle doesn’t replace all of them, but it forces the most important one: you cannot specify what evidence looks like without first deciding what “done” means.
If the evidence for a feature task is “passing tests + commit SHA on the feature branch,” you’ve implicitly defined the completion criterion, the scope boundary (the feature branch, not the main codebase), and a checkpoint for the phase gate. The evidence requirement is the handle that pulls the rest of the structure into place.
This is why the distributed systems framing maps so cleanly: delivery acknowledgment in a message queue is exactly this pattern. The queue doesn’t trust the worker’s internal state; it requires an external signal that the job completed. Decades of production systems run on that principle because systems without it fail in the same predictable way.
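The acknowledgment pattern can be sketched as a toy in-memory queue. This is an illustration only: the class `AckQueue` and its method names are hypothetical, and real message brokers add timeouts and persistence that this sketch omits.

```python
from collections import deque


class AckQueue:
    """Toy queue: a job is only considered finished after an explicit ack."""

    def __init__(self, jobs):
        self.pending = deque(jobs)  # jobs waiting to be handed out
        self.in_flight = []         # jobs handed out but not yet acknowledged

    def receive(self):
        """Hand the next job to a worker and track it as unacknowledged."""
        job = self.pending.popleft()
        self.in_flight.append(job)
        return job

    def ack(self, job):
        """External signal that the job completed; only now is it dropped."""
        self.in_flight.remove(job)

    def redeliver_unacked(self):
        """Anything never acknowledged goes back on the queue."""
        self.pending.extend(self.in_flight)
        self.in_flight.clear()
```

A worker that receives a job and goes silent does not count as having finished it. The queue treats missing acknowledgment as incomplete work, which is the same stance the evidence rule takes toward an agent's claim of completion.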
Before the task, not after
The principle works when it’s applied before the task starts, not as a review step after.
“What would prove this is done?” asked before the work begins forces a design decision. It’s not a check on the agent — it’s a check on the task specification. If you can’t answer it, the task isn’t specified well enough to run. If you can answer it but the answer is vague (“the feature works”), the vagueness is in your specification, not in the agent’s execution.
This is the mechanism Pete Hodgson’s analysis of AI coding tools points toward: when a problem has many valid solutions, the agent will pick one. That one will probably be valid. It probably won’t be the one you wanted. Specifying evidence before the task runs is a way of narrowing the solution space — the agent’s output has to satisfy the evidence criterion, which eliminates the paths that don’t.
In practice: “implement email verification” with no evidence criterion produces one kind of output. “Implement email verification — done when: (1) tests pass for OTP generation and expiry, (2) commit SHA on feat/email-verification” produces a different one. Same model. Different structure around it.
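One way to make the evidence criterion part of the task itself is to refuse to run a task that has none. The sketch below uses the email verification example from above. The `TaskSpec` class and its `done_when` field are hypothetical names, not the framework's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """Hypothetical task record: a description plus its evidence criteria."""

    description: str
    done_when: list[str] = field(default_factory=list)

    def validate(self) -> None:
        """Reject the task before it runs if no evidence criterion is set."""
        if not self.done_when:
            raise ValueError(
                f"Task {self.description!r} has no evidence criterion"
            )


spec = TaskSpec(
    description="Implement email verification",
    done_when=[
        "tests pass for OTP generation and expiry",
        "commit SHA on feat/email-verification",
    ],
)
spec.validate()  # passes: the task states what would prove it is done
```

Calling `validate()` on a spec with an empty `done_when` raises before any work starts, which puts the failure in the specification rather than in the agent's output.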
What good evidence looks like
Evidence should be:
External to the conversation. It can be retrieved or verified by someone who wasn’t in the session. A commit SHA can be looked up. A test output can be reproduced. A URL can be visited.
Specific enough to be falsifiable. “Tests pass” is weaker than “running npm test returns exit 0 with 47 tests passing.” The second can be false in a way that “tests pass” can’t — which is the point. If the evidence criterion can’t be falsified, it’s not doing the work.
Proportional to the task. A one-line bug fix doesn’t need a full audit trail. The evidence for a tiny fix is the commit SHA and a grep confirming the old string is gone. The evidence for a feature touching auth, API, and database schema is more involved: test output, migration SHA, API contract diff. The Agentic OS framework classifies tasks before they run partly to route to the appropriate evidence format: a quick-win task and an architecture-change task need different levels of proof.
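The falsifiable version of "tests pass" can be written as a check that is able to fail. The sketch below assumes the test runner prints a summary line of the form "47 passing", which depends on the runner behind `npm test`; the function name `check_test_run` is hypothetical.

```python
import re


def check_test_run(exit_code: int, output: str, expected_passing: int) -> bool:
    """Return True only if the run exited 0 and reported the expected count.

    Assumes the runner prints a summary such as "47 passing"; adjust the
    pattern for a runner that reports results in a different format.
    """
    if exit_code != 0:
        return False
    match = re.search(r"(\d+) passing", output)
    return match is not None and int(match.group(1)) == expected_passing
```

The exit code and output would come from actually running the command, for example with `subprocess.run(["npm", "test"], capture_output=True, text=True)`, and never from the agent's description of the run.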
The cost of specifying evidence
Specifying evidence costs something up front. It takes maybe two minutes to think through “what would prove this is done” before a task starts. That’s real overhead.
The comparison is with recovery cost. A governance failure (marking a task complete when it didn't actually complete, or completing it the wrong way) typically costs four things: discovering the error, rebuilding context, rerunning the work, and auditing scope. None of those costs is bounded. The two minutes up front are.
The Agentic OS v1.1 benchmark (April 2026, using chars/4 as the token estimation formula, ±10%) measured governance overhead for a quick-win task at roughly 17,000 tokens: the cost of the full structured lifecycle, evidence requirement included. For a complex feature spanning API design, auth, and database schema, it’s around 51,000 tokens. Those numbers are real costs. They’re also the ceiling. The cost of an undetected wrong completion has no ceiling — it depends on when you find it and how much work built on top of it.
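For reference, the chars/4 estimate is simple enough to write down. This is a sketch under two assumptions that the benchmark description does not spell out: that the division is rounded down, and that ±10% is an error band around the estimate. Both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Estimate token count as characters divided by 4, rounded down."""
    return len(text) // 4


def estimate_range(text: str, tolerance: float = 0.10) -> tuple[int, int]:
    """Return (low, high) bounds around the estimate for the given tolerance."""
    estimate = estimate_tokens(text)
    return round(estimate * (1 - tolerance)), round(estimate * (1 + tolerance))
```

Under these assumptions, 68,000 characters of lifecycle text would be estimated at 17,000 tokens, with bounds of 15,300 and 18,700.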
The question to ask before your next task
Before you give an agent its next task: what artifact would prove this is done?
Not “what would it mean to be done” — that’s vague enough for the agent to fill in. What specific artifact, external to the conversation, would you point to afterward and say: here is the evidence this completed correctly.
If you have an answer, you have a completion criterion. If you don’t, you’re delegating the definition of “done” to the agent. It will define one, and that definition will almost never match yours.
This post is part of a series on building real AI systems. The previous posts cover the two-failure taxonomy and the distributed systems prior art that motivates the evidence requirement. The framework is open source at github.com/KbWen/agentic-os.