TL;DR: An agent’s “done” sounds confident whether the work happened or not. So don’t judge whether to trust the sentence. Flip the burden: completion means the agent shows you something you can check yourself, sized to the task. It doubles as a spec check: if you can’t say what would prove a task is done, you haven’t specified it well enough to hand off.
You ask an agent to change something. A minute later it comes back: “Done. I updated the validation logic in authService, added handling for token expiry, and covered the edge cases.”
It reads like a report from someone who finished the work, which is exactly why it’s so easy to wave through.
The same “done” can sit on top of three very different states. The work happened. The work half-happened and the agent routed around the part that didn’t. Or the agent misread the request and confidently built the wrong thing. In the text those three are almost indistinguishable, because fluent text is what a language model produces by default: “I did A, B, and C” reads just as smoothly whether or not A, B, and C occurred. The sentence is a description of the work, generated by the same system whose work you’re trying to check.
With people, we let that slide for a good reason. Someone who can explain a problem cleanly has usually thought it through, so fluent explanation tracks competence and we trust it. With a model the link breaks: fluency is its native output, and whether the work is right is a separate thing the smooth sentence tells you almost nothing about. There’s a second pull in the same direction. In my own use, the unprompted report almost always closes the task as success — you rarely get a volunteered “I couldn’t finish this part.” So the reporting is fluent and tilted toward “done,” which makes a confident claim of success the easiest possible thing to accept.
I’ve seen the gap up close. The opening case in Why AI Agents Go Wrong was exactly this: the agent reported a feature done, there was no commit SHA behind it, and two of the three modules it described changing were untouched. Fluent report, partial work, and nothing in the sentence told them apart.
Flip the burden of proof
A better lie detector won’t help here. Move the burden instead: don’t trust the claim by default, and treat completion as something the agent demonstrates with an artifact you can check yourself: the test result, the changed file, the list of sources.
If the agent says it tested something, have it rerun the command and paste the output. If it says it changed a file, ask which file and which lines, and read the diff. If it says it “researched ten sources and summarized them,” ask for the ten links. The thing you can point at is the unit of completion, not the sentence about it.
Asking surfaces more than the artifact
Here’s the part that’s easy to miss, and the reason the habit earns its keep beyond catching outright lies.
Press for the test output and you sometimes get: “actually, the tests aren’t running yet, there’s a setup issue.” That admission was available the whole time. It didn’t appear in the unprompted report because the report just summarized the work after the fact. Asking for the artifact forces the agent to actually attempt the thing it described, and the attempt is where the setup failure surfaces, because now the failing step is in front of the agent instead of compressed into a past-tense “done.”
So requesting evidence does a second job. It surfaces the thing the agent mentions while assembling the proof — the part it would otherwise have quietly stepped over.
Size the proof to the task
The proof has to fit the task, or you won’t keep doing it. Different task sizes need different proof:
- A typo fix: one grep, is the old string gone? One line.
- A feature: the test output with the actual numbers (how many passed, how many failed), not the words “tests pass.”
- A refactor: a diff plus the existing tests still passing, because behavior holding constant is the whole definition of a refactor; if the tests went red, you broke something and didn’t notice.
- A schema migration: the migration log line plus a query against the migrated table coming back in the new shape, run against a real database, not a dry run.
That’s why Agentic OS sorts a task into tiers (tiny-fix, quick-win, feature, hotfix, architecture-change) before it starts: the proof a typo fix owes you isn’t the proof a migration owes you, and classifying up front lets the expected evidence match the work.
Good proof is the kind you could catch being wrong. “Tests pass”: you can’t check that. “Ran npm test, 47 passed, exit 0”: you can, in a specific way, and that’s what makes it worth anything. A claim you couldn’t catch being wrong hasn’t done any verification work; it’s just “trust me” relocated to a new sentence. (The structural version of this, proof external to the conversation and proportional to the task, is the whole argument in No evidence, no completion. This post is the behavior underneath it.)
What would prove this is done?
“What would prove this is done?” looks like a question about the agent. What it actually probes is your own task definition.
If you can’t answer it, if you can’t say what finished looks like, that’s not the agent being slippery. The task isn’t specified well enough to hand off. Which is why the habit earns most of its keep before the work starts, not as an after-the-fact catch. Ask it up front: when this is done, what will I point at to call it done? Answer that and you have a completion criterion. Skip it and you’ve handed the agent the definition of “done.” It will pick one, and it almost never matches the one in your head.
Where the habit holds, and where it doesn’t
This only catches problems you knew to look for. An agent failing somewhere you never thought to check slips straight past it, and that gap is closed by knowing the task well, not by asking for evidence.
What asking does catch is the ordinary middle: work that’s wrong in a checkable way, whether or not anything felt off. My read is that this middle is where most bad “done"s live, and catching a good chunk of them for the price of one question is already a good trade.
That trade is also why the proof has to stay small. A check you can’t sustain stops being a check at all; an over-heavy one is the one you quietly stop running. (Same bounded-cost reasoning as the token economics of governance: you pay a small known cost to bound an unknown one.)
When asking every time gets tiring, that’s the moment people reach for automation. An evidence gate in a framework is just this question turned into a step that runs by default, with the receipts recorded and hash-chained into an append-only audit log so they can’t be walked back (how to make AI agents follow the process covers that part). The habit here is the step before the gate, the moment when no receipt has been demanded yet and trusting the “done” is entirely a judgment call.
So I’ve stopped treating it as a checkpoint and more as a small reflex. Before you accept a “done,” name the one thing you’d point at — then make it show you.
Agentic OS is open source: github.com/KbWen/agentic-os
Read next
- No evidence, no completion — the structural sibling: completion as a principle that requires a verifiable artifact, and how it pulls scope and checkpoints in behind it
- Why AI Agents Go Wrong: It’s Not the Model — the case that opens this whole thread: a fluent “done” with no commit SHA behind it
- How to make AI agents follow the process — once a receipt is demanded, how it gets hash-chained into an append-only audit log so it can’t be walked back



