TL;DR: The governance problems that make AI agents unpredictable (unverified completions, state loss between sessions, unconstrained scope) are structurally identical to problems distributed systems engineering solved with audit logs, delivery acknowledgment, state machines, and least-privilege access. The one genuine difference is non-determinism: an agent given the same open-ended task twice will do something different, which means governance needs to front-load constraints rather than only catch failures after the fact. The rest of the pattern library applies directly.


If you have built a message queue, you have hit a version of this bug: a worker picks up a job, does the work, then fails before sending the acknowledgment. The queue marks it undelivered. The job runs again. Now you have a duplicate record, a double email, or worse, depending on what “the job” was.

The fix is well-understood: require the worker to produce evidence of completion that the system can verify externally. Don’t trust the worker’s internal state. Trust the artifact.

When an AI agent says “done” and you have no artifact to check against, that’s the same design gap. The previous post in this series has a concrete example: the agent said the feature was done, I asked for the commit SHA, there wasn’t one, and two of the three modules it described implementing hadn’t changed. A capability failure looks like wrong reasoning. This was not that: the agent completed exactly what it was given, by its own completion criterion, because no external one existed. The fix is in the surrounding structure.
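A minimal sketch of “trust the artifact,” assuming the orchestrator can hash files before the agent starts. The function names (`digest`, `snapshot`, `unverified_claims`) are hypothetical and not taken from any framework; a commit SHA or test output could serve as the artifact just as well.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List, Optional


def digest(path: Path) -> Optional[str]:
    """SHA-256 of a file's contents, or None if the file does not exist."""
    if not path.is_file():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(paths: Iterable[str]) -> Dict[str, Optional[str]]:
    """Record the state of each file before the agent starts work."""
    return {p: digest(Path(p)) for p in paths}


def unverified_claims(
    before: Dict[str, Optional[str]], claimed: Iterable[str]
) -> List[str]:
    """Return the files the agent claims to have changed that show no change.

    A claimed path that was never snapshotted is treated as previously
    absent, so it verifies only if the file exists now.
    """
    return [p for p in claimed if digest(Path(p)) == before.get(p)]
```

If `unverified_claims` returns anything, the orchestrator treats the task as not done, whatever the agent reported.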

Distributed systems already solved the worker-reliability problem. The patterns map directly.


What agent execution looks like from the outside

Strip the language model out for a moment. What’s left?

A task arrives. A worker picks it up, performs operations, and signals completion. The orchestrator decides what to do next.

Standard async task pipeline. The governance questions are the same ones distributed systems have always asked: Did the work actually happen? What state is the system in now? What was the worker allowed to touch?

The answers (delivery acknowledgment, audit logs, state machines, capability sandboxing) aren’t novel. They exist because systems without them fail in predictable, documented ways. Agent deployments running without that structure encounter the same failure modes.
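The third question, what the worker was allowed to touch, can be sketched as a per-task allowlist wrapped around a tool registry. This is an illustration under assumptions: `ScopedTools`, `ToolNotPermitted`, and the registry shape are hypothetical names, not an existing API.

```python
from typing import Callable, Dict, FrozenSet


class ToolNotPermitted(Exception):
    """Raised when a task tries to use a tool outside its allowlist."""


class ScopedTools:
    """Wraps a tool registry so a task can only call what it was granted."""

    def __init__(
        self,
        registry: Dict[str, Callable[..., object]],
        allowed: FrozenSet[str],
    ) -> None:
        # Fail at intake, not mid-task, if the allowlist names a missing tool.
        unknown = allowed - set(registry)
        if unknown:
            raise ValueError(f"allowlist names unknown tools: {sorted(unknown)}")
        self._registry = registry
        self._allowed = allowed

    def call(self, name: str, *args, **kwargs):
        """Invoke a tool only if this task was granted it."""
        if name not in self._allowed:
            raise ToolNotPermitted(name)
        return self._registry[name](*args, **kwargs)
```

The agent is handed the `ScopedTools` instance rather than the full registry, so an out-of-scope call fails loudly instead of silently succeeding.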


The pattern mapping

| Distributed systems pattern | Agent governance equivalent |
| --- | --- |
| Delivery acknowledgment | Every task completion requires an external verifiable artifact: commit SHA, test output, file path |
| Idempotency key | Task dispatch is deduplicated: same task classified and scoped the same way, regardless of retry |
| Audit log / event sourcing | Work Log: decisions recorded at the time they happen, not reconstructed from memory later |
| State machine with explicit transitions | Phase gate: plan before implementing, review before shipping, with real entry/exit conditions |
| Least privilege / capability sandbox | Agent’s tool access scoped to what the specific task requires, not everything available |
| Resource quota | Task classification that routes work to an appropriately sized execution path before it begins |

The Agentic OS framework is essentially this table implemented as a working system, not because it invented these patterns, but because building it kept arriving at the same structural answers distributed systems already had. The evidence requirement feels new until you recognize it as a CI gate. The work log feels novel until you recognize it as event sourcing. The insight isn’t original; it’s just overdue.
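To make one row concrete, here is a rough sketch of a phase gate as a state machine with explicit transitions. The phase names and the `PhaseGate` class are assumptions made for illustration; this is not the Agentic OS implementation.

```python
from enum import Enum
from typing import Dict, Optional


class Phase(Enum):
    PLAN = "plan"
    IMPLEMENT = "implement"
    REVIEW = "review"
    SHIP = "ship"


# Allowed transitions; a phase with no entry here is terminal.
NEXT: Dict[Phase, Phase] = {
    Phase.PLAN: Phase.IMPLEMENT,
    Phase.IMPLEMENT: Phase.REVIEW,
    Phase.REVIEW: Phase.SHIP,
}


class GateError(Exception):
    """Raised when a phase transition is attempted without meeting its gate."""


class PhaseGate:
    def __init__(self) -> None:
        self.phase = Phase.PLAN
        self.evidence: Dict[Phase, str] = {}

    def advance(self, artifact: Optional[str]) -> Phase:
        """Leave the current phase only if an exit artifact is supplied."""
        if self.phase not in NEXT:
            raise GateError(f"{self.phase.value} is terminal")
        if not artifact:
            raise GateError(f"no exit artifact for {self.phase.value}")
        self.evidence[self.phase] = artifact
        self.phase = NEXT[self.phase]
        return self.phase
```

The exit condition here is only “an artifact reference was supplied”; a real gate would also verify that the artifact exists.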


The one place the analogy breaks

The retry patterns in distributed systems assume deterministic, idempotent workers: same input, same output, so retry is safe.

Agents aren’t deterministic, at least not for open-ended tasks. The same prompt, the same tools, the same context: execution goes somewhere different. Sometimes better. Often just different. For well-scoped sub-tasks (“run these tests and report failures,” “format this JSON to this schema”), retry still works fine. But for the tasks where governance matters most (feature implementation, refactoring decisions, scope-touching work), retry isn’t a recovery strategy; it’s another roll.

This is what Pete Hodgson’s analysis of AI coding tools points toward: when a problem has many valid solutions, the probability that an agent independently lands on the one you wanted approaches zero. The governance implication is that task decomposition is itself a governance act. Break work into pieces small enough that non-determinism is contained. Then front-load the constraints on the pieces that remain open-ended: define what “done” means, specify which files are in scope, classify the task before the first tool call.
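Front-loading those constraints can be sketched as a task spec that is validated before dispatch and checked again against what actually changed. The `TaskSpec` fields, the classification labels, and the glob-based scope check are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Tuple


@dataclass(frozen=True)
class TaskSpec:
    kind: str                        # classification, e.g. "feature" (labels are hypothetical)
    done_when: str                   # the external completion criterion
    files_in_scope: Tuple[str, ...]  # glob patterns the task may touch


def intake_errors(spec: TaskSpec) -> List[str]:
    """Return the reasons this task should not be dispatched yet."""
    errors = []
    if not spec.kind.strip():
        errors.append("task is unclassified")
    if not spec.done_when.strip():
        errors.append("no definition of done")
    if not spec.files_in_scope:
        errors.append("no files in scope")
    return errors


def out_of_scope(spec: TaskSpec, changed: List[str]) -> List[str]:
    """Return changed files that match none of the in-scope patterns."""
    return [
        f for f in changed
        if not any(fnmatchcase(f, pat) for pat in spec.files_in_scope)
    ]
```

`intake_errors` runs before the first tool call; `out_of_scope` runs afterward, against the list of files the task actually modified.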

The circuit breaker in distributed systems stops a cascade after failures accumulate. The agent equivalent is not letting the cascade start.


Where to instrument

Distributed systems tell you to instrument at the transition points: message intake, worker pickup, task completion, downstream dispatch. These are where state changes happen and where failures manifest.

The agent equivalent:

  • Task intake: Is this classified correctly? What phase path follows? What tools does it need, and only those?
  • Phase completion: What artifact exists to prove this phase is done? Is it external to the conversation?

The third transition point is worth more than a bullet. The session boundary is the agent-specific failure mode with no clean distributed-systems equivalent; the closest analogue is a stateless worker that loses its in-memory state and reprocesses from the queue head on restart. An IEEE Spectrum report on AI coding tools documented the pattern: in longer sessions, agents increasingly regenerated functions that already existed and ignored conventions established earlier. The fix is the same as in the queue case: persistent state external to the worker. In agent terms, that is a work log that records decisions at the time they’re made, so the next session inherits context instead of reconstructing it.
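A work log of this kind can be as small as an append-only file of JSON lines. The sketch below assumes one JSON object per line and hypothetical field names (`ts`, `decision`, `reason`); it illustrates the idea and is not the framework’s actual log format.

```python
import json
import time
from pathlib import Path
from typing import Dict, List


def record_decision(log_path: Path, decision: str, reason: str) -> None:
    """Append one decision to the log at the moment it is made."""
    entry = {"ts": time.time(), "decision": decision, "reason": reason}
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def load_decisions(log_path: Path) -> List[Dict[str, object]]:
    """Read every recorded decision, oldest first, for a new session."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A new session would call `load_decisions` first and place the result in the agent’s context, so earlier choices are inherited rather than rediscovered.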


Which gaps cost the most

The distributed systems frame doesn’t just explain why agent governance looks the way it does; it also ranks the gaps by what they cost you.

Missing completion verification produces the cheapest failures: you find out fast. Missing scope constraints produce the expensive ones: the agent did three things you didn’t ask for, two of which were correct, and now you’re auditing which is which. Missing session state produces the hidden ones: the agent solved a problem you already solved, using a pattern you already decided against, because it had no way to know.

If you’re choosing where to add structure first: start with scope. The task intake gate does the circuit breaker’s job ahead of time: it constrains what the agent can reach before it runs. The work log is the audit trail you need after something goes wrong. The completion artifact is the acknowledgment the queue was never getting.

Add them in that order.

This post is part of a series on building real AI systems. The previous post, Why AI Agents Go Wrong: It’s Not the Model, covers the capability vs. governance failure taxonomy that motivates this framing. Next: No Evidence, No Completion takes the evidence requirement as a standalone principle and shows what it looks like in practice. The framework is open source at github.com/KbWen/agentic-os.