<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Distributed Systems on KbWen Blog</title>
    <link>https://www.kbwen.com/tags/distributed-systems/</link>
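    <description>KbWen's personal technical blog, sharing study notes and hands-on experience with Python, machine learning, deep learning, data engineering, and AI development.</description>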
    <generator>Hugo</generator>
    <language>zh-tw</language>
    <image>
      <url>https://www.kbwen.com/images/og-default.png</url>
      <title>KbWen Blog</title>
      <link>https://www.kbwen.com/</link>
    </image>
    
    <lastBuildDate>Fri, 22 May 2026 16:00:00 +0800</lastBuildDate>
    <atom:link href="https://www.kbwen.com/tags/distributed-systems/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Prior art: what distributed systems already knows</title>
      <link>https://www.kbwen.com/ai-agent-governance-distributed-systems-prior-art/</link>
      <pubDate>Fri, 22 May 2026 16:00:00 +0800</pubDate>
      <dc:creator>KbWen</dc:creator>
      <guid>https://www.kbwen.com/ai-agent-governance-distributed-systems-prior-art/</guid>
      <description>AI agent governance maps onto distributed systems patterns: audit logs, delivery acknowledgment, idempotency, least privilege. The prior art already exists.</description>
      <content:encoded><![CDATA[<blockquote>
<p><strong>TL;DR:</strong> The governance problems that make AI agents unpredictable (unverified completions, state loss between sessions, unconstrained scope) are structurally identical to problems distributed systems engineering solved with audit logs, delivery acknowledgment, state machines, and least-privilege access. The one genuine difference is non-determinism: an agent given the same open-ended task twice will do something different, which means governance needs to front-load constraints rather than just catch failures afterward. But the rest of the pattern library applies directly.</p>
</blockquote>
<hr>
<p>If you have built a message queue, you have hit a version of this bug: a worker picks up a job, does the work, then fails before sending the acknowledgment. The queue marks it undelivered. The job runs again. Now you have a duplicate record, a double email, or worse, depending on what &ldquo;the job&rdquo; was.</p>
<p>The fix is well understood: require the worker to produce evidence of completion that the system can verify externally. Don&rsquo;t trust the worker&rsquo;s internal state. Trust the artifact.</p>
<p>When an AI agent says &ldquo;done&rdquo; and you have no artifact to check against, that&rsquo;s the same design gap. <a href="/why-ai-agents-fail-without-governance/">The previous post in this series</a> has a concrete example: the agent said the feature was done, I asked for the commit SHA, there wasn&rsquo;t one, and two of the three modules it described implementing hadn&rsquo;t changed. A capability failure looks like wrong reasoning. This wasn&rsquo;t one: the agent completed exactly what it was given, by its own completion criterion, because no external one existed. The fix belongs in the surrounding structure, not in the model.</p>
<p>Distributed systems already solved the worker-reliability problem. The patterns map directly.</p>
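<p>A minimal sketch of &ldquo;trust the artifact&rdquo;, assuming the artifact is a commit SHA. The names <code>CompletionClaim</code>, <code>accept_completion</code>, and <code>commit_exists</code> are hypothetical and chosen for illustration; they are not taken from any framework mentioned in this post.</p>

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class CompletionClaim:
    """What the agent reports when it says a task is done (hypothetical shape)."""
    task_id: str
    commit_sha: Optional[str]  # the external artifact; None means no evidence offered


def accept_completion(claim: CompletionClaim,
                      commit_exists: Callable[[str], bool]) -> bool:
    """Acknowledge a task only if its artifact can be verified outside the agent.

    `commit_exists` is injected so the check runs against the repository,
    not against anything the agent said about the repository.
    """
    if not claim.commit_sha:
        return False  # "done" with no artifact is treated as not delivered
    return commit_exists(claim.commit_sha)


# Stand-in for a real repository lookup, used only for this example.
known_commits = {"3f2a9c1"}
lookup = known_commits.__contains__

print(accept_completion(CompletionClaim("task-1", "3f2a9c1"), lookup))  # True
print(accept_completion(CompletionClaim("task-2", None), lookup))       # False
```

<p>In a real repository, <code>commit_exists</code> could wrap a call such as <code>git cat-file -e</code> on the reported SHA; that wiring is left out here. The design point is that the orchestrator, not the worker, decides whether the acknowledgment counts.</p>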
<hr>
<h2 id="what-agent-execution-looks-like-from-the-outside">What agent execution looks like from the outside</h2>
<p>Strip the language model out for a moment. What&rsquo;s left?</p>
<p>A task arrives. A worker picks it up, performs operations, and signals completion. The orchestrator decides what to do next.</p>
<p>Standard async task pipeline. The governance questions are the same ones distributed systems have always asked: Did the work actually happen? What state is the system in now? What was the worker allowed to touch?</p>
<p>The answers (delivery acknowledgment, audit logs, state machines, capability sandboxing) aren&rsquo;t novel. They exist because systems without them fail in predictable, documented ways. Agent deployments running without that structure encounter the same failure modes.</p>
<hr>
<h2 id="the-pattern-mapping">The pattern mapping</h2>
<table>
  <thead>
      <tr>
          <th>Distributed systems pattern</th>
          <th>Agent governance equivalent</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Delivery acknowledgment</td>
          <td>Every task completion requires an external verifiable artifact: commit SHA, test output, file path</td>
      </tr>
      <tr>
          <td>Idempotency key</td>
          <td>Task dispatch is deduplicated: same task classified and scoped the same way, regardless of retry</td>
      </tr>
      <tr>
          <td>Audit log / event sourcing</td>
          <td>Work Log: decisions recorded at the time they happen, not reconstructed from memory later</td>
      </tr>
      <tr>
          <td>State machine with explicit transitions</td>
          <td>Phase gate: plan before implementing, review before shipping, with real entry/exit conditions</td>
      </tr>
      <tr>
          <td>Least privilege / capability sandbox</td>
          <td>Agent&rsquo;s tool access scoped to what the specific task requires, not everything available</td>
      </tr>
      <tr>
          <td>Resource quota</td>
          <td>Task classification that routes work to an appropriately sized execution path before it begins</td>
      </tr>
  </tbody>
</table>
<p>The <a href="https://github.com/KbWen/agentic-os">Agentic OS framework</a> is essentially this table implemented as a working system, not because it invented these patterns, but because building it kept arriving at the same structural answers distributed systems already had. The evidence requirement feels new until you recognize it as a CI gate. The work log feels novel until you recognize it as event sourcing. The insight isn&rsquo;t original; it&rsquo;s just overdue.</p>
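<p>Two rows of the table can be sketched in a few lines. This is an illustration under stated assumptions, not the framework&rsquo;s code: the phase names, the <code>idempotency_key</code> helper, and the <code>advance</code> function are all hypothetical. The first part derives a stable key from a task&rsquo;s classification and scope so that a retry is recognized as the same task; the second is a phase gate that refuses any transition that is out of order or has no exit artifact.</p>

```python
import hashlib
import json
from enum import Enum
from typing import Optional, Sequence


def idempotency_key(task_type: str, scope: Sequence[str]) -> str:
    """Same classification and scope always produce the same key, so a retry
    of the same task can be recognized instead of dispatched twice."""
    payload = json.dumps({"type": task_type, "scope": sorted(scope)}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Phase(Enum):
    # Phase names are assumptions based on "plan, implement, review, ship".
    PLAN = "plan"
    IMPLEMENT = "implement"
    REVIEW = "review"
    SHIP = "ship"


# Explicit transitions: each phase may only advance to the one listed here.
NEXT_PHASE = {
    Phase.PLAN: Phase.IMPLEMENT,
    Phase.IMPLEMENT: Phase.REVIEW,
    Phase.REVIEW: Phase.SHIP,
}


def advance(current: Phase, target: Phase, exit_artifact: Optional[str]) -> Phase:
    """Move to `target` only if it is the legal next phase and the current
    phase left an artifact behind (a plan file, a diff, a review note)."""
    if NEXT_PHASE.get(current) is not target:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    if not exit_artifact:
        raise ValueError(f"phase {current.value} has no exit artifact")
    return target
```

<p>Sorting the scope before hashing means the order in which files are listed does not change the key. Skipping from plan straight to ship raises an error rather than silently succeeding, which is the whole point of making the transitions explicit.</p>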
<hr>
<h2 id="the-one-place-the-analogy-breaks">The one place the analogy breaks</h2>
<p>The retry-based patterns above assume a worker that behaves deterministically: same input, same output, so retry is safe.</p>
<p>Agents aren&rsquo;t deterministic, at least not for open-ended tasks. The same prompt, the same tools, the same context: execution goes somewhere different. Sometimes better. Often just different. For well-scoped sub-tasks (&ldquo;run these tests and report failures,&rdquo; &ldquo;format this JSON to this schema&rdquo;), retry still works fine. But for the tasks where governance matters most (feature implementation, refactoring decisions, scope-touching work), retry isn&rsquo;t a recovery strategy; it&rsquo;s another roll.</p>
<p>This is what <a href="https://blog.thepete.net/blog/2025/05/22/why-your-ai-coding-assistant-keeps-doing-it-wrong-and-how-to-fix-it/">Pete Hodgson&rsquo;s analysis of AI coding tools</a> points toward: when a problem has many valid solutions, the probability that an agent independently lands on the one you wanted approaches zero. The governance implication is that task decomposition is itself a governance act. Break work into pieces small enough that non-determinism is contained. Then front-load the constraints on the pieces that remain open-ended: define what &ldquo;done&rdquo; means, specify which files are in scope, classify the task before the first tool call.</p>
<p>The circuit breaker in distributed systems stops a cascade after failures accumulate. The agent equivalent is not letting the cascade start.</p>
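<p>Front-loading constraints could look something like the sketch below. Everything in it is an assumption made for illustration: the <code>TaskSpec</code> shape, the field names, the tool names, and the classification labels. It fixes the definition of done, the files in scope, and the allowed tools at intake, then checks tool calls and changed files against that spec.</p>

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class TaskSpec:
    """Constraints fixed at intake, before the first tool call (hypothetical shape)."""
    classification: str               # e.g. "quick-fix" or "feature"; labels are assumptions
    done_when: str                    # human-readable definition of "done"
    files_in_scope: Tuple[str, ...]   # glob patterns the task may touch
    allowed_tools: FrozenSet[str]     # least privilege: only what this task needs


def tool_allowed(spec: TaskSpec, tool: str) -> bool:
    """A tool call is permitted only if the spec listed it up front."""
    return tool in spec.allowed_tools


def out_of_scope(spec: TaskSpec, changed_files: List[str]) -> List[str]:
    """Return the changed files that match none of the in-scope patterns."""
    return [
        path for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in spec.files_in_scope)
    ]


spec = TaskSpec(
    classification="feature",
    done_when="tests under tests/auth pass",
    files_in_scope=("src/auth/*", "tests/auth/*"),
    allowed_tools=frozenset({"read_file", "edit_file", "run_tests"}),
)

print(tool_allowed(spec, "deploy"))  # False
print(out_of_scope(spec, ["src/auth/login.py", "src/billing/invoice.py"]))
# ['src/billing/invoice.py']
```

<p>The spec is frozen on purpose: if the agent could widen its own scope mid-task, the constraint would no longer be front-loaded.</p>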
<hr>
<h2 id="where-to-instrument">Where to instrument</h2>
<p>Distributed systems tell you to instrument at the transition points: message intake, worker pickup, task completion, downstream dispatch. These are where state changes happen and where failures manifest.</p>
<p>The agent equivalent:</p>
<ul>
<li><strong>Task intake</strong>: Is this classified correctly? What phase path follows? What tools does it need, and only those?</li>
<li><strong>Phase completion</strong>: What artifact exists to prove this phase is done? Is it external to the conversation?</li>
</ul>
<p>The third transition point is worth more than a bullet. <strong>Session boundary</strong> is the agent-specific failure mode with no exact distributed-systems equivalent. The closest analogue is a worker that keeps its progress only in memory, loses it on restart, and reprocesses from the queue head. <a href="https://spectrum.ieee.org/ai-coding-degrades">An IEEE Spectrum report on AI coding tools</a> documented the pattern: in longer sessions, agents increasingly regenerated functions that already existed and ignored conventions established earlier. The fix is the same as in the queue case: persistent state external to the worker. In agent terms: a work log that records decisions at the time they&rsquo;re made, so the next session inherits context instead of reconstructing it.</p>
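<p>A work log of this kind can be as small as an append-only file. The sketch below assumes a JSON Lines file and two hypothetical helpers, <code>record_decision</code> and <code>load_context</code>; the file layout and field names are illustrative choices, not a documented format.</p>

```python
import json
import os
from datetime import datetime, timezone
from typing import List


def record_decision(log_path: str, task_id: str, decision: str) -> None:
    """Append one decision to the work log at the moment it is made."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "task_id": task_id,
        "decision": decision,
    }
    # Append mode: earlier entries are never rewritten, only added to.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")


def load_context(log_path: str, task_id: str) -> List[str]:
    """Read back earlier decisions for a task at the start of a new session."""
    if not os.path.exists(log_path):
        return []
    with open(log_path, encoding="utf-8") as log:
        entries = [json.loads(line) for line in log if line.strip()]
    return [e["decision"] for e in entries if e["task_id"] == task_id]
```

<p>A new session would call <code>load_context</code> before doing anything else and place the result in the agent&rsquo;s starting context, so a decision such as &ldquo;we already rejected this pattern&rdquo; survives the restart.</p>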
<hr>
<h2 id="which-gaps-cost-the-most">Which gaps cost the most</h2>
<p>The distributed systems frame doesn&rsquo;t just explain why agent governance looks the way it does; it also tells you which gaps cost the most.</p>
<p>Missing completion verification produces the cheapest failures: you find out fast. Missing scope constraints produce the expensive ones: the agent did three things you didn&rsquo;t ask for, two of which were correct, and now you&rsquo;re auditing which is which. Missing session state produces the hidden ones: the agent solved a problem you already solved, using a pattern you already decided against, because it had no way to know.</p>
<p>If you&rsquo;re choosing where to add structure first, start with scope. The task intake gate does the circuit breaker&rsquo;s job up front: it constrains what the agent can reach before it runs. The work log is the audit trail you need after something goes wrong. The completion artifact is the acknowledgment the queue was never getting.</p>
<p>Add them in that order.</p>
<p><em>This post is part of a series on building real AI systems. The previous post, <a href="/why-ai-agents-fail-without-governance/">Why AI Agents Go Wrong: It&rsquo;s Not the Model</a>, covers the capability vs. governance failure taxonomy that motivates this framing. Next: <a href="/no-evidence-no-completion-verification-principle/">No Evidence, No Completion</a> takes the evidence requirement as a standalone principle and shows what it looks like in practice. The framework is open source at <a href="https://github.com/KbWen/agentic-os">github.com/KbWen/agentic-os</a>.</em></p>
]]></content:encoded>
    </item>
    
  </channel>
</rss>
