TL;DR: An agent skill behaves about as predictably as its boundary is specified. Described as a capability (“the agent can now do X”), a skill tends to drift. Described as a contract (declared inputs, declared outputs, a scope it promises not to exceed), it behaves more like a well-designed API. The interface-design habits engineers already have (stable contracts, explicit scope, versioning) seem to transfer directly. The agentic-os framework’s own direction points the same way: skills as versioned, capability-bounded packages, with boundaries enforced rather than the model’s reasoning micromanaged.


Every team that has shipped a public API has lived through the same lesson: an endpoint that does “roughly what you’d expect” is a liability. The contract (what it accepts, what it returns, what it won’t touch) is the product. The implementation behind it is replaceable. Agent skills seem to be arriving at the same lesson, just faster.

A skill, in Anthropic’s Agent Skills standard (now shared across Claude Code, Codex, and Cursor), is at its core a SKILL.md file: instructions plus frontmatter that controls when it loads and who invokes it. That format makes it tempting to think of a skill as “a prompt the agent can reuse.” That framing is where a lot of the unpredictability starts.

A skill is a contract, not a capability list

The usual way to describe a skill is by capability: “this one runs the test suite,” “this one deploys.” It reads naturally and it sets you up to be surprised. A capability has no edges. “Can deploy” says nothing about what it will read, what it will change, or what it will do when the situation doesn’t match the happy path.

A contract has edges by construction. It names what goes in, what comes out, and what stays out of reach. The agentic-os framework’s own architecture notes put it more bluntly than I would: skills should be treated as “versioned, policy-governed, capability-bounded packages,” not “loose prompt files.” The same idea runs through the companion post What Makes an AI Skill Different from a Prompt?: the value is in the specification, not the raw capability.
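To make “edges by construction” concrete, here is a minimal sketch of a contract written down as data. Everything in it is an assumption for illustration: the `SkillContract` class, its field names, and the `run-tests` example are hypothetical and are not part of the Agent Skills standard or of any framework’s API.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class SkillContract:
    """Hypothetical contract record; the field names are illustrative only."""
    name: str
    version: str                      # e.g. "1.0.0"
    inputs: dict[str, type]           # what goes in
    outputs: dict[str, type]          # what comes out
    allowed_tools: frozenset[str]     # what it may call
    allowed_paths: tuple[str, ...]    # directory roots it may touch

    def check_inputs(self, payload: dict) -> list[str]:
        """Return a list of violations; an empty list means the payload conforms."""
        problems = []
        for key, expected in self.inputs.items():
            if key not in payload:
                problems.append(f"missing input: {key}")
            elif not isinstance(payload[key], expected):
                problems.append(f"wrong type for {key}")
        # Anything the contract did not declare is rejected, not silently accepted.
        for key in sorted(payload.keys() - self.inputs.keys()):
            problems.append(f"undeclared input: {key}")
        return problems

    def may_touch(self, path: str) -> bool:
        """True only if the path sits under one of the declared roots."""
        target = PurePosixPath(path)
        if ".." in target.parts:      # refuse paths that try to climb out of a root
            return False
        return any(target.is_relative_to(root) for root in self.allowed_paths)


run_tests = SkillContract(
    name="run-tests",
    version="1.0.0",
    inputs={"target": str},
    outputs={"passed": bool, "report": str},
    allowed_tools=frozenset({"pytest"}),
    allowed_paths=("tests", "src"),
)

print(run_tests.check_inputs({"target": "tests/unit", "force": True}))
# ['undeclared input: force']
print(run_tests.may_touch("deploy/prod.yaml"))
# False
```

The capability description (“runs the test suite”) would fit in the `name` field alone. The other five fields are the edges.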

The principles that transfer

If a skill is an interface, then the design vocabulary engineers already use mostly carries over:

| Interface concept | Skill design equivalent |
| --- | --- |
| Request / response schema | Declared inputs and declared outputs |
| Endpoint scope (touches these resources, not those) | Tool, path, and network boundaries |
| Versioning (semver) | A skill version, so callers know what they’re getting |
| Breaking change | Scope drift: the skill quietly starts doing more |
| Deprecation | An explicit lifecycle state, not silent rot |

None of this is novel to anyone who has designed an interface. The only new move is recognizing that a skill is one, and that “be more careful” is not a substitute for a specified boundary, just as “use the API responsibly” was never a substitute for a schema.

Scope drift is a breaking change

The most common failure I see isn’t a skill that can’t do its job. It’s a skill that does its job plus two things nobody asked for, because its scope was never drawn.

The framework’s design principle for this is worth borrowing: boundary enforcement over behavior micromanagement. You don’t fix a vague skill by adding more instructions about how to reason well. You constrain what it can reach: which files, which tools, which budget. A skill whose scope widens over time is shipping a breaking change with no version bump, and the caller (you) finds out the way callers always find out: in production. This is the same capability-boundary gap behind several of the common agent pitfalls; scope is just the place it shows up most expensively.
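One way to keep drift from shipping silently is to compare declared scope between releases and refuse a release that reaches further without a major version bump. The sketch below assumes scope is recorded as sets of names per dimension (tools, paths) and that versions follow a plain `MAJOR.MINOR.PATCH` pattern. The function names are hypothetical, and treating any widening as breaking is this post’s policy, stricter than standard semver, which would allow added functionality in a minor release.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split a plain 'MAJOR.MINOR.PATCH' string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def scope_widened(old: dict[str, set[str]], new: dict[str, set[str]]) -> dict[str, set[str]]:
    """Per scope dimension, return what the new scope reaches that the old one did not."""
    added = {}
    for dimension, reach in new.items():
        extra = reach - old.get(dimension, set())
        if extra:
            added[dimension] = extra
    return added


def check_release(
    old_version: str,
    old_scope: dict[str, set[str]],
    new_version: str,
    new_scope: dict[str, set[str]],
) -> list[str]:
    """Flag a release that widens scope without bumping the major version."""
    added = scope_widened(old_scope, new_scope)
    old_major = parse_version(old_version)[0]
    new_major = parse_version(new_version)[0]
    if added and new_major <= old_major:
        return [
            f"{dimension} widened by {sorted(extra)} without a major bump"
            for dimension, extra in sorted(added.items())
        ]
    return []


old_scope = {"tools": {"pytest"}, "paths": {"tests"}}
new_scope = {"tools": {"pytest", "git"}, "paths": {"tests"}}

print(check_release("1.2.0", old_scope, "1.3.0", new_scope))
# ["tools widened by ['git'] without a major bump"]
print(check_release("1.2.0", old_scope, "2.0.0", new_scope))
# []
```

Narrowing scope passes the check unchanged, which matches the intuition that a skill promising to touch less cannot surprise its caller in the same way.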

The boundary has a token price too

There’s a second reason to draw the edge tightly, and it connects to cost. Skills load progressively: the agent reads a skill’s small metadata to decide whether it’s relevant, and only pulls the full body when it is. A tightly bounded skill is cheaper to ignore: its contract is small and quick to probe, whereas a sprawling one drags its whole body into context only to find out it didn’t apply. Interface discipline and token discipline end up pointing in the same direction: a clear, small contract is both more predictable and less expensive.
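The cost argument can be sketched as a toy model. The assumptions are loud ones: `rough_tokens` uses a crude four-characters-per-token estimate, the keyword check stands in for the model’s relevance judgment, and `SkillEntry` and `context_cost` are hypothetical names, not how any particular agent implements progressive loading.

```python
from dataclasses import dataclass
from typing import Callable


def rough_tokens(text: str) -> int:
    """Crude estimate, assuming about four characters per token."""
    return max(1, len(text) // 4)


@dataclass
class SkillEntry:
    name: str
    description: str                 # the small metadata, always read
    load_body: Callable[[], str]     # the full instructions, fetched only on demand


def context_cost(skills: list[SkillEntry], is_relevant: Callable[[str], bool]) -> int:
    """Tokens spent: every description is probed, bodies load only when judged relevant."""
    cost = 0
    for skill in skills:
        cost += rough_tokens(skill.description)
        if is_relevant(skill.description):
            cost += rough_tokens(skill.load_body())
    return cost


tight = SkillEntry(
    name="run-tests",
    description="Run the pytest suite under tests/ and report failures.",
    load_body=lambda: "x" * 4000,    # placeholder standing in for a long skill body
)
vague = SkillEntry(
    name="helper",
    description="Helps with code, tests, deploys, and other project tasks.",
    load_body=lambda: "x" * 4000,
)


def mentions_deploy(description: str) -> bool:
    """Stand-in relevance check for a task that is about deploying."""
    return "deploy" in description.lower()


print(context_cost([tight], mentions_deploy))
print(context_cost([vague], mentions_deploy))
```

For a deploy task, the tightly described skill costs only its description, while the vaguely described one matches and pulls its whole body into context, whether or not it turns out to be the right tool.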

A skill with a fuzzy boundary isn’t really a capability. It’s an undocumented API, and it will surprise you for the same reasons undocumented APIs always have. I don’t think the right contract for a given skill is obvious up front. I usually find it by watching where a skill oversteps and drawing the line there. But the framing has held up: define what it promises, define what it won’t touch, and treat a change to either as the breaking change it is.

This post is part of a series on building real AI systems. Related: What Makes an AI Skill Different from a Prompt? and Beyond Prompts: From Giving Instructions to Building Systems. A Chinese companion piece on skill boundaries is Skill 邊界設計:從能力到合約. The framework is open source at github.com/KbWen/agentic-os.