kellerai.blog

The Assumption You Didn't Write Down

How invisible SDK behavioral contracts become production incidents.

KellerAI White Paper · In-Depth · Dependency & Contracts · May 2026 · ~29 min read

Context

On a routine Tuesday, an SDK upgrade went out. Type checks passed. The full unit-test suite passed. Linters were clean. CI was green. The deployment promoted to production, and within minutes the agent runtime started crashing on every conversation that issued a tool call. The crash signature was a RuntimeError raised from a callback the team had written months earlier and not touched since. The fix was a one-line revert. The incident closed in hours. The assumption that caused it remained, expressed not as a comment or a test, but as a version pin in pyproject.toml.

The assumption was this: that the SDK's can_use_tool callback fires only when the agent invokes AskUserQuestion. Under SDK version 0.1.52 and earlier, that assumption was true. Under 0.1.53 and later, the same callback fires for every tool the agent calls. The type signature did not change. The function name did not change. The argument shapes did not change. What changed was the behavioral dispatch contract, and Python's type system does not encode behavioral dispatch contracts.

The Finding

Behavioral contracts (the dispatch, payload, ordering, frequency, and authorization-delegation assumptions that surround every SDK callback) are production hazards that are systematically invisible to type checkers, linters, and unit tests, and the gap is structural rather than accidental. This paper diagnoses five contract types, walks through the anatomy of the can_use_tool case study line by line, surveys how Java, Rust, Go, and gRPC handle similar problems, proposes five detection signals that reliably indicate the presence of invisible contracts, and outlines five mitigation strategies: explicit guards with log-and-alert, TypedDict payloads with explicit key checks, behavioral contract tests per callback, a behavioral changelog, and upper-bound pins with audit annotations.

The pattern is universal: any function invoked by a framework rather than by the application has behavioral contracts that exceed its type signature, and the cost of skipping the mitigations is unbounded, paid in increments, on a schedule the SDK author chooses.

Tags:
SDK Behavioral Contracts · Invisible Assumptions in Code · Production Reliability
Cite this paper

KellerAI. (2026, May 20). The Assumption You Didn't Write Down: In Depth. KellerAI. https://kellerai.blog/assumption-you-didnt-write-down-in-depth

Paper Details
Category: Dependency & Contracts
Audience: Engineering teams, platform architects, and SDK maintainers responsible for observable and reliable agent system design
Method: Incident case study (KellerAI internal, kwp-isha.10, line-by-line analysis) + taxonomy of five contract types + five detection patterns + industry survey (Java @Contract, Rust unsafe, Go io.Reader, gRPC proto) + five mitigation strategies + regulatory overlay (SR 11-7, NIST AI RMF) + honest limits
Length: ~7,000 · ~29 min
Reading level: Technical
Sections: 14
References: 23
Version: v1.0 · Updated May 2026
Published: May 2026
Key Takeaways
  • Behavioral contracts (dispatch, payload, ordering, frequency, and authorization-delegation assumptions) are invisible to type systems and survive every conventional check (types, linters, unit tests) until production.
  • The taxonomy of five contract types, combined with five detection signals (type-ignore markers, bare dict subscript access, raise statements in callbacks, exact-version pins, hedging language in comments), converts an invisible risk category into a finite auditable list.
  • Five mitigations (log-and-alert guards, TypedDict payloads, behavioral contract tests, a behavioral changelog, and upper-bound pins) surface invisible contracts before they reach production, at a cost of roughly one day of focused engineering work per codebase.
Section 01

Executive Summary

On a routine Tuesday, an SDK upgrade went out. Type checks passed. The full unit-test suite passed. Linters were clean. CI was green. The deployment promoted to production, and within minutes the agent runtime started crashing on every conversation that issued a tool call. The crash signature was a RuntimeError raised from a callback the team had written months earlier and not touched since. The fix, recorded as commit 65f507a3, was a one-line revert. The incident closed in hours. The assumption that caused it remained, expressed not as a comment or a test, but as a version pin in pyproject.toml.

The assumption was this: that the SDK's can_use_tool callback fires only when the agent invokes AskUserQuestion. Under SDK version 0.1.52 and earlier, that assumption was true. Under 0.1.53 and later, the same callback fires for every tool the agent calls. The type signature did not change. The function name did not change. The argument shapes did not change. What changed was the behavioral dispatch contract, and Python's type system does not encode behavioral dispatch contracts. There was no place for the change to be detected automatically, because there was no place where the original behavior had been written down.

This paper argues that behavioral contracts — the dispatch, payload, ordering, and frequency assumptions that surround every SDK callback — are production hazards that are systematically invisible to type checkers, linters, and unit tests. We use the can_use_tool incident as a running case study, walk through five related patterns in the same codebase, survey how other language ecosystems handle the same problem, and propose detection and mitigation patterns that surface invisible contracts before they reach production.

A behavioral assumption that lives only in your head is a production incident with no fixed date.

The thesis
Section 02

The Silent Breakage Problem

The phrase "behavioral contract" is older than the SDKs we use today, but it has narrowed in the type-system literature to mean roughly what design-by-contract advocates 21 meant by it: the preconditions, postconditions, and invariants that a function promises but does not encode in its signature. We use it here in a broader, more operational sense: a behavioral contract is any property of a function or callback that callers depend on, that is not expressible in the type system the language gives them.

A type contract says: this function takes a str and a dict and returns a PermissionResult. A behavioral contract says: this function is invoked exactly once per user-facing question, only when the agent decides to ask, and never for ordinary tool calls. The first is checkable. The second is folklore.

Python's type system, even with the modern type-hint stack, enforces shape and not dispatch. A Callable[[str, dict, ToolPermissionContext], Awaitable[PermissionResult]] describes what arguments arrive and what comes back. It does not describe when the callable is invoked, which tools cause it to fire, how often it can be called within a single agent turn, or what keys the dict is expected to contain. Each of those is a separate, undocumented contract negotiated between the SDK author and the SDK user, and renegotiated — often silently — at every release.
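To make the shape-versus-dispatch distinction concrete, here is a minimal sketch with two hypothetical dispatchers that accept the same callback type. The names `dispatch_narrow`, `dispatch_wide`, and `PermissionCallback` are illustrative assumptions, not SDK APIs, and the callback signature is simplified (the real one also receives a context and returns a permission result object). A type checker accepts both dispatchers equally; only the runtime behavior differs.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical, simplified stand-in for the SDK's callback type.
PermissionCallback = Callable[[str, dict], Awaitable[bool]]


async def dispatch_narrow(callback: PermissionCallback, tool_calls: list[str]) -> list[str]:
    """Invoke the callback only for AskUserQuestion (the narrower contract)."""
    seen = []
    for name in tool_calls:
        if name == "AskUserQuestion":
            await callback(name, {})
            seen.append(name)
    return seen


async def dispatch_wide(callback: PermissionCallback, tool_calls: list[str]) -> list[str]:
    """Invoke the callback for every tool call (the widened contract)."""
    seen = []
    for name in tool_calls:
        await callback(name, {})
        seen.append(name)
    return seen


async def allow_everything(tool_name: str, input_data: dict) -> bool:
    # Satisfies PermissionCallback under either dispatcher.
    return True


calls = ["read_file", "AskUserQuestion", "run_bash"]
narrow_seen = asyncio.run(dispatch_narrow(allow_everything, calls))
wide_seen = asyncio.run(dispatch_wide(allow_everything, calls))
print(narrow_seen)
print(wide_seen)
```

Both dispatchers have identical signatures, so swapping one for the other produces no type error. The difference shows up only in which tool names the callback observes.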

A taxonomy of behavioral contracts

  1. Dispatch contract. Which events or conditions cause the callback to fire. The can_use_tool change is a dispatch contract change.
  2. Payload contract. Which keys, of which types, will be present in dict-shaped arguments. Stringly-typed input_data["questions"] access is a payload contract.
  3. Ordering contract. The sequence in which a series of callbacks fires for a single logical operation: for example, "pre-tool" before "post-tool", or "session-start" before any "message" callback.
  4. Calling-frequency contract. How many times a callback can fire per logical operation. Idempotency assumptions live here.
  5. Authorization-delegation contract. Which downstream component is the authoritative authorizer when a callback is reached. A change to which hook owns which tool (invisible at the callback's type boundary) silently re-routes authorization.

None of these contracts are checkable by mypy, pyright, ruff, or any production-grade Python tool we are aware of. They are checkable, in principle, by integration tests that exercise the real SDK against real inputs — but only if the test author knew the contract existed in the first place. The contracts that hurt are the ones nobody wrote down, because nobody noticed they were making them.

Section 03

Anatomy of an Invisible Contract

The clearest example in the codebase is the can_use_tool callback registered with the Claude Agent SDK 12. The relevant code lives in the agent callbacks module of KellerAI's internal codebase (lines 58–69):

callbacks.py:58–69 — can_use_tool
python
async def can_use_tool(
  self,
  tool_name: str,
  input_data: dict,
  context: ToolPermissionContext,
) -> PermissionResultAllow | PermissionResultDeny:
  """SDK callback. Interrupts on first call, delivers answers on re-call."""
  del context
  logger.debug(f"can_use_tool called for {tool_name}")

  if tool_name != "AskUserQuestion":
      raise RuntimeError(f"can_use_tool called for unexpected tool: {tool_name}")

Read this code in isolation and it is unremarkable. The type signature is fully annotated. The discriminated return type (PermissionResultAllow | PermissionResultDeny) is precise. The docstring is present. A reviewer scanning a pull request that introduced this function would have nothing to flag.

The behavioral contract, however, is encoded only in the guard on line 68. The author believed (correctly, at the time of writing) that the SDK invoked can_use_tool only when the agent was about to issue an AskUserQuestion tool call. Under that assumption, the raise RuntimeError is not paranoia; it is a defensive assertion that the SDK has not invoked the callback under any other circumstance, and that if it ever does, the program should fail loudly rather than silently grant or deny permission for a tool nobody anticipated.

This assumption was true in SDK versions 0.1.52 and earlier. Between 0.1.52 and 0.1.53 12, the SDK's dispatch contract widened: can_use_tool began firing for every tool invocation, not only AskUserQuestion. The function name did not change. The argument types did not change. The return type did not change. The semantically critical change, namely which events cause this callback to fire, was not visible in any artifact that a type checker could read.

The original commit 0a3b81a3 upgraded the SDK pin. The application started. The first agent turn that invoked any tool — a read_file, a run_bash, anything — caused can_use_tool to fire with that tool's name, the guard on line 68 fired, and the runtime crashed. Commit 65f507a3 reverted the pin and closed the incident.

It is worth pausing on the asymmetry of the guard. Under the old contract, raising RuntimeError on an unexpected tool was correct — it would have indicated a genuine SDK bug or a tool name typo and should have halted execution. Under the new contract, the same line of code is catastrophic — it converts every routine tool call into a fatal exception. The line did not change. Its meaning inverted. This is the defining property of an invisible contract: identical code, identical types, opposite behavior, with no syntactic marker that anything has shifted.

Section 04

The Type Stub Gap

Type stubs cover what stubs can cover: the shape of function signatures, the structure of dataclasses, the variance of generics. They do not cover what the function does, how often it is called, or what keys appear in the dicts it returns. When a stub is incomplete — when a return type is Any, or when the maintainer has not yet annotated a piece of the surface — the user fills in the gap by hand. The mechanism for doing so, by convention in the Python ecosystem, is # type: ignore.

hooks.py:101–104 — type: ignore return
python
# The appropriate types for getting this response to match what is expected
# in HookMatches seem to be missing from the claude-agent-sdk package as of
# version 0.1.22
return response  # type: ignore

The comment is honest. It records the SDK version against which the stub gap was observed and the function's purpose. What it does not record is the behavioral assumption the author made when deciding what response should contain. The # type: ignore silences the type checker, but it does not document the assumption that the SDK will accept whatever shape response happens to have.

This pattern is worth naming. In production Python codebases, # type: ignore on a return or an assignment is a reliable marker for a point at which the codebase steps out of the type system and into informal contract territory. Every # type: ignore is a place where the author privately concluded that they understood the SDK's behavior well enough to bypass the checker, and where their understanding is the only enforcement mechanism. Six months later, when the SDK is bumped and the author has moved teams, the # type: ignore becomes a load-bearing memory.

A reasonable heuristic for codebase audits: treat every # type: ignore as a latent behavioral contract that must be re-validated against the current SDK version on every upgrade. The cost is non-trivial (each marker requires reading the relevant SDK source or changelog), but the cost is precisely the cost of the upgrade, made visible.

Section 05

Untyped Payload Contracts

The second category of invisible contract is payload-shape assumption: the codebase reaches into a dict-shaped value provided by the SDK and pulls out a key by name, without any type-level guarantee that the key exists or has the expected value type.

Several instances appear in the same module. At the session message handler in KellerAI's internal codebase (agent module, line 484):

agent.py:484 — bare subscript access
python
session_id = message.data["session_id"]

The message.data attribute is typed loosely (likely dict[str, Any] or similar) and the code accesses session_id by string key. A nearby comment, around line 479, acknowledges that "something has gone wrong with our assumptions about the session lifecycle" — a near-perfect declaration of an undocumented contract whose violation has already been observed.

At the hooks module of KellerAI's internal codebase (line 116):

hooks.py:116 — .get() payload access
python
tool_name = input_data.get("tool_name")

This is slightly safer, since .get() returns None rather than raising KeyError, but the behavioral assumption is still present: the caller expects that for the relevant hook event, tool_name will be populated with a string. Downstream code that compares tool_name to a literal will silently fail if the key is absent.

At the callbacks module of KellerAI's internal codebase (line 105):

callbacks.py:105 — unguarded subscript
python
input_data["questions"]

This is unguarded subscript access. If the SDK changes the payload schema for the AskUserQuestion permission event — renames the field, nests it under another level, or sends an empty payload during initialization — the line raises KeyError and the agent run halts.

The common thread is stringly-typed dict access on SDK-controlled payloads whose schema is itself an undocumented behavioral contract. The mitigation is straightforward and underused: model SDK payloads as TypedDict with total=False and explicit key existence checks. We return to this in §11.

Section 06

Version Pinning as Behavior Freeze

After the incident, the team made an unusual choice. Rather than fix the underlying assumption — wrap the guard, expand can_use_tool to handle every tool, refactor the callback — they pinned the SDK to the version that worked. From pyproject.toml:16:

pyproject.toml:16 — exact pin
toml
"claude-agent-sdk==0.1.53", # 0.1.53: stable sonnet-only test baseline; 0.1.50+ for RateLimitEvent and Bedrock header fix

The comment documents why the pin floor is at 0.1.50 (a specific feature and a specific bug fix) and why 0.1.53 is the chosen point (a stable test baseline). It does not document why the pin is exact — why ==0.1.53 rather than >=0.1.53,<0.2.0. The answer, which is not written down anywhere, is that 0.1.53 is the last version known to honor the dispatch contract the codebase depends on.

The pin is doing two jobs. As a stability pin, it ensures reproducibility: every CI run, every deploy, every developer's local environment installs the same SDK version. As a behavior freeze, it prevents the silent dispatch widening from striking again. The first job is well understood; the second is invisible. A new engineer reading pyproject.toml sees a normal exact-version pin and has no way to know that bumping it requires re-auditing several files.

This is a category of technical debt the dependency-management ecosystem does not yet name. PEP 440 22 gives us version specifiers; semantic versioning 10 gives us a convention for API compatibility; neither gives us a convention for behavioral compatibility below the patch level. The pin comment documents the minimum version we need for features, but the maximum version we can tolerate is encoded only in the pin's equality operator. If the pin were ever loosened — even accidentally, during a routine dependency cleanup — the incident would recur.

Every SDK release above 0.1.53 is therefore a potential silent breakage spanning multiple files: callbacks.py, hooks.py, agent.py, and pyproject.toml itself. The pin is not a fix; it is a deferral, with no scheduled re-examination.

Section 07

The Accumulation Cost

Each behavioral assumption raises the upgrade tax monotonically. The first time a team writes raise RuntimeError("unexpected tool") in a callback, the audit cost of an SDK upgrade is one site. The second time they write message.data["session_id"], it is two. By the time the codebase has accumulated the four active hazard sites described in this paper, the audit cost of any SDK upgrade is four sites across three files, plus pyproject.toml itself.

| Location | Hazard type | Description |
| --- | --- | --- |
| callbacks.py:68 | can_use_tool dispatch guard | raises RuntimeError on any non-AskUserQuestion tool |
| hooks.py:104 | # type: ignore return | behavioral assumption about response shape bypasses type checker |
| agent.py:302–304 | global environment-variable mutation | contract about mutation order and isolation |
| agent.py:484 | message.data["session_id"] | payload contract: assumes key always present |

This cost compounds because each site requires not only re-reading the local code but also re-reading the SDK changelog or source for the relevant version range, identifying whether any behavioral contract has shifted, and either updating the local code or confirming that no update is needed. In practice, this work is rarely done in advance. It is done — if it is done at all — as part of the incident response after the upgrade has already broken production.

The pin is, in effect, a deferred renegotiation with the SDK author. Every release the team skips is a contract change they have not read. When the renegotiation finally happens — driven by a security CVE, a required feature, or a deprecation deadline — the team must pay the accumulated cost in one transaction. This is the same dynamic that makes long-lived database schema migrations and long-lived OS upgrade cycles more painful per release than frequent, smaller ones. 6

The mitigation is not "upgrade more often" in the abstract; it is to make the upgrade tax visible at the time the assumption is written, so that the cost is amortized rather than spiked. We discuss concrete mechanisms in §11.

Section 08

Why This Is Hard to Test

A reasonable reader will object: surely tests should catch this. They do not, and the reasons are structural.

Unit tests mock the SDK. They have to: invoking the real SDK from a unit test requires a model endpoint, network access, and non-deterministic latency. The mock is constructed by the test author against their understanding of the SDK's behavior — which is to say, against the same mental model that encoded the invisible contract. The mock fires can_use_tool only for AskUserQuestion, because that is what the test author believes the SDK does. The test passes. The contract change does not surface.

The Kai codebase illustrates this exactly. A test file test_kai_callbacks.py exists and contains a test named test_non_ask_user_question_tool_raises that pins the old dispatch contract: it asserts that can_use_tool raises RuntimeError for any non-AskUserQuestion tool. The test is well-written and passes against the current code. What it cannot detect is the SDK-side behavioral change, because the test exercises the application's handler against the application's mental model of the SDK — not against the real SDK's dispatch behavior. After an SDK upgrade that widens dispatch, the test would still pass; it has no way to know the SDK now invokes the callback under additional conditions. The test pins the assumption it should be challenging.

Integration tests can, in principle, run against the real SDK. In practice, they are expensive — slow, flaky, costly to run on every PR — and they are typically scoped to a small number of happy-path flows. The flow that triggers the contract change is a flow the test suite has not been written to cover, because the team did not know it needed to be covered.

Type checking passes by construction: the contract is not in the types. Linters pass: there is no syntactic marker for behavioral assumption. The first real signal of breakage is, almost invariably, a production crash or a careful reading of the SDK changelog. Both are reactive.

Contract testing in the consumer-driven style 23 (Pact and its descendants) offers a partial mitigation: the consumer publishes a contract specifying its expectations, the provider verifies against it, and contract changes become visible at the version boundary. This works well for HTTP APIs and is increasingly used for asynchronous messaging. It has not, to our knowledge, been adapted to in-process SDK callbacks. We treat this as an open area in §11.

Section 09

Industry Patterns

Other language ecosystems have wrestled with the same problem and arrived at partial solutions, none of which fully translate to Python but all of which are instructive.

Java

Uses the JetBrains @Contract annotation 13 to encode a small vocabulary of behavioral contracts directly in source: purity, nullability of inputs and outputs, conditional return values. The annotation is read by IntelliJ's static analyzer and can be exported in formats other tools consume. It does not handle dispatch contracts, but it encodes enough of payload and ordering to be useful in practice. The Java Modeling Language (JML) 14 goes further, offering full design-by-contract preconditions and postconditions, at the cost of more annotation overhead than most teams will accept.

Rust

Does not have a general behavioral contract notation, but uses the unsafe keyword 18 as an explicit marker of behavioral contract territory. Any unsafe block is, by convention, accompanied by a comment specifying the invariants the programmer is asserting. The compiler does not check the invariants, but the marker is mandatory — there is no way to enter unsafe territory accidentally. Trait contracts in Rust are documented in the standard library with conventional # Safety and # Panics sections, which serve as informal but consistently-located behavioral contracts.

Go

Relies almost entirely on interface documentation conventions. The standard library's io.Reader documentation 18 is the canonical example: the type signature is trivial (Read(p []byte) (n int, err error)), but the documentation specifies six distinct behavioral contracts — partial reads, EOF semantics, error-and-bytes interaction, concurrency, buffer mutation, and idempotency. The contracts are enforced only by convention, but they are uniformly located: every interface in the Go ecosystem is expected to document them in the same way.

gRPC

Encodes behavioral expectations in proto file comments 13, with conventional locations for streaming semantics, deadline behavior, and idempotency. The proto compiler does not enforce them, but tooling such as buf 19 can lint for their presence.

The gap in Python is structural. There is no standard place — below the level of typing.Protocol definitions, and Protocol does not encode behavior either — to record a behavioral contract such that it is consistently located, machine-readable, and at least lint-checkable for presence. PEP 727 8 gestures in this direction but is concerned with rendered documentation rather than enforceable contract. Libraries like icontract 14 and deal 14 offer design-by-contract runtime checking, but they require explicit precondition and postcondition expressions and are not widely adopted.

The practical implication is that Python codebases must invent their own convention. The convention this paper proposes, building on the Rust # Safety pattern and the Go interface-doc pattern, is described in §11.

Section 10

Detection Patterns

Before mitigation, detection. Five signals reliably indicate the presence of an invisible behavioral contract in Python code that integrates with an SDK. None of them is sufficient on its own, but in combination they catch the great majority of real cases.

1

# type: ignore on a return statement or assignment of an SDK-derived value.

The marker indicates that the type system has been overridden in favor of the author's behavioral knowledge. Every such marker is a latent contract. See the hooks module of KellerAI's internal codebase (line 104) for the canonical case.

2

Bare dict subscript or .get() access on an SDK-typed object.

The pattern message.data["field"] or input_data.get("tool_name") indicates a payload contract not encoded in the type system. The author is asserting, without compiler support, that the named key will be present and of the expected type.

3

raise of a non-domain exception inside an SDK callback.

Constructions of the form raise RuntimeError("unexpected ...") inside a function invoked by the SDK rather than by application code are nearly always a defensive assertion against a dispatch contract. The exception is the author's machine-readable expression of "the SDK should not ever call me this way".

4

Exact-version dependency pins on SDKs that are pre-1.0 or otherwise unstable.

A pin of the form package==0.1.53 with no upper bound comment is suspicious. The pin is doing semantic work — usually a behavior freeze — that is not expressible in the version specifier. Audit pins of this shape carefully.

5

Natural-language hedges in code comments.

Phrases such as "this should be fine", "we assume", "as far as I can tell", and "this is currently true" are author confessions of unverified behavioral assumption. A comment near line 434 of the agent module in KellerAI's internal codebase — "this should be fine since we are likely only confined to Claude running here" — is a textbook instance.

A simple grep-based audit using these five signals can be run on any codebase in under a minute. The signals are noisy — some # type: ignore markers are legitimate, some hedges are well-considered — but they are an effective starting point. They convert a category of risk that is otherwise invisible into a finite list of lines to read.
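As one way to run such an audit, the sketch below scans source text for the five signals with regular expressions. The patterns, the signal names, and the helpers `scan_text` and `scan_tree` are illustrative assumptions rather than an established tool. They will produce false positives: the raise pattern, for instance, cannot tell whether it sits inside an SDK callback.

```python
import re
from pathlib import Path

# Hypothetical patterns, one per signal; expect noise and tune for your codebase.
CODE_SIGNALS = {
    "type-ignore": re.compile(r"#\s*type:\s*ignore"),
    "dict-access": re.compile(r"""\w+(?:\.\w+)*(?:\[["'][^"']+["']\]|\.get\(["'])"""),
    "generic-raise": re.compile(r"\braise\s+(?:RuntimeError|AssertionError|Exception)\b"),
    "hedge": re.compile(
        r"#.*\b(?:should be fine|we assume|as far as I can tell|currently true)\b",
        re.IGNORECASE,
    ),
}
# Exact pins are only looked for in dependency files, where `==` is a specifier.
PIN_SIGNALS = {"exact-pin": re.compile(r"[A-Za-z0-9_.\-]+==\d[\w.]*")}


def scan_text(text, path="<memory>", dependency_file=False):
    """Return (path, line number, signal name) for each matching line."""
    signals = PIN_SIGNALS if dependency_file else CODE_SIGNALS
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in signals.items():
            if pattern.search(line):
                findings.append((path, lineno, name))
    return findings


def scan_tree(root):
    """Scan every .py file and every pyproject.toml under root."""
    findings = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix == ".py":
            text = path.read_text(encoding="utf-8", errors="replace")
            findings += scan_text(text, str(path))
        elif path.name == "pyproject.toml":
            text = path.read_text(encoding="utf-8", errors="replace")
            findings += scan_text(text, str(path), dependency_file=True)
    return findings
```

Calling `scan_tree(".")` from the repository root and printing each tuple gives the list of lines to read. Which of those lines are genuine contracts remains a human judgment.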

Section 11

Mitigation Strategies

Detection without mitigation is just inventory. Five concrete mitigations follow.

01

Explicit guard with log-and-alert, not crash.

The can_use_tool guard at callbacks.py:68 should not raise RuntimeError. It should log at error level, emit a metric or alert, and return a safe default (typically deny) for any unexpected tool. The application continues to run; the operator gets a signal; the contract violation is observable without being fatal. The fix is two lines.
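A minimal sketch of that guard follows, under stated assumptions: the two result classes are local stand-ins so the example runs without the SDK installed, the `unexpected_tools` counter stands in for whatever metric or alerting client the team uses, and the AskUserQuestion branch is reduced to a placeholder.

```python
import asyncio
import logging
from collections import Counter
from dataclasses import dataclass

logger = logging.getLogger("agent.callbacks")


# Hypothetical stand-ins for the SDK's permission result types.
@dataclass
class PermissionResultAllow:
    behavior: str = "allow"


@dataclass
class PermissionResultDeny:
    message: str = ""
    behavior: str = "deny"


# Stand-in for a real metrics or alerting client.
unexpected_tools = Counter()


async def can_use_tool(tool_name: str, input_data: dict, context: object = None):
    """Log and deny on an unexpected tool instead of raising."""
    if tool_name != "AskUserQuestion":
        unexpected_tools[tool_name] += 1
        logger.error("can_use_tool called for unexpected tool: %s", tool_name)
        return PermissionResultDeny(message=f"no permission policy for tool {tool_name!r}")
    # Placeholder: the real callback interrupts and later delivers answers here.
    return PermissionResultAllow()


result = asyncio.run(can_use_tool("read_file", {}))
print(result.behavior, dict(unexpected_tools))
```

Denying by default keeps the process alive and leaves a signal for the operator. Whether deny is the right default for tools the agent legitimately needs is a policy decision this sketch does not settle.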

02

TypedDict with explicit key checks.

Model every SDK payload as a TypedDict with total=False, and access fields through a typed wrapper that performs explicit key existence checks and raises a domain-specific exception with a clear message when a key is missing. The wrapper is where the payload contract is encoded; if the SDK changes the schema, the wrapper is the single point of update.
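The wrapper might look like the sketch below. Only the `questions` key comes from the case study; its element type, the `AskUserQuestionInput` and `PayloadContractError` names, and the `read_questions` helper are assumptions made for illustration.

```python
from typing import Any, TypedDict


class AskUserQuestionInput(TypedDict, total=False):
    # Assumed shape: a list of question dicts. Adjust to the real payload.
    questions: list[dict[str, Any]]


class PayloadContractError(Exception):
    """An SDK payload did not match the shape this codebase relies on."""


def read_questions(input_data: AskUserQuestionInput) -> list[dict[str, Any]]:
    """Single place where the 'questions' payload contract is checked."""
    if "questions" not in input_data:
        raise PayloadContractError(
            f"AskUserQuestion payload is missing 'questions'; keys present: {sorted(input_data)}"
        )
    questions = input_data["questions"]
    if not isinstance(questions, list):
        raise PayloadContractError(
            f"'questions' should be a list, got {type(questions).__name__}"
        )
    return questions


print(read_questions({"questions": [{"question": "Proceed?"}]}))
```

With `total=False` the type checker treats every key as optional, so the explicit membership check is what turns a schema change into a named, searchable error rather than a bare KeyError.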

03

Behavioral contract tests per callback.

For every SDK callback, write a test that exercises the callback against a mock SDK that simulates every known dispatch case. For can_use_tool, that means a test for every tool the agent can call, not only AskUserQuestion. The test asserts that the callback behaves correctly under each dispatch case. If the SDK ever widens the dispatch contract, the test is updated; if the application's assumption was wrong, the test fails immediately.
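A sketch of such a test using only the standard library is shown below. The tool list, the simplified string-returning `can_use_tool`, and the `simulate_wide_dispatch` mock are all hypothetical. As §08 notes, a mock like this encodes the author's model of the SDK, so it checks the handler under each known dispatch case and does not by itself detect an SDK-side change.

```python
import asyncio
import unittest

# Hypothetical tool list; populate it from the agent's real tool configuration.
KNOWN_TOOLS = ["AskUserQuestion", "read_file", "run_bash"]


async def can_use_tool(tool_name: str, input_data: dict) -> str:
    """Simplified stand-in for the callback under test."""
    if tool_name == "AskUserQuestion":
        return "allow"
    return "deny"  # log-and-deny rather than raise


async def simulate_wide_dispatch(callback, tool_names):
    """Mock SDK that invokes the callback once for every tool name."""
    return {name: await callback(name, {}) for name in tool_names}


class CanUseToolDispatchContract(unittest.TestCase):
    def test_every_known_tool_gets_a_decision(self):
        results = asyncio.run(simulate_wide_dispatch(can_use_tool, KNOWN_TOOLS))
        for name in KNOWN_TOOLS:
            with self.subTest(tool=name):
                self.assertIn(results[name], {"allow", "deny"})

    def test_only_ask_user_question_is_allowed(self):
        results = asyncio.run(simulate_wide_dispatch(can_use_tool, KNOWN_TOOLS))
        allowed = [name for name, decision in results.items() if decision == "allow"]
        self.assertEqual(allowed, ["AskUserQuestion"])


def run_contract_tests() -> bool:
    """Run the contract tests and report whether they all passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CanUseToolDispatchContract)
    return unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()
```

The design point is that the test enumerates dispatch cases explicitly, so adding a tool to `KNOWN_TOOLS` forces a decision about how the callback should treat it.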

04

CHANGELOG-BEHAVIORAL.md separate from the dependency changelog.

Maintain a project-local file that records every behavioral assumption the codebase makes about its SDK dependencies, the version range over which the assumption is known to hold, and the location in the codebase where the assumption is encoded. Update the file in the same pull request that adds or modifies the assumption. The file is the artifact that future engineers — and future versions of yourself — will read before bumping a pin.

05

Upper-bound pins with an audit annotation.

Replace exact pins on behaviorally-sensitive dependencies with bounded ranges, and annotate the upper bound with the audit requirement. The bound is the technical mechanism. The comment is the social mechanism: it tells the next engineer that the bump is not routine and points them at the document that records why.

Mitigation 5 — upper-bound pin with audit annotation
toml
"claude-agent-sdk>=0.1.53,<0.2.0",  # behavioral-contract audit required before bumping; see CHANGELOG-BEHAVIORAL.md §can_use_tool

Status in the case-study codebase

The Kai codebase has applied these mitigations only partially. The table is included not to criticize the team but to show the gap that is typical of production codebases — including ones whose engineers are aware of the problem.

| Mitigation | Status | Notes |
| --- | --- | --- |
| 1. Log-and-alert guard | Not applied | callbacks.py:68 still raises RuntimeError; the SDK upgrade (0a3b81a3) was reverted (65f507a3) and the guard was left unchanged. |
| 2. TypedDict payloads | Not applied | input_data["questions"] (callbacks.py:105) and message.data["session_id"] (agent.py:484) remain stringly-typed. |
| 3. Per-dispatch callback tests | Not applied | test_kai_callbacks.py::test_non_ask_user_question_tool_raises pins the old contract rather than testing each dispatch case. |
| 4. CHANGELOG-BEHAVIORAL.md | Not applied | No such file exists in the repository. |
| 5. Upper-bound pin with audit annotation | Partial | pyproject.toml:16 is an exact pin (==0.1.53) with a feature-floor comment but no upper-bound audit annotation. |
Section 12

The General Pattern

can_use_tool is one instance of a universal pattern. Any function that is invoked by a framework rather than by the application — any callback, hook, event handler, signal receiver, plugin entry point, or middleware — has behavioral contracts that exceed its type signature. The contracts are not specific to LLM SDKs, to Python, or to any particular framework. They are the cost of inversion of control. 18

The taxonomy from §02 generalizes:

01
Dispatch contracts govern when the callback fires. can_use_tool is a dispatch contract. So is "this Django signal fires after save() but before post_save()", and so is "this pytest plugin hook fires once per test session, not once per test".
02
Payload contracts govern what shape the callback receives. The input_data["questions"] access is a payload contract. So is the structure of a React useEffect cleanup function's expected return value, or the keys in a SQLAlchemy before_flush event's flush_context.
03
Ordering contracts govern the sequence across multiple callbacks. "session-start fires before any message" is an ordering contract. So is "pre-request middleware runs in reverse order of post-request middleware" in many web frameworks.
04
Idempotency and frequency contracts govern how often and how repeatably the callback can be called. "This callback may be called multiple times for the same logical event" is a contract that React hooks document explicitly and most other ecosystems leave implicit.
05
Side-effect contractsgovern what the callback is allowed to do. "Do not call set_state from a render function" is the canonical React example.
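
The ordering and frequency entries in the taxonomy are the ones least often checked at runtime, so a small illustration may help. What follows is a minimal sketch, not code from the Kai codebase: ContractMonitor, on_session_start, on_message, and the session and message identifiers are all hypothetical names, and the two contracts it checks are the examples quoted in the list above.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("contract_monitor")


@dataclass
class ContractMonitor:
    """Checks two assumed contracts at the callback boundary (hypothetical).

    Ordering:  session-start fires before any message for that session.
    Frequency: a message callback may fire more than once per message id.
    """

    started: set[str] = field(default_factory=set)
    seen_messages: set[tuple[str, str]] = field(default_factory=set)
    violations: list[str] = field(default_factory=list)

    def on_session_start(self, session_id: str) -> None:
        self.started.add(session_id)

    def on_message(self, session_id: str, message_id: str) -> bool:
        """Return True if the message should be processed."""
        if session_id not in self.started:
            # Ordering contract violated: record and alert, do not crash.
            note = f"message {message_id} arrived before session-start for {session_id}"
            self.violations.append(note)
            logger.warning("ordering contract violated: %s", note)
            self.started.add(session_id)  # recover so later messages proceed
        key = (session_id, message_id)
        if key in self.seen_messages:
            # Frequency contract: duplicate delivery is tolerated, not reprocessed.
            return False
        self.seen_messages.add(key)
        return True
```

The design choice worth noting is that a violated assumption is recorded and logged rather than raised. The application keeps running, and the violations list gives an alerting hook something concrete to report.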

The pattern applies to pytest plugins 11, Django signals, React hooks, SQLAlchemy events, Flask middleware, and many others. Each ecosystem has evolved its own conventions for documenting these contracts, ranging from thorough (React's exhaustive hook rules) to nearly absent (most Python SDKs). The general lesson is that every inversion-of-control surface needs a behavioral contract notation, and the lack of one is a recurring source of production incidents.
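
One possible shape for such a notation, on the consumer side, is a declaration attached to the callback itself. The sketch below is an assumption throughout: BehavioralContract, behavioral_contract, the field names, and the on_tool handler are invented for illustration and are not part of any SDK. It covers only the dispatch and payload contract types, and it reports violations through logging instead of raising.

```python
import functools
import logging
from dataclasses import dataclass
from typing import Any, Callable

logger = logging.getLogger("behavioral_contract")

Handler = Callable[[str, dict[str, Any]], Any]


@dataclass
class BehavioralContract:
    """Hypothetical notation: what the application assumes about a callback."""

    fires_for: frozenset[str]                 # dispatch: event names expected
    required_keys: dict[str, frozenset[str]]  # payload: keys expected per event
    assumed_as_of: str                        # SDK version the assumption was checked against


def behavioral_contract(contract: BehavioralContract) -> Callable[[Handler], Handler]:
    """Attach a contract to a callback and report violations without raising."""

    def decorate(func: Handler) -> Handler:
        @functools.wraps(func)
        def wrapper(event: str, payload: dict[str, Any]) -> Any:
            if event not in contract.fires_for:
                logger.warning(
                    "dispatch outside contract: %s (assumed as of %s)",
                    event, contract.assumed_as_of,
                )
            missing = contract.required_keys.get(event, frozenset()).difference(payload)
            if missing:
                logger.warning("payload for %s missing keys: %s", event, sorted(missing))
            return func(event, payload)

        wrapper.__contract__ = contract  # introspectable by tests and audits
        return wrapper

    return decorate


# Usage: the assumption from the case study, written down next to the code.
@behavioral_contract(BehavioralContract(
    fires_for=frozenset({"AskUserQuestion"}),
    required_keys={"AskUserQuestion": frozenset({"questions"})},
    assumed_as_of="0.1.52",
))
def on_tool(event: str, payload: dict[str, Any]) -> str:
    return "handled"
```

Because the contract is stored on the function as `__contract__`, a test or an upgrade checklist can enumerate every assumption the application makes before a pin is bumped.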

Section 13

Implications for SDK Authors

This paper has been written from the consumer's perspective, but the producer side deserves explicit treatment. SDK authors who change behavioral semantics within a single major or even minor version, without documenting the change where consumers will look, are violating an implicit but widely held industry norm. The risk is highest in 0.x, where many libraries live indefinitely.

The norm is not "do not change behavior". Behavior must change; bugs get fixed; design improves. The norm is: if you change behavior, say so where consumers will see it, in terms consumers can act on. The Claude Agent SDK change from 0.1.52 to 0.1.53 widened the dispatch of can_use_tool from a single tool to all tools. That is a substantial behavioral change. A consumer reading only the release notes might or might not have noticed it; a consumer reading only the type signatures would not have noticed at all.

Three concrete recommendations:

01
Behavioral changelog sections. Separate behavioral changes from API changes in release notes. An API change is "added a new optional parameter"; a behavioral change is "this callback now fires under additional conditions". The two have different review obligations for consumers.
02
Semver behavioral extension. Adopt or propose a convention — perhaps a fourth segment, or a marker in the patch field — for releases that change behavior without changing API surface. The community has experimented with this for ABI compatibility; behavioral compatibility deserves the same treatment.
03
Explicit migration guides for dispatch changes. When a callback's dispatch contract widens or narrows, the SDK should publish a short migration guide naming the affected callbacks and showing a before/after example of the consumer code that needs to change.
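
For the third recommendation, the before/after pair in a migration guide could be as short as the sketch below. It is illustrative only: Allow and Deny are hypothetical stand-ins for an SDK's permission result types, the two-argument callback signature is simplified, and allowing every other tool in the "after" version is this sketch's policy choice, where an application might reasonably deny by default instead.

```python
from dataclasses import dataclass
from typing import Any, Union


# Hypothetical stand-ins for an SDK's permission results, so the sketch
# runs without any SDK installed.
@dataclass
class Allow:
    pass


@dataclass
class Deny:
    message: str


Decision = Union[Allow, Deny]


# BEFORE: written against a contract where the callback fires only for
# AskUserQuestion. Any other tool name is treated as an impossible state.
def can_use_tool_before(tool_name: str, input_data: dict[str, Any]) -> Decision:
    if tool_name != "AskUserQuestion":
        raise RuntimeError(f"unexpected tool: {tool_name}")
    return Allow()


# AFTER: written against the widened contract, where the callback fires
# for every tool. Other tools get an explicit decision instead of a crash.
def can_use_tool_after(tool_name: str, input_data: dict[str, Any]) -> Decision:
    if tool_name == "AskUserQuestion":
        if "questions" not in input_data:
            return Deny("AskUserQuestion payload has no 'questions' key")
        return Allow()
    return Allow()  # policy choice for this sketch, not a recommendation
```

A guide in this form names the affected callback, shows the line that encodes the old assumption, and shows the decision the consumer now has to make for the newly dispatched cases.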

These are recommendations, not prescriptions. SDK authors are doing work that is in many ways harder than consumer work, and the bar for additional process should be set carefully. But the consumer cost of not doing this work is the kind of incident this paper has used as its running example, multiplied across every consumer of the affected SDK.

Section 14

Conclusion

We return to the opening scene. The deployment went out. The agent crashed. Commit 65f507a3 reverted the SDK pin. The incident closed in hours.

The fix was small and effective. It was also not a fix. The underlying behavioral assumption (that can_use_tool fires only for AskUserQuestion) was not removed, not documented, not tested, and not made visible. It was preserved in amber, expressed as the equality operator in a pyproject.toml pin. This was not amnesia: the team had encountered the assumption, attempted a forward fix in 0a3b81a3, and chosen, eight days later, to revert and freeze the version instead. The pin is the deliberate codification of a deferral. The next engineer to bump that pin, weeks or months from now, will rediscover the assumption in the same way the team did the first time: through a production crash.

The real fix is to surface the contract: to write down, in code and in tests and in documentation, what the application is assuming about the SDK it depends on, so that the next time that assumption is violated, the violation is caught at the boundary rather than in production.

The cost is modest. The five mitigations in §11 take, in aggregate, perhaps a day of focused work for a codebase of the size we have discussed. The cost of not doing it is unbounded, paid in increments, on schedules chosen by the SDK author rather than by the consumer.

Behavioral contracts exist whether or not you write them down. Writing them down is the cheapest production-grade engineering investment available, and the one most consistently neglected. The assumption you didn't write down is the production incident you haven't had yet.

References
  1. Raemaekers, S., Visser, J., & Bouwers, E. (2017). Putting the Semantics into Semantic Versioning. arXiv:2008.07069 [cs.SE].
  2. Robbes, R., Lungu, M., & Röthlisberger, D. (2012). A Study on Behavioral Backward Incompatibilities of Java Software Libraries. ACM/IEEE ICSE. DOI: 10.1145/3092703.3092721.
  3. Hora, A., Valente, M. T., & Robbes, R. (2020). Taming behavioral backward incompatibilities via cross-project testing and analysis. ACM ESEC/FSE.
  4. Jiang, Z., Shao, S., & Lo, D. (2023). Understanding the Impact of APIs Behavioral Breaking Changes on Clients. IEEE TSE.
  5. Sun, Y., Li, X., Zhao, Y., Chen, B., & Zhu, Y. (2024). A Large-scale Investigation of Semantically Incompatible APIs behind Compatibility Issues in Android Apps. IEEE TSE.
  6. Lin, J., Liu, J., Jing, N., Zhou, Y., & Li, Y. (2024). Pinning Is Futile: Why and How Version Pinning Fails. arXiv:2502.06662 [cs.SE].
  7. Liu, K., Passarella, G., Kim, M., & Muralidharan, V. (2024). Test Before You Deploy: Governing Updates in the LLM Supply Chain. arXiv:2604.27789 [cs.SE].
  8. Levkivskyi, I., Lehtosalo, J., & Langa, Ł. (2019). PEP 544 – Protocols: Structural subtyping (static duck typing). Python Enhancement Proposal. peps.python.org/pep-0544
  9. Goodman, M. D., & Levkivskyi, I. (2020). PEP 589 – TypedDict: Type Hints for Dictionaries with a Fixed Set of Keys. Python Enhancement Proposal. peps.python.org/pep-0589
  10. Preston-Werner, T. (2013). Semantic Versioning 2.0.0 Specification. semver.org
  11. Anthropic. Agent SDK Reference – Python. Claude Agent SDK Documentation. docs.anthropic.com/en/docs/claude-code/sdk/sdk-python
  12. Anthropic. Claude Agent SDK Changelog v0.1.50–v0.2.82. GitHub Releases. github.com/anthropics/claude-agent-sdk-python/releases
  13. Sorensen, C., Sorensen, J., & Theriault, D. (2021). What Kinds of Contracts Do ML APIs Need? FSE 2023 (arXiv:2307.14465).
  14. Tiwari, P., Xie, H., & Sha, L. (2024). Agent Behavioral Contracts: Formal Specification and Runtime Enforcement for Reliable Autonomous AI Agents. arXiv:2602.22302 [cs.SE].
  15. Hejlsberg, A. et al. (2023). Toward a Corpus Study of the Dynamic Gradual Type. arXiv:2503.08928 [cs.PL].
  16. Dutta, K., Arora, N., & Namin, A. S. (2024). On the Freshness of Pinned Dependencies in Maven. arXiv:2510.22815 [cs.SE].
  17. Jafari, M., Pourrezaei, M., & Lo, D. (2021). Characterizing Dependency Update Practice of NPM, PyPI and Cargo Packages. arXiv:2403.17382 [cs.SE].
  18. Martin Fowler. Published Interface. Fowler's Bliki. martinfowler.com/bliki/PublishedInterface.html
  19. Martin Fowler. Parallel Change (Expand and Contract). Fowler's Bliki. martinfowler.com/bliki/ParallelChange.html
  20. Spotify Engineering. Fleet Management at Spotify (Part 3): Fleet-wide Refactoring. Spotify Engineering Blog. engineering.atspotify.com/2023/05/fleet-management-at-spotify-part-3-fleet-wide-refactoring
  21. Meyer, B. Object-Oriented Software Construction, 2nd ed. Prentice Hall, 1997. Original formulation of design by contract.
  22. PEP 440 – Version Identification and Dependency Specification. Python Software Foundation, 2014. peps.python.org/pep-0440
  23. Robinson, I. Consumer-Driven Contracts: A Service Evolution Pattern. martinfowler.com, 2006. martinfowler.com/articles/consumerDrivenContracts.html