Executive Summary
On a routine Tuesday, an SDK upgrade went out. Type checks passed. The full unit-test suite passed. Linters were clean. CI was green. The deployment promoted to production, and within minutes the agent runtime started crashing on every conversation that issued a tool call. The crash signature was a RuntimeError raised from a callback the team had written months earlier and not touched since. The fix, recorded as commit 65f507a3, was a one-line revert. The incident closed in hours. The assumption that caused it remained, expressed not as a comment or a test, but as a version pin in pyproject.toml.
The assumption was this: that the SDK's can_use_tool callback fires only when the agent invokes AskUserQuestion. Under SDK version 0.1.52 and earlier, that assumption was true. Under 0.1.53 and later, the same callback fires for every tool the agent calls. The type signature did not change. The function name did not change. The argument shapes did not change. What changed was the behavioral dispatch contract, and Python's type system does not encode behavioral dispatch contracts. There was no place for the change to be detected automatically, because there was no place where the original behavior had been written down.
This paper argues that behavioral contracts — the dispatch, payload, ordering, and frequency assumptions that surround every SDK callback — are production hazards that are systematically invisible to type checkers, linters, and unit tests. We use the can_use_tool incident as a running case study, walk through five related patterns in the same codebase, survey how other language ecosystems handle the same problem, and propose detection and mitigation patterns that surface invisible contracts before they reach production.
A behavioral assumption that lives only in your head is a production incident with no fixed date.
The Silent Breakage Problem
The phrase "behavioral contract" is older than the SDKs we use today, but it has narrowed in the type-system literature to mean roughly what design-by-contract advocates 21 meant by it: the preconditions, postconditions, and invariants that a function promises but does not encode in its signature. We use it here in a broader, more operational sense: a behavioral contract is any property of a function or callback that callers depend on, that is not expressible in the type system the language gives them.
A type contract says: this function takes a str and a dict and returns a PermissionResult. A behavioral contract says: this function is invoked exactly once per user-facing question, only when the agent decides to ask, and never for ordinary tool calls. The first is checkable. The second is folklore.
Python's type system, even with the modern type-hint stack, enforces shape and not dispatch. A Callable[[str, dict, ToolPermissionContext], Awaitable[PermissionResult]] describes what arguments arrive and what comes back. It does not describe when the callable is invoked, which tools cause it to fire, how often it can be called within a single agent turn, or what keys the dict is expected to contain. Each of those is a separate, undocumented contract negotiated between the SDK author and the SDK user, and renegotiated — often silently — at every release.
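The distinction between shape and dispatch can be sketched in a few lines. The two dispatch loops below, dispatch_narrow and dispatch_wide, are hypothetical stand-ins for two SDK releases and are not taken from the real SDK; the callback type is also simplified (no context argument, a plain string result). Both loops accept the same Callable type, so a type checker has nothing to distinguish them.

```python
import asyncio
from typing import Awaitable, Callable

# Simplified stand-in for the SDK callback type; the real signature also
# carries a ToolPermissionContext and returns a PermissionResult.
PermissionCallback = Callable[[str, dict], Awaitable[str]]


async def dispatch_narrow(callback: PermissionCallback, tools: list[str]) -> list[str]:
    """Hypothetical older behavior: consult the callback only for AskUserQuestion."""
    results = []
    for tool in tools:
        if tool == "AskUserQuestion":
            results.append(await callback(tool, {}))
    return results


async def dispatch_wide(callback: PermissionCallback, tools: list[str]) -> list[str]:
    """Hypothetical newer behavior: consult the callback for every tool."""
    return [await callback(tool, {}) for tool in tools]


async def can_use_tool(tool_name: str, input_data: dict) -> str:
    # Same guard as the case study: correct under narrow dispatch only.
    if tool_name != "AskUserQuestion":
        raise RuntimeError(f"can_use_tool called for unexpected tool: {tool_name}")
    return "allow"


def run_turn(dispatcher, tools: list[str]) -> str:
    """Run one simulated agent turn and report the outcome as a string."""
    try:
        results = asyncio.run(dispatcher(can_use_tool, tools))
    except RuntimeError as exc:
        return f"crash: {exc}"
    return f"ok: {results}"
```

Calling run_turn with the same tool list against each dispatcher gives a successful turn in one case and a crash in the other, even though the callback and its annotations are identical in both.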
A taxonomy of behavioral contracts
- 01. Dispatch contract. Which events or conditions cause the callback to fire. The can_use_tool change is a dispatch contract change.
- 02. Payload contract. Which keys, of which types, will be present in dict-shaped arguments. Stringly-typed input_data["questions"] access is a payload contract.
- 03. Ordering contract. The sequence in which a series of callbacks fires for a single logical operation: for example, "pre-tool" before "post-tool", or "session-start" before any "message" callback.
- 04. Calling-frequency contract. How many times a callback can fire per logical operation. Idempotency assumptions live here.
- 05. Authorization-delegation contract. Which downstream component is the authoritative authorizer when a callback is reached. A change to which hook owns which tool, invisible at the callback's type boundary, silently re-routes authorization.
None of these contracts are checkable by mypy, pyright, ruff, or any production-grade Python tool we are aware of. They are checkable, in principle, by integration tests that exercise the real SDK against real inputs — but only if the test author knew the contract existed in the first place. The contracts that hurt are the ones nobody wrote down, because nobody noticed they were making them.
Anatomy of an Invisible Contract
The clearest example in the codebase is the can_use_tool callback registered with the Claude Agent SDK 12. The relevant code lives in the agent callbacks module of KellerAI's internal codebase (lines 58–69):
callbacks.py:58–69 (can_use_tool)

```python
async def can_use_tool(
    self,
    tool_name: str,
    input_data: dict,
    context: ToolPermissionContext,
) -> PermissionResultAllow | PermissionResultDeny:
    """SDK callback. Interrupts on first call, delivers answers on re-call."""
    del context
    logger.debug(f"can_use_tool called for {tool_name}")
    if tool_name != "AskUserQuestion":
        raise RuntimeError(f"can_use_tool called for unexpected tool: {tool_name}")
```

Read this code in isolation and it is unremarkable. The type signature is fully annotated. The discriminated return type (PermissionResultAllow | PermissionResultDeny) is precise. The docstring is present. A reviewer scanning a pull request that introduced this function would have nothing to flag.
The behavioral contract, however, is encoded only in the guard on line 68. The author believed (correctly, at the time of writing) that the SDK invoked can_use_tool only when the agent was about to issue an AskUserQuestion tool call. Under that assumption, the raise RuntimeError is not paranoia; it is a defensive assertion that the SDK has not invoked the callback under any other circumstance, and that if it ever does, the program should fail loudly rather than silently grant or deny permission for a tool nobody anticipated.
This assumption was true in SDK versions 0.1.52 and earlier. Between 0.1.52 and 0.1.53 12, the SDK's dispatch contract widened: can_use_tool began firing for every tool invocation, not only AskUserQuestion. The function name did not change. The argument types did not change. The return type did not change. The semantically critical change (which events cause this callback to fire) was not visible in any artifact that a type checker could read.
The original commit 0a3b81a3 upgraded the SDK pin. The application started. The first agent turn that invoked any tool — a read_file, a run_bash, anything — caused can_use_tool to fire with that tool's name, the guard on line 68 fired, and the runtime crashed. Commit 65f507a3 reverted the pin and closed the incident.
It is worth pausing on the asymmetry of the guard. Under the old contract, raising RuntimeError on an unexpected tool was correct — it would have indicated a genuine SDK bug or a tool name typo and should have halted execution. Under the new contract, the same line of code is catastrophic — it converts every routine tool call into a fatal exception. The line did not change. Its meaning inverted. This is the defining property of an invisible contract: identical code, identical types, opposite behavior, with no syntactic marker that anything has shifted.
The Type Stub Gap
Type stubs cover what stubs can cover: the shape of function signatures, the structure of dataclasses, the variance of generics. They do not cover what the function does, how often it is called, or what keys appear in the dicts it returns. When a stub is incomplete — when a return type is Any, or when the maintainer has not yet annotated a piece of the surface — the user fills in the gap by hand. The mechanism for doing so, by convention in the Python ecosystem, is # type: ignore.
hooks.py:101–104 (type: ignore return)

```python
# The appropriate types for getting this response to match what is expected
# in HookMatches seem to be missing from the claude-agent-sdk package as of
# version 0.1.22
return response  # type: ignore
```
The comment is honest. It records the SDK version against which the stub gap was observed and the function's purpose. What it does not record is the behavioral assumption the author made when deciding what response should contain. The # type: ignore silences the type checker, but it does not document the assumption that the SDK will accept whatever shape response happens to have.
This pattern is worth naming. In production Python codebases, # type: ignore on a return or an assignment is a reliable marker for a point at which the codebase steps out of the type system and into informal contract territory. Every # type: ignore is a place where the author privately concluded that they understood the SDK's behavior well enough to bypass the checker, and where their understanding is the only enforcement mechanism. Six months later, when the SDK is bumped and the author has moved teams, the # type: ignore becomes a load-bearing memory.
A reasonable heuristic for codebase audits: treat every # type: ignore as a latent behavioral contract that must be re-validated against the current SDK version on every upgrade. The cost is non-trivial, because each marker requires reading the relevant SDK source or changelog, but the cost is precisely the cost of the upgrade, made visible.
Untyped Payload Contracts
The second category of invisible contract is payload-shape assumption: the codebase reaches into a dict-shaped value provided by the SDK and pulls out a key by name, without any type-level guarantee that the key exists or has the expected value type.
Several instances appear in the same module. At the session message handler in KellerAI's internal codebase (agent module, line 484):
agent.py:484 (bare subscript access)

```python
session_id = message.data["session_id"]
```
The message.data attribute is typed loosely (likely dict[str, Any] or similar) and the code accesses session_id by string key. A nearby comment, around line 479, acknowledges that "something has gone wrong with our assumptions about the session lifecycle" — a near-perfect declaration of an undocumented contract whose violation has already been observed.
At the hooks module of KellerAI's internal codebase (line 116):
hooks.py:116 (.get() payload access)

```python
tool_name = input_data.get("tool_name")
```

This is slightly safer, since .get() returns None rather than raising KeyError, but the behavioral assumption is still present: the caller expects that for the relevant hook event, tool_name will be populated with a string. Downstream code that compares tool_name to a literal will silently fail if the key is absent.

At the callbacks module of KellerAI's internal codebase (line 105):
callbacks.py:105 (unguarded subscript)

```python
input_data["questions"]
```
This is unguarded subscript access. If the SDK changes the payload schema for the AskUserQuestion permission event — renames the field, nests it under another level, or sends an empty payload during initialization — the line raises KeyError and the agent run halts.
The common thread is stringly-typed dict access on SDK-controlled payloads whose schema is itself an undocumented behavioral contract. The mitigation is straightforward and underused: model SDK payloads as TypedDict with total=False and explicit key existence checks. We return to this in §11.
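The difference between the access styles is easiest to see against a payload whose keys have drifted. In the sketch below the two payload dicts are invented for illustration and are not real SDK messages; the helper names are likewise hypothetical.

```python
def read_session_id(data: dict) -> str:
    # Bare subscript: a missing key fails loudly with KeyError.
    return data["session_id"]


def is_ask_user_question(input_data: dict) -> bool:
    # .get(): a missing key yields None, so the comparison is quietly False.
    tool_name = input_data.get("tool_name")
    return tool_name == "AskUserQuestion"


# Hypothetical payloads: the second simulates a schema where keys were renamed.
expected = {"session_id": "abc123", "tool_name": "AskUserQuestion"}
drifted = {"sessionId": "abc123", "toolName": "AskUserQuestion"}


def outcome(fn, payload: dict) -> str:
    """Describe what happens when fn reads payload."""
    try:
        return repr(fn(payload))
    except KeyError as exc:
        return f"KeyError({exc})"
```

Against the drifted payload, the subscript version stops the run with a KeyError while the .get() version keeps going and takes the wrong branch. Neither outcome tells the operator that a payload contract changed.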
Version Pinning as Behavior Freeze
After the incident, the team made an unusual choice. Rather than fix the underlying assumption — wrap the guard, expand can_use_tool to handle every tool, refactor the callback — they pinned the SDK to the version that worked. From pyproject.toml:16:
pyproject.toml:16 (exact pin)

```toml
"claude-agent-sdk==0.1.53",  # 0.1.53: stable sonnet-only test baseline; 0.1.50+ for RateLimitEvent and Bedrock header fix
```
The comment documents why the pin floor is at 0.1.50 (a specific feature and a specific bug fix) and why 0.1.53 is the chosen point (a stable test baseline). It does not document why the pin is exact — why ==0.1.53 rather than >=0.1.53,<0.2.0. The answer, which is not written down anywhere, is that 0.1.53 is the last version known to honor the dispatch contract the codebase depends on.
The pin is doing two jobs. As a stability pin, it ensures reproducibility: every CI run, every deploy, every developer's local environment installs the same SDK version. As a behavior freeze, it prevents the silent dispatch widening from striking again. The first job is well understood; the second is invisible. A new engineer reading pyproject.toml sees a normal exact-version pin and has no way to know that bumping it requires re-auditing several files.
This is a category of technical debt the dependency-management ecosystem does not yet name. PEP 440 22 gives us version specifiers; semantic versioning 10 gives us a convention for API compatibility; neither gives us a convention for behavioral compatibility below the patch level. The pin comment documents the minimum version we need for features, but the maximum version we can tolerate is encoded only in the pin's equality operator. If the pin were ever loosened — even accidentally, during a routine dependency cleanup — the incident would recur.
Every SDK release above 0.1.53 is therefore a potential silent breakage spanning multiple files: callbacks.py, hooks.py, agent.py, and pyproject.toml itself. The pin is not a fix; it is a deferral, with no scheduled re-examination.
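What the equality operator silently excludes can be made concrete with a small comparison. This is a simplified sketch: it handles only plain dotted release numbers rather than the full PEP 440 grammar that real resolvers implement, and the candidate versions other than 0.1.50 and 0.1.53 are hypothetical examples, not a claim about which releases exist.

```python
def parse(version: str) -> tuple[int, ...]:
    """Parse a plain dotted release such as '0.1.53' (no pre/post/dev segments)."""
    return tuple(int(part) for part in version.split("."))


def exact_pin(version: str) -> bool:
    # Mirrors the specifier ==0.1.53
    return parse(version) == (0, 1, 53)


def bounded_range(version: str) -> bool:
    # Mirrors the specifier >=0.1.53,<0.2.0
    return (0, 1, 53) <= parse(version) < (0, 2, 0)


# Candidate versions to compare; values above 0.1.53 are hypothetical.
candidates = ["0.1.50", "0.1.53", "0.1.54", "0.1.60", "0.2.0"]

# Maps each version to (admitted by exact pin, admitted by bounded range).
admitted = {v: (exact_pin(v), bounded_range(v)) for v in candidates}
```

Every version for which the two flags differ is a release that a routine loosening of the pin would start installing, and therefore a release whose behavior would need auditing first.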
The Accumulation Cost
Each behavioral assumption raises the upgrade tax monotonically. The first time a team writes raise RuntimeError("unexpected tool") in a callback, the audit cost of an SDK upgrade is one site. The second time they write message.data["session_id"], it is two. By the time the codebase has accumulated the four active hazard sites described in this paper, the audit cost of any SDK upgrade is four sites across three files, plus pyproject.toml itself.
| Location | Hazard type | Description |
|---|---|---|
| callbacks.py:68 | can_use_tool dispatch guard | raises RuntimeError on any non-AskUserQuestion tool |
| hooks.py:104 | # type: ignore return | behavioral assumption about response shape bypasses type checker |
| agent.py:302–304 | global environment-variable mutation | contract about mutation order and isolation |
| agent.py:484 | message.data["session_id"] | payload contract: assumes key always present |
This cost compounds because each site requires not only re-reading the local code but also re-reading the SDK changelog or source for the relevant version range, identifying whether any behavioral contract has shifted, and either updating the local code or confirming that no update is needed. In practice, this work is rarely done in advance. It is done — if it is done at all — as part of the incident response after the upgrade has already broken production.
The pin is, in effect, a deferred renegotiation with the SDK author. Every release the team skips is a contract change they have not read. When the renegotiation finally happens — driven by a security CVE, a required feature, or a deprecation deadline — the team must pay the accumulated cost in one transaction. This is the same dynamic that makes long-lived database schema migrations and long-lived OS upgrade cycles more painful per release than frequent, smaller ones. 6
The mitigation is not "upgrade more often" in the abstract; it is to make the upgrade tax visible at the time the assumption is written, so that the cost is amortized rather than spiked. We discuss concrete mechanisms in §11.
Why This Is Hard to Test
A reasonable reader will object: surely tests should catch this. They do not, and the reasons are structural.
Unit tests mock the SDK. They have to: invoking the real SDK from a unit test requires a model endpoint, network access, and non-deterministic latency. The mock is constructed by the test author against their understanding of the SDK's behavior — which is to say, against the same mental model that encoded the invisible contract. The mock fires can_use_tool only for AskUserQuestion, because that is what the test author believes the SDK does. The test passes. The contract change does not surface.
The Kai codebase illustrates this exactly. A test file test_kai_callbacks.py exists and contains a test named test_non_ask_user_question_tool_raises that pins the old dispatch contract: it asserts that can_use_tool raises RuntimeError for any non-AskUserQuestion tool. The test is well-written and passes against the current code. What it cannot detect is the SDK-side behavioral change, because the test exercises the application's handler against the application's mental model of the SDK — not against the real SDK's dispatch behavior. After an SDK upgrade that widens dispatch, the test would still pass; it has no way to know the SDK now invokes the callback under additional conditions. The test pins the assumption it should be challenging.
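The mechanism can be sketched with a hand-written test double. The FakeSdk class and the test bodies below are hypothetical reconstructions for illustration; only the test name test_non_ask_user_question_tool_raises comes from the codebase, and its real body is not shown in this paper.

```python
import asyncio


class Handler:
    async def can_use_tool(self, tool_name: str, input_data: dict) -> str:
        # Same guard as the case study, with a string standing in for the result type.
        if tool_name != "AskUserQuestion":
            raise RuntimeError(f"can_use_tool called for unexpected tool: {tool_name}")
        return "allow"


class FakeSdk:
    """Hand-written test double (hypothetical). Its dispatch rule encodes the
    test author's belief about the SDK, not the SDK's actual behavior."""

    def __init__(self, callback):
        self.callback = callback

    async def run_tool(self, tool_name: str) -> str:
        if tool_name == "AskUserQuestion":  # the assumption, copied into the mock
            return await self.callback(tool_name, {})
        return "ran without a permission check"


def test_non_ask_user_question_tool_raises() -> None:
    # Calls the handler directly; the SDK's dispatch logic is never exercised.
    try:
        asyncio.run(Handler().can_use_tool("read_file", {}))
    except RuntimeError:
        return
    raise AssertionError("expected RuntimeError")


def test_turn_against_fake_sdk() -> None:
    sdk = FakeSdk(Handler().can_use_tool)

    async def turn() -> list[str]:
        return [await sdk.run_tool(t) for t in ("read_file", "AskUserQuestion")]

    assert asyncio.run(turn()) == ["ran without a permission check", "allow"]
```

Both tests pass regardless of which SDK version is installed, because neither imports the SDK. The dispatch rule under test lives in FakeSdk.run_tool, which the test author wrote from the same mental model as the handler.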
Integration tests can, in principle, run against the real SDK. In practice, they are expensive — slow, flaky, costly to run on every PR — and they are typically scoped to a small number of happy-path flows. The flow that triggers the contract change is a flow the test suite has not been written to cover, because the team did not know it needed to be covered.
Type checking passes by construction: the contract is not in the types. Linters pass: there is no syntactic marker for behavioral assumption. The first real signal of breakage is, almost invariably, a production crash or a careful reading of the SDK changelog. Both are reactive.
Contract testing in the consumer-driven style 23 (Pact and its descendants) offers a partial mitigation: the consumer publishes a contract specifying its expectations, the provider verifies against it, and contract changes become visible at the version boundary. This works well for HTTP APIs and is increasingly used for asynchronous messaging. It has not, to our knowledge, been adapted to in-process SDK callbacks. We treat this as an open area in §11.
Industry Patterns
Other language ecosystems have wrestled with the same problem and arrived at partial solutions, none of which fully translate to Python but all of which are instructive.
Java uses the JetBrains @Contract annotation 13 to encode a small vocabulary of behavioral contracts directly in source: purity, nullability of inputs and outputs, conditional return values. The annotation is read by IntelliJ's static analyzer and can be exported in formats other tools consume. It does not handle dispatch contracts, but it encodes enough of payload and ordering to be useful in practice. The Java Modeling Language (JML) 14 goes further, offering full design-by-contract preconditions and postconditions, at the cost of more annotation overhead than most teams will accept.
Rust does not have a general behavioral contract notation, but uses the unsafe keyword 18 as an explicit marker of behavioral contract territory. Any unsafe block is, by convention, accompanied by a comment specifying the invariants the programmer is asserting. The compiler does not check the invariants, but the marker is mandatory: there is no way to enter unsafe territory accidentally. Trait contracts in Rust are documented in the standard library with conventional # Safety and # Panics sections, which serve as informal but consistently-located behavioral contracts.
Go relies almost entirely on interface documentation conventions. The standard library's io.Reader documentation 18 is the canonical example: the type signature is trivial (Read(p []byte) (n int, err error)), but the documentation specifies six distinct behavioral contracts: partial reads, EOF semantics, error-and-bytes interaction, concurrency, buffer mutation, and idempotency. The contracts are enforced only by convention, but they are uniformly located: every interface in the Go ecosystem is expected to document them in the same way.
The gap in Python is structural. There is no standard place — below the level of typing.Protocol definitions, and Protocol does not encode behavior either — to record a behavioral contract such that it is consistently located, machine-readable, and at least lint-checkable for presence. PEP 727 8 gestures in this direction but is concerned with rendered documentation rather than enforceable contract. Libraries like icontract 14 and deal 14 offer design-by-contract runtime checking, but they require explicit precondition and postcondition expressions and are not widely adopted.
The practical implication is that Python codebases must invent their own convention. The convention this paper proposes, building on the Rust # Safety pattern and the Go interface-doc pattern, is described in §11.
Detection Patterns
Before mitigation, detection. Five signals reliably indicate the presence of an invisible behavioral contract in Python code that integrates with an SDK. None of them is sufficient on its own, but in combination they catch the great majority of real cases.
Signal 1: # type: ignore on a return statement or assignment of an SDK-derived value.
The marker indicates that the type system has been overridden in favor of the author's behavioral knowledge. Every such marker is a latent contract. See the hooks module of KellerAI's internal codebase (line 104) for the canonical case.
Signal 2: Bare dict subscript or .get() access on an SDK-typed object.
The pattern message.data["field"] or input_data.get("tool_name") indicates a payload contract not encoded in the type system. The author is asserting, without compiler support, that the named key will be present and of the expected type.
Signal 3: raise of a non-domain exception inside an SDK callback.
Constructions of the form raise RuntimeError("unexpected ...") inside a function invoked by the SDK rather than by application code are nearly always a defensive assertion against a dispatch contract. The exception is the author's machine-readable expression of "the SDK should not ever call me this way".
Signal 4: Exact-version dependency pins on SDKs that are pre-1.0 or otherwise unstable.
A pin of the form package==0.1.53 with no comment explaining why newer versions are excluded is suspicious. The pin is doing semantic work, usually a behavior freeze, that is not expressible in the version specifier. Audit pins of this shape carefully.
Signal 5: Natural-language hedges in code comments.
Phrases such as "this should be fine", "we assume", "as far as I can tell", and "this is currently true" are author confessions of unverified behavioral assumption. A comment near line 434 of the agent module in KellerAI's internal codebase ("this should be fine since we are likely only confined to Claude running here") is a textbook instance.
A simple grep-based audit using these five signals can be run on any codebase in under a minute. The signals are noisy — some # type: ignore markers are legitimate, some hedges are well-considered — but they are an effective starting point. They convert a category of risk that is otherwise invisible into a finite list of lines to read.
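A minimal version of such an audit, written as a line-oriented scanner, might look like the following. The regular expressions are assumptions chosen to match the examples quoted in this paper (for instance, the variable names input_data and message.data); a real audit would tune them to the codebase. The scanner cannot tell whether a raise sits inside an SDK callback, which is one source of the noise mentioned above.

```python
import re

# One illustrative pattern per signal; all patterns are assumptions to be tuned.
SIGNALS = {
    "type-ignore": re.compile(r"#\s*type:\s*ignore"),
    "dict-access": re.compile(r"\b(?:input_data|message\.data)(?:\[[\"']|\.get\()"),
    "callback-raise": re.compile(r"\braise\s+RuntimeError\("),
    "exact-pin": re.compile(r"==\s*0\.\d+(?:\.\d+)*"),
    "hedge": re.compile(
        r"should be fine|we assume|as far as I can tell|currently true",
        re.IGNORECASE,
    ),
}


def scan(text: str) -> list[tuple[int, str]]:
    """Return (line number, signal name) for every signal found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SIGNALS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits
```

Running scan over the contents of each source file and over pyproject.toml yields the finite list of lines to read; the output is a starting inventory, not a verdict.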
Mitigation Strategies
Detection without mitigation is just inventory. Five concrete mitigations follow.
Mitigation 1: Explicit guard with log-and-alert, not crash.
The can_use_tool guard at callbacks.py:68 should not raise RuntimeError. It should log at error level, emit a metric or alert, and return a safe default (typically deny) for any unexpected tool. The application continues to run; the operator gets a signal; the contract violation is observable without being fatal. The fix is two lines.
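A minimal sketch of such a guard follows. The two result classes are hypothetical stand-ins defined locally so the example runs on its own (the real classes come from the SDK), the Counter is a placeholder for whatever metrics client the application uses, and the AskUserQuestion branch is reduced to a bare allow.

```python
import asyncio
import logging
from collections import Counter
from dataclasses import dataclass

logger = logging.getLogger("agent.callbacks")


# Hypothetical stand-ins for the SDK's result types.
@dataclass
class PermissionResultAllow:
    pass


@dataclass
class PermissionResultDeny:
    message: str = ""


# Placeholder for a real metrics counter, keyed by tool name.
unexpected_tools: Counter = Counter()


async def can_use_tool(tool_name: str, input_data: dict, context=None):
    if tool_name != "AskUserQuestion":
        # Contract violation: record it and answer safely instead of raising.
        unexpected_tools[tool_name] += 1
        logger.error("can_use_tool called for unexpected tool: %s", tool_name)
        return PermissionResultDeny(message=f"no permission policy for {tool_name}")
    return PermissionResultAllow()


def decide(tool_name: str):
    """Synchronous convenience wrapper for trying the callback by hand."""
    return asyncio.run(can_use_tool(tool_name, {}))
```

Denying keeps the process alive but still blocks the tool, so under a widened dispatch contract the agent would degrade rather than crash. Whether unexpected tools should instead be allowed is a policy decision that the guard makes explicit.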
Mitigation 2: TypedDict with explicit key checks.
Model every SDK payload as a TypedDict with total=False, and access fields through a typed wrapper that performs explicit key existence checks and raises a domain-specific exception with a clear message when a key is missing. The wrapper is where the payload contract is encoded; if the SDK changes the schema, the wrapper is the single point of update.
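As a sketch of what such a wrapper could look like for the AskUserQuestion payload: the TypedDict below assumes only that a "questions" key holding a list is expected, since that is all the quoted code reveals. The names AskUserQuestionInput, PayloadContractError, and parse_ask_user_question are hypothetical.

```python
from typing import Any, Mapping, TypedDict


class AskUserQuestionInput(TypedDict, total=False):
    # Assumed shape: the quoted code shows only that "questions" is read.
    questions: list[Any]


class PayloadContractError(Exception):
    """Raised when an SDK payload does not match the shape this code assumes."""


def parse_ask_user_question(input_data: Mapping[str, Any]) -> AskUserQuestionInput:
    """Validate the payload once, at the boundary, and return a typed view."""
    if "questions" not in input_data:
        raise PayloadContractError(
            f"AskUserQuestion payload has no 'questions' key; got keys {sorted(input_data)}"
        )
    questions = input_data["questions"]
    if not isinstance(questions, list):
        raise PayloadContractError(
            f"'questions' should be a list, got {type(questions).__name__}"
        )
    return AskUserQuestionInput(questions=questions)
```

With total=False the type checker treats every key as optional, which forces the existence check into code. The error message lists the keys that did arrive, so a schema change shows up as a named contract failure rather than a bare KeyError.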
Mitigation 3: Behavioral contract tests per callback.
For every SDK callback, write a test that exercises the callback against a mock SDK that simulates every known dispatch case. For can_use_tool, that means a test for every tool the agent can call, not only AskUserQuestion. The test asserts that the callback behaves correctly under each dispatch case. If the SDK ever widens the dispatch contract, the test is updated; if the application's assumption was wrong, the test fails immediately.
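One possible shape for such a test is a loop over every known tool name. In this sketch the tool list is illustrative, string results stand in for the SDK's permission types, and tolerant_handler is a hypothetical replacement for the guarded callback.

```python
import asyncio

# Illustrative tool list; a real suite would enumerate the tools the agent is
# actually configured with.
KNOWN_TOOLS = ["AskUserQuestion", "read_file", "run_bash"]


async def guarded_handler(tool_name: str, input_data: dict) -> str:
    """The case-study shape: raises for anything but AskUserQuestion."""
    if tool_name != "AskUserQuestion":
        raise RuntimeError(f"can_use_tool called for unexpected tool: {tool_name}")
    return "allow"


async def tolerant_handler(tool_name: str, input_data: dict) -> str:
    """A hypothetical replacement that returns a decision for every tool."""
    return "allow" if tool_name == "AskUserQuestion" else "deny"


def dispatch_outcomes(handler) -> dict[str, str]:
    """Invoke the handler once per known tool and record what happened."""
    outcomes = {}
    for tool in KNOWN_TOOLS:
        try:
            outcomes[tool] = asyncio.run(handler(tool, {}))
        except Exception as exc:  # any exception is a failed dispatch case
            outcomes[tool] = f"raised {type(exc).__name__}"
    return outcomes


def failing_cases(handler) -> list[str]:
    """Names of tools for which the handler raised instead of deciding."""
    return [t for t, o in dispatch_outcomes(handler).items() if o.startswith("raised")]
```

A test that asserts failing_cases(handler) is empty fails for the guarded handler and passes for the tolerant one. The tool list itself becomes the written-down dispatch contract, and it is the thing to update when the SDK changes.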
Mitigation 4: CHANGELOG-BEHAVIORAL.md separate from the dependency changelog.
Maintain a project-local file that records every behavioral assumption the codebase makes about its SDK dependencies, the version range over which the assumption is known to hold, and the location in the codebase where the assumption is encoded. Update the file in the same pull request that adds or modifies the assumption. The file is the artifact that future engineers — and future versions of yourself — will read before bumping a pin.
Mitigation 5: Upper-bound pins with an audit annotation.
Replace exact pins on behaviorally-sensitive dependencies with bounded ranges, and annotate the upper bound with the audit requirement. The bound is the technical mechanism. The comment is the social mechanism: it tells the next engineer that the bump is not routine and points them at the document that records why.
Mitigation 5 (upper-bound pin with audit annotation)

```toml
"claude-agent-sdk>=0.1.53,<0.2.0",  # behavioral-contract audit required before bumping; see CHANGELOG-BEHAVIORAL.md §can_use_tool
```
Status in the case-study codebase
The Kai codebase has applied these mitigations only partially. The table is included not to criticize the team but to show the gap that is typical of production codebases — including ones whose engineers are aware of the problem.
| Mitigation | Status | Notes |
|---|---|---|
| 1 — Log-and-alert guard | Not applied | callbacks.py:68 still raises RuntimeError; the fix was committed (0a3b81a3) and reverted (65f507a3). |
| 2 — TypedDict payloads | Not applied | input_data["questions"] (callbacks.py:105) and message.data["session_id"] (agent.py:484) remain stringly-typed. |
| 3 — Per-dispatch callback tests | Not applied | test_kai_callbacks.py::test_non_ask_user_question_tool_raises pins the old contract rather than testing each dispatch case. |
| 4 — CHANGELOG-BEHAVIORAL.md | Not applied | No such file exists in the repository. |
| 5 — Upper-bound pin with audit annotation | Partial | pyproject.toml:16 is an exact pin (==0.1.53) with a feature-floor comment but no upper-bound audit annotation. |
The General Pattern
can_use_tool is one instance of a universal pattern. Any function that is invoked by a framework rather than by the application — any callback, hook, event handler, signal receiver, plugin entry point, or middleware — has behavioral contracts that exceed its type signature. The contracts are not specific to LLM SDKs, to Python, or to any particular framework. They are the cost of inversion of control. 18
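The taxonomy from §02 generalizes: dispatch, payload, ordering, calling-frequency, and authorization-delegation contracts all arise wherever a framework, rather than the application, decides when a function runs.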
The pattern applies to pytest plugins 11, Django signals, React hooks, SQLAlchemy events, Flask middleware, and many others. Each ecosystem has evolved its own conventions for documenting these contracts, ranging from thorough (React's exhaustive hook rules) to nearly absent (most Python SDKs). The general lesson is that every inversion-of-control surface needs a behavioral contract notation, and the lack of one is a recurring source of production incidents.
Conclusion
We return to the opening scene. The deployment went out. The agent crashed. Commit 65f507a3 reverted the SDK pin. The incident closed in hours.
The fix was small and effective. It was also not a fix. The underlying behavioral assumption — that can_use_tool fires only for AskUserQuestion — was not removed, not documented, not tested, and not made visible. It was preserved in amber, expressed as the equality operator in a pyproject.toml pin. This was not amnesia: the team had encountered the assumption, attempted a forward fix in 0a3b81a3, and chose eight days later to revert and freeze the version instead. The pin is the deliberate codification of a deferral. The next engineer to bump that pin, weeks or months from now, will rediscover the assumption in the same way the team did the first time: through a production crash.
The real fix is to surface the contract: to write down, in code and in tests and in documentation, what the application is assuming about the SDK it depends on, so that the next time that assumption is violated, the violation is caught at the boundary rather than in production.
The cost is modest. The five mitigations in §11 take, in aggregate, perhaps a day of focused work for a codebase of the size we have discussed. The cost of not doing it is unbounded, paid in increments, on schedules chosen by the SDK author rather than by the consumer.
Behavioral contracts exist whether or not you write them down. Writing them down is the cheapest production-grade engineering investment available, and the one most consistently neglected. The assumption you didn't write down is the production incident you haven't had yet.