The incident that passed every check
On a routine Tuesday, an SDK upgrade went out. Type checks passed. The full unit-test suite passed. Linters were clean. CI was green. Within minutes of the deploy, the agent runtime started crashing on every conversation that issued a tool call. The fix — recorded as commit 65f507a3 (KellerAI internal codebase) — was a one-line revert. The incident closed in hours. The assumption that caused it remained, expressed not as a comment or a test, but as a version pin in a config file.
The assumption was simple: that the SDK's can_use_tool callback fires only when the agent decides to ask the user a question. Under one version of the SDK that was true. Under the next minor release, the same callback began firing for every tool the agent called. The function name did not change. The argument types did not change. The return type did not change. What changed was — in the paper's phrase — the “behavioral dispatch contract,” and Python's type system does not encode behavioral dispatch contracts. There was no artifact for a type checker to read, because there was no place where the original behavior had been written down.
Type contracts and behavioral contracts
A type contract says: this function takes a string and a dict and returns a result. That is checkable. A behavioral contract says: this function is invoked exactly once per user-facing question, only when the agent decides to ask, and never for ordinary tool calls. That is folklore. Around every callback sit four more contracts the type system cannot express — which events cause it to fire, which keys its dict arguments will contain, in what order a series of callbacks runs, and how many times each can fire per operation. Each is negotiated silently between the SDK author and the SDK user, and renegotiated, often without notice, at every release.
Why the tests can't catch it
The natural objection is that tests should catch this. They do not, for a structural reason. Unit tests mock the SDK — they have to, since the real SDK needs a model endpoint and a network. But the mock is built by the test author against their own understanding of the SDK, which is the very same mental model that encoded the invisible assumption. The codebase even contains a well-written test asserting that can_use_tool raises an error for any unexpected tool. After a dispatch-widening upgrade, that test still passes — it has no way to see the SDK now behaves differently. It pins the assumption it should be challenging. Type checks pass by construction; the contract is not in the types. The first real signal of breakage is a production crash.
An exact pin is a frozen assumption
After the incident, the team did not fix the underlying assumption. They pinned the SDK to the exact version that worked. That pin is doing two jobs at once. As a stability pin, it ensures every environment installs the same version — well understood. As a behavior freeze, it quietly prevents the silent dispatch change from striking again — invisible. A new engineer reading the config sees an ordinary exact-version pin and has no way to know that bumping it requires re-auditing several files. An exact-version pin on a fast-moving SDK is almost always a behavioral assumption preserved in amber. Loosen it during a routine cleanup and the incident recurs.
The cost only grows
Every behavioral assumption raises the upgrade tax. The first defensive guard in a callback makes an SDK upgrade a one-file audit. The next stringly-typed dict access makes it two. Each skipped release is a contract change the team has not read, and the bill compounds. When the renegotiation finally arrives — forced by a security fix, a required feature, or a deprecation deadline — the team pays the whole accumulated cost in a single transaction, usually during an incident response rather than in advance.
Write the contract down
The real fix is to surface the contract: to write down, in code and tests and documentation, what the application assumes about the SDK it depends on, so the next violation is caught at the boundary rather than in production. That means guards that log and alert instead of crashing, typed wrappers around SDK payloads, tests that exercise every dispatch case, a behavioral changelog kept beside the code, and version pins with an explicit upper-bound audit note. The cost is modest — perhaps a day of focused work. The cost of skipping it is unbounded, paid in increments, on a schedule the SDK author chooses rather than you.
Further reading
For the full argument — the can_use_tool case study line by line, a five-part taxonomy of behavioral contracts, how Java, Rust, Go, and gRPC handle the same problem, and detection and mitigation patterns with code — read the companion technical whitepaper, The Assumption You Didn't Write Down: In Depth .