kellerai.blog

The Assumption You Didn't Write Down

How invisible SDK behavioral contracts become production incidents.

KellerAI White Paper · Dependency & Contracts · May 2026

Context

On a routine Tuesday, an SDK upgrade went out. Type checks passed. The full unit-test suite passed. Linters were clean. CI was green. Within minutes of the deploy, the agent runtime started crashing on every conversation that issued a tool call. The fix was a one-line revert. The incident closed in hours. The assumption that caused it remained, expressed not as a comment or a test, but as a version pin in a config file. A behavioral dispatch contract had changed — which events cause a callback to fire — and Python's type system does not encode behavioral dispatch contracts. There was no artifact for a type checker to read, because there was no place where the original behavior had been written down.

The Finding

Behavioral contracts (the dispatch, payload, ordering, frequency, and authorization-delegation assumptions that surround every SDK callback) are production hazards that are systematically invisible to type checkers, linters, and unit tests. The real fix is to surface the contract: to write down, in code, tests, and documentation, what the application assumes about the SDK it depends on, so the next violation is caught at the boundary rather than in production. Five mitigations do this: explicit guards that log and alert rather than crash, TypedDict payloads with explicit key checks, behavioral contract tests per callback, a behavioral changelog kept separate from the dependency changelog, and upper-bound pins with explicit audit notes.

Tags: SDK Reliability, Behavioral Contracts, Production Safety
Paper Details
Category: Dependency & Contracts
Audience: Engineering teams, SDK consumers, and platform architects building reliable agent systems
Method: Incident case study (KellerAI internal, kwp-isha.10), a taxonomy of behavioral contracts, and detection and mitigation patterns
Length: ~1,000 · 4 min
Sections: 7
Date: May 2026
Authors: KellerAI
Section 01

The incident that passed every check

On a routine Tuesday, an SDK upgrade went out. Type checks passed. The full unit-test suite passed. Linters were clean. CI was green. Within minutes of the deploy, the agent runtime started crashing on every conversation that issued a tool call. The fix — recorded as commit 65f507a3 (KellerAI internal codebase) — was a one-line revert. The incident closed in hours. The assumption that caused it remained, expressed not as a comment or a test, but as a version pin in a config file.

The assumption was simple: that the SDK's can_use_tool callback fires only when the agent decides to ask the user a question. Under one version of the SDK that was true. Under the next minor release, the same callback began firing for every tool the agent called. The function name did not change. The argument types did not change. The return type did not change. What changed was — in the paper's phrase — the “behavioral dispatch contract,” and Python's type system does not encode behavioral dispatch contracts. There was no artifact for a type checker to read, because there was no place where the original behavior had been written down.
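The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the real SDK: the tool names (`ask_user`, `read_file`), the callback signature, and both dispatcher functions are hypothetical stand-ins for the two releases. The point is that both dispatchers accept exactly the same callback type, so nothing a type checker can see distinguishes them.

```python
from typing import Any, Callable

# Hypothetical names throughout: the callback signature, the tool names, and
# both dispatchers are illustrative stand-ins, not the real SDK's API.
ToolCallback = Callable[[str, dict[str, Any]], str]
ToolCall = tuple[str, dict[str, Any]]


def can_use_tool(tool_name: str, tool_input: dict[str, Any]) -> str:
    """Application callback written under the unwritten assumption that it
    fires only for user-facing questions."""
    if tool_name != "ask_user":
        raise RuntimeError(f"unexpected tool: {tool_name}")
    return "allow"


def dispatch_old(calls: list[ToolCall], cb: ToolCallback) -> list[str]:
    """Stand-in for the earlier release: only questions reach the callback."""
    return [cb(name, args) for name, args in calls if name == "ask_user"]


def dispatch_new(calls: list[ToolCall], cb: ToolCallback) -> list[str]:
    """Stand-in for the later release: every tool call reaches the callback."""
    return [cb(name, args) for name, args in calls]


calls: list[ToolCall] = [
    ("ask_user", {"question": "Proceed?"}),
    ("read_file", {"path": "notes.txt"}),
]
```

Calling `dispatch_old(calls, can_use_tool)` completes normally, while `dispatch_new(calls, can_use_tool)` raises `RuntimeError` on the `read_file` call, even though the callback and its types are identical in both cases.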

Section 02

Type contracts and behavioral contracts

A type contract says: this function takes a string and a dict and returns a result. That is checkable. A behavioral contract says: this function is invoked exactly once per user-facing question, only when the agent decides to ask, and never for ordinary tool calls. That is folklore. Around every callback sit at least four such contracts the type system cannot express: which events cause it to fire, which keys its dict arguments will contain, in what order a series of callbacks runs, and how many times each can fire per operation. Each is negotiated silently between the SDK author and the SDK user, and renegotiated, often without notice, at every release.

Section 03

Why the tests can't catch it

The natural objection is that tests should catch this. They do not, for a structural reason. Unit tests mock the SDK — they have to, since the real SDK needs a model endpoint and a network. But the mock is built by the test author against their own understanding of the SDK, which is the very same mental model that encoded the invisible assumption. The codebase even contains a well-written test asserting that can_use_tool raises an error for any unexpected tool. After a dispatch-widening upgrade, that test still passes — it has no way to see the SDK now behaves differently. It pins the assumption it should be challenging. Type checks pass by construction; the contract is not in the types. The first real signal of breakage is a production crash.
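A small sketch makes the structural problem concrete. Everything here is hypothetical (the `FakeSdk` mock, the `ask_user` tool name, the handler body); it is meant only to show how a hand-written mock carries the author's dispatch assumption into the test suite, so the suite stays green regardless of what the real SDK does.

```python
import unittest
from typing import Any


def can_use_tool(tool_name: str, tool_input: dict[str, Any]) -> str:
    # Hypothetical handler: rejects anything that is not a user-facing question.
    if tool_name != "ask_user":
        raise RuntimeError(f"unexpected tool: {tool_name}")
    return "allow"


class FakeSdk:
    """Hand-written mock. It encodes the test author's belief about dispatch:
    only 'ask_user' calls are forwarded to the callback."""

    def __init__(self, callback):
        self.callback = callback

    def run(self, calls):
        return [self.callback(n, a) for n, a in calls if n == "ask_user"]


class CanUseToolTest(unittest.TestCase):
    def test_rejects_unexpected_tool(self):
        # True of the handler, but it asserts the assumption rather than
        # checking it against what the SDK really dispatches.
        with self.assertRaises(RuntimeError):
            can_use_tool("read_file", {})

    def test_conversation_with_tool_call(self):
        # The mock filters dispatch the way the author expects, so this test
        # passes no matter how the real SDK's dispatch changes.
        sdk = FakeSdk(can_use_tool)
        result = sdk.run([("ask_user", {}), ("read_file", {})])
        self.assertEqual(result, ["allow"])
```

Both tests pass before and after a dispatch-widening upgrade, because neither one ever touches the real dispatcher.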

Section 04

An exact pin is a frozen assumption

After the incident, the team did not fix the underlying assumption. They pinned the SDK to the exact version that worked. That pin is doing two jobs at once. As a stability pin, it ensures every environment installs the same version — well understood. As a behavior freeze, it quietly prevents the silent dispatch change from striking again — invisible. A new engineer reading the config sees an ordinary exact-version pin and has no way to know that bumping it requires re-auditing several files. An exact-version pin on a fast-moving SDK is almost always a behavioral assumption preserved in amber. Loosen it during a routine cleanup and the incident recurs.
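One way to make the second job of the pin visible is to record the audited bound in code, next to the list of files that depend on it. The sketch below is an assumption-laden illustration: the version bound `(1, 4)`, the file paths, and the helper names are all placeholders, and it assumes version strings that begin with plain `major.minor` numbers.

```python
import re

# Hypothetical audit record: the version bound and file paths are placeholders
# chosen for illustration, not values from the incident.
AUDITED_MAX = (1, 4)  # highest major.minor whose callback behavior was audited
FILES_TO_REAUDIT = ("agent/callbacks.py", "agent/permissions.py")


def major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.4.2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def needs_audit(installed: str) -> bool:
    """True when the installed SDK is newer than the last audited release line."""
    return major_minor(installed) > AUDITED_MAX


def audit_message(installed: str) -> str:
    """Human-readable note telling the upgrader which files to re-read."""
    files = ", ".join(FILES_TO_REAUDIT)
    bound = f"{AUDITED_MAX[0]}.{AUDITED_MAX[1]}.x"
    return f"SDK {installed} is newer than audited {bound}; re-audit: {files}"
```

In a CI step or at startup, the installed version could be read with the standard library's `importlib.metadata.version(...)` and passed to `needs_audit`, so that a routine bump surfaces the audit note instead of silently thawing the frozen assumption.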

Section 05

The cost only grows

Every behavioral assumption raises the upgrade tax. The first defensive guard in a callback makes an SDK upgrade a one-file audit. The next stringly-typed dict access makes it two. Each skipped release is a contract change the team has not read, and the bill compounds. When the renegotiation finally arrives — forced by a security fix, a required feature, or a deprecation deadline — the team pays the whole accumulated cost in a single transaction, usually during an incident response rather than in advance.

Section 06

Write the contract down

The real fix is to surface the contract: to write down, in code and tests and documentation, what the application assumes about the SDK it depends on, so the next violation is caught at the boundary rather than in production. That means guards that log and alert instead of crashing, typed wrappers around SDK payloads, tests that exercise every dispatch case, a behavioral changelog kept beside the code, and version pins with an explicit upper-bound audit note. The cost is modest — perhaps a day of focused work. The cost of skipping it is unbounded, paid in increments, on a schedule the SDK author chooses rather than you.
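The first two mitigations, a guard that logs instead of crashing and a typed payload wrapper with explicit key checks, might look roughly like the sketch below. The tool name `ask_user`, the `question` key, the `"allow"`/`"deny"` return values, and the choice of `"deny"` as the fallback are all assumptions made for illustration; the right default is a policy decision for each application.

```python
import logging
from typing import Any, Optional, TypedDict

logger = logging.getLogger("sdk_contract")

# Hypothetical policy choice: what to answer when the contract is violated.
DEFAULT_DECISION = "deny"


class QuestionInput(TypedDict):
    """Payload keys the application assumes for a user-facing question.
    Hypothetical shape; the real SDK's payload may differ."""
    question: str


REQUIRED_KEYS = ("question",)


def parse_question(raw: dict[str, Any]) -> Optional[QuestionInput]:
    """Check the assumed keys explicitly instead of indexing blindly."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        logger.error("contract violation: payload missing keys %s", missing)
        return None
    return QuestionInput(question=str(raw["question"]))


def can_use_tool(tool_name: str, tool_input: dict[str, Any]) -> str:
    if tool_name != "ask_user":
        # Dispatch is wider than assumed: record it and return a safe default
        # rather than raising inside the agent runtime.
        logger.warning("contract violation: can_use_tool fired for %r", tool_name)
        return DEFAULT_DECISION
    payload = parse_question(tool_input)
    if payload is None:
        return DEFAULT_DECISION
    return "allow"
```

With this shape, a dispatch-widening release produces log lines that an alert can key on, and the written-down assumptions (tool name, required keys, fallback decision) sit in one place for the next audit.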

Section 07

Further reading

For the full argument (the can_use_tool case study line by line, a five-part taxonomy of behavioral contracts, how Java, Rust, Go, and gRPC handle the same problem, and detection and mitigation patterns with code), read the companion technical whitepaper, The Assumption You Didn't Write Down: In Depth.