kellerai.blog

Fixpoints, Not Fixes

Resource-lifecycle bugs don't have a fix — they have a fixpoint.

KellerAI White Paper · In-Depth · Code Quality & Architecture · Jun 2026 · ~26 min read

Context

A fixpoint is a state the system returns to under its own dynamics. Most production incidents in async services are not failures of logic; they are failures to name the fixpoint the system will adopt when no other contract constrains it. A background task with no terminal-state contract has a fixpoint: the row stays INITIALIZED forever. A semaphore with no scope bound has a fixpoint: a deadlock that can hold for 93 minutes before an operator notices. A subscription system without identity-based teardown has a fixpoint: zombie handles that accumulate until the browser's connection pool collapses. Each of these is the system finding its own attractor in the absence of one the engineer forgot to encode.

This paper studies five resource-lifecycle anti-patterns drawn from a single production codebase — the KellerAI platform — between February & May 2026. Three of the five are evidenced by code present in the current repository HEAD. Two are evidenced by commit history: the remediation commits were applied but are not present in the current main branch at time of writing, reflecting ongoing branch reorganization. All commit hashes cited are verifiable via git show.

The Finding

Simplification done wrong produces compound debt that is observable in the engineering record. The cost of a missing contract is proportional to how long it remained unwritten. The fix is not a code change. The fix is naming the fixpoint: encoding the terminal-state invariant, the lock-scope bound, the ownership predicate, the ordering rule — such that the system cannot drift into the failure mode under any input. Fixes patch the symptom that surfaced. Fixpoints constrain every symptom that has not surfaced yet.

Each chapter pairs the fixpoint concept with a plain English restatement: 'the state the system stays in when the function that should move it forward does not run.' The point is not to import mathematical machinery but to import a habit: when reviewing async code, ask which terminal state is the attractor when each await boundary throws, times out, or is cancelled. If the answer is 'I do not know,' the contract is not yet encoded.

Tags: Async Invariants & Fixpoints · Resource Lifecycle Governance · Technical Debt Calculus
Cite this paper

KellerAI. (2026, June 1). Fixpoints, Not Fixes: In Depth. KellerAI. https://kellerai.blog/fixpoints-not-fixes-in-depth

Paper Details
Category: Code Quality & Architecture
Audience: Engineering teams, platform architects, & compliance leads responsible for production async systems & contract specification
Method: Field study of five production sagas (KellerAI platform, Feb-May 2026) + commit-series analysis + debt calculus from incident record + structured-concurrency literature (Erlang supervision trees, Trio nurseries, asyncio.TaskGroup) + design-by-contract methodology (Hoare, Meyer)
Length: ~6,300 · ~26 min
Reading level: Technical
Sections: 13
References: 22
Version: v1.0 · Updated Jun 2026
Published: Jun 2026
Key Takeaways
  • Five resource-lifecycle archetypes — unsupervised task, unscoped lock, unowned handle, unordered subscription, & unenforced invariant — share a structure: each is a system where the operative contract was never written down.
  • The cost of a missing contract is not the missing contract itself; it is the saga that has to be re-performed every time that contract resurfaces in a new shape — compounding non-linearly over time.
  • Five principles verifiable at code-review time encode the missing contracts: background task & status share an invariant; lock scope equals resource scope; subscriptions have identity-based ownership; subscribe before you publish; & comments are documentation but constraints are contracts.
Section 01

Executive Summary

A fixpoint is a state the system returns to under its own dynamics. Most production incidents in async services are not failures of logic; they are failures to name the fixpoint the system will adopt when no other contract constrains it. A background task with no terminal-state contract has a fixpoint: the row stays INITIALIZED forever. A semaphore with no scope bound has a fixpoint: a deadlock that can hold for 93 minutes before an operator notices. A subscription system without identity-based teardown has a fixpoint: zombie handles that accumulate until the browser's connection pool collapses. Each of these is the system finding its own attractor in the absence of one the engineer forgot to encode.

This paper studies five resource-lifecycle anti-patterns drawn from a single production codebase, the KellerAI platform, between February and May 2026. Three of the five (Chapters 1, 3, and 5) are evidenced by code present in the current repository HEAD (bbda054f). Two (Chapters 2 and 4) are evidenced by commit history: the remediation commits for those chapters were applied to the repository but are not present in the current main branch at time of writing, reflecting ongoing branch reorganization. All commit hashes cited are verifiable via git show. For each, we present the pre-fix code, the production incident that surfaced it, the commit series that drove the resolution, and the principle that would have prevented the saga had it been encoded the first time. The five archetypes (background-task supervision, concurrency-token scope, subscription handle ownership, subscribe-before-publish ordering, and comment-only invariants) share a structure: each is a system in which the operative contract was never written down, only assumed.

Simplification done wrong produces compound debt that is observable in the engineering record. The cost of a missing contract is proportional to how long it remained unwritten. The fix is not a code change. The fix is naming the fixpoint: encoding the terminal-state invariant, the lock-scope bound, the ownership predicate, the ordering rule—such that the system cannot drift into the failure mode under any input. Fixes patch the symptom that surfaced. Fixpoints constrain every symptom that has not surfaced yet.

Section 02

The Fixpoint Frame

We borrow the term “fixpoint” from dynamical systems informally. A fixpoint of a function f is a state x* satisfying f(x*) = x*; static program analysis uses fixed-point iteration, grounded in lattice theory (the Knaster-Tarski and Kleene fixpoint theorems, with the Banach fixpoint theorem as the metric-space counterpart), to prove that abstract interpretation converges. [1, 2] We do not import this mathematical machinery: software state machines are not metric spaces, and the analogy is intuitive, not formal. What we borrow is the habit of asking: from the current state, is there a function the system can apply that moves it forward?

A software system has fixpoints too, though we rarely call them that. The state of a database row is a fixpoint when no further transition is possible—when the status field has no successor under any code path the system can reach. A held semaphore is a fixpoint when no path from the current call stack leads to its release. A subscription handle is a fixpoint when no reachable code calls .close() on it. In each case the question is the same: from the current state, is there a function the system can apply that moves it forward? If not, the current state is the fixpoint.

The vocabulary matters because it forces a precise question. “Why is this project stuck in INITIALIZED?” invites speculation about transient causes—a slow disk, a network hiccup, a deployment race. “What is the terminal-state fixpoint of run_kai when prepare_kai_workspace raises?” invites a specific answer: the task exits, no status update fires, the row's fixpoint is INITIALIZED, and it will remain INITIALIZED until an external operator changes it. The second framing makes the bug visible at code-review time. The first framing surfaces only when production accumulates enough orphans to be noticed.
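The review question above can be made mechanical with a small table. The sketch below is a minimal illustration, not code from the KellerAI repository: ON_FAILURE, DECLARED_TERMINAL, and undeclared_fixpoints are hypothetical names, and the failure edges are assumed for the example.

```python
from typing import Dict, List, Optional, Set

# Assumed table: for each status, the status the system moves to when the
# step that should advance it raises. None means no code path handles it.
ON_FAILURE: Dict[str, Optional[str]] = {
    "INITIALIZED": None,      # pre-fix behavior: the exception escapes the task
    "THINKING": "FAILED",
    "INPUT_NEEDED": "FAILED",
    "FAILED": None,           # nothing moves it, and that is intended
}

# Statuses that someone has explicitly declared as terminal.
DECLARED_TERMINAL: Set[str] = {"FAILED"}


def undeclared_fixpoints(
    on_failure: Dict[str, Optional[str]], declared_terminal: Set[str]
) -> List[str]:
    """Return statuses the system can get stuck in that nobody declared terminal."""
    return sorted(
        status
        for status, successor in on_failure.items()
        if successor is None and status not in declared_terminal
    )


if __name__ == "__main__":
    print(undeclared_fixpoints(ON_FAILURE, DECLARED_TERMINAL))
```

Any status the function returns is an attractor the code permits but no contract names, which is the condition this paper calls an unnamed fixpoint.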

We use the term lightly throughout this paper. Every chapter pairs it with a plain English restatement: “the state the system stays in when the function that should move it forward does not run.” The point is not to import mathematical machinery but to import a habit: when reviewing async code, ask which terminal state is the attractor when each await boundary throws, times out, or is cancelled. If the answer is “I do not know,” the contract is not yet encoded.

Section 03

The Debt Calculus: Cost Proportional to Time

Technical debt as compound interest is the metaphor popularized by Martin Fowler [15] and a quantifiable empirical phenomenon in modern measurement frameworks. [14] What the metaphor misses is that the principal is the unwritten contract and the interest payments are the operational sagas that accumulate around it. Each saga is a datum. A git log --grep=ARC-446 that returns four commits over thirty-six hours is a record of an unwritten contract being inferred from production behavior, one edge case at a time.

The KellerAI platform's incident record shows the calculus directly. ARC-446, the subscription handle ownership saga, produced four commits over two days: an initial cleanup hook (9110ddbe), a race guard against post-cleanup map mutation (30b6e70e), a back-forward cache exception (c65c347f), and a consolidating PR (6d34326f). Each commit corrected an edge case the previous commit had not anticipated, and each anticipated only what production had already shown. ARC-485, the semaphore deadlock saga, produced a postmortem document, a timeout patch (e240e77c), and an observability backfill (dd2c7a19), none of which would have been written had the original __aenter__ declared “this semaphore must not be held across user-input waits.” ARC-477 and ARC-478 are restoration sagas: a backfill script for projects stranded at INPUT_NEEDED and a lifespan hydrator service for projects orphaned by pod restarts. Both were one-shot utilities, removed once the immediate fleet was repaired, and the commits remain in the history as the record of work that would not have been necessary had the original background task encoded its terminal-state contract.

Self-admitted technical debt studies confirm the pattern at scale: debt clusters around resource lifecycle, configuration, and unenforced invariants. [16] The accumulation is not linear. A contract unwritten for one day produces a hotfix. A contract unwritten for six months produces a hydrator service, a backfill script, an observability dashboard, a runbook entry, and four engineers who can each tell you a different war story about how it surfaced. We cannot establish the size of this cost gap counterfactually: we do not know what it would have cost to write those contracts at design time. For any codebase you maintain, apply the same estimation to your own incident record: how much engineering time did each saga cost, and how long was the contract unwritten before it surfaced?

The debt is not the missing contract. The debt is the saga that must be performed every time the missing contract surfaces in a slightly new shape.

Fixes patch the symptom that surfaced. Fixpoints constrain every symptom that has not surfaced yet.

This is why we frame the work as “fixpoints, not fixes.” A fix resolves a symptom and resets the principal to zero on that one symptom. A fixpoint (an encoded terminal-state contract, a runtime invariant check, a database constraint, a structured-concurrency boundary) resolves the symptom class and prevents the principal from accruing again. Hoare's axiomatic basis for programming [17] and Meyer's design-by-contract methodology [18, 19] both make the same point: the cost of an unwritten precondition is paid by every future reader who must reconstruct it from runtime behavior.

Section 04

INITIALIZED Forever: Background Task with No Terminal-State Contract

The simplest fixpoint in any state machine is the state the system never leaves. Inside run_kai() (KellerAI internal codebase — project-launch API route), a background asyncio task drives a new project from INITIALIZED through workspace preparation into either RUNNING or FAILED. The task is launched via asyncio.create_task, and its work happens inside a workspace-preparation routine (KellerAI internal codebase — workspace-preparation service) that touches the filesystem, clones a repository, and writes to the database. Any of those steps can raise. The original code specified no terminal status for the project if any of those steps raised.

The outer try/except in the project-launch route carries a comment that is itself an artifact of the saga: “Catch-all so unexpected prepare failures land as FAILED with a generic message—without it the task escaped to asyncio and the project sat at INITIALIZED forever.” The comment is the postmortem. The original task launched, the exception escaped to the event loop, the default asyncio behavior logged the traceback to stderr, and the database row stayed INITIALIZED indefinitely because no code path remained that could change it. The system's fixpoint, under the input “any exception inside prepare_kai_workspace,” was a non-terminal state that no human had declared as terminal.

The remediation required two separate repair efforts. Commit 5d371b18 (ARC-477) introduced a backfill script (KellerAI internal codebase) that scanned the projects table for rows with status='inputRequired' AND kai_session_id IS NULL and walked them back to a sane terminal state. The script was a one-shot utility and has since been removed from the codebase; the authoritative version remains in that commit. Commit 648a0b18 (ARC-478) added a lifespan hydrator service (KellerAI internal codebase, pod-startup resume service) that ran during pod startup and resumed or terminated projects left in THINKING, CANCELLING, or INPUT_NEEDED when the previous pod died. Both commits are recorded evidence of the work required to repair a fleet whose original supervision contract did not exist.

Structured concurrency frameworks treat this anti-pattern as a foundational error. Erlang's supervision trees, [3] Trio's nursery model, [4] and the structured-concurrency proposals adapted into asyncio task groups [5] share an invariant: a task's exit and a parent-known status transition are tied. The parent observes every child's outcome. The child cannot escape into a runtime that does not represent it. Plain asyncio.create_task does not enforce this; references to created tasks can be dropped, exceptions can escape to the default exception handler, and the database row that records the task's status is the only durable witness to its outcome. If the task fails to update that row, the row's value becomes the fixpoint of a system in which no other code can change it.

The principle: a background task and a status field share an invariant. Task exit if and only if status is terminal. Encode it through a structured-concurrency wrapper, a context-manager-scoped task launcher, or at minimum an explicit try/except/finally that sets a terminal status in every branch—or accept that a future operator will write a backfill script when the fleet has drifted far enough to be noticed. The catch-all at projects.py:583-586 is the encoded form of that invariant in the current codebase. The comment above it is the record of what it cost to learn it.
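As a minimal sketch of the "explicit try/except/finally" form of this invariant, the wrapper below guarantees that a row leaves INITIALIZED on every exit path, including cancellation. It is not the code at projects.py:583-586: supervised_launch and the in-memory STATUS dictionary are hypothetical stand-ins for the project-launch route and the projects table.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# In-memory stand-in for the projects table (assumption for this sketch).
STATUS: Dict[str, str] = {}


async def supervised_launch(
    project_id: str, prepare: Callable[[], Awaitable[None]]
) -> None:
    """Run `prepare` and record RUNNING or FAILED on every exit path."""
    STATUS[project_id] = "INITIALIZED"
    outcome = "FAILED"  # pessimistic default, overwritten only on success
    try:
        await prepare()
        outcome = "RUNNING"
    except asyncio.CancelledError:
        raise  # propagate cancellation; `finally` still records FAILED
    except Exception:
        pass  # a real service would log the traceback and a failure message
    finally:
        STATUS[project_id] = outcome  # task exit implies a non-INITIALIZED status


async def _demo() -> None:
    async def failing_prepare() -> None:
        raise RuntimeError("clone failed")

    await supervised_launch("demo", failing_prepare)
    print(STATUS["demo"])


if __name__ == "__main__":
    asyncio.run(_demo())
```

The pessimistic default is the design choice worth noting: the status write does not depend on which except branch ran, so a new failure mode cannot reintroduce the stuck row.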

Section 05

93 Minutes: Concurrency Token Outlives Its Resource Scope

_kai_semaphore (KellerAI internal codebase, agent client module), defined as asyncio.Semaphore(settings.max_concurrent_kai_processes), is the platform's cap on parallel agent runs. The default capacity is two. The semaphore is acquired in the client's __aenter__ before the client connects to the underlying agent process; it is released in __aexit__ inside a finally clause. In isolation the pattern is unremarkable: a context-managed bounded resource pool, the standard Python idiom for capacity control.

```python
# KellerAI internal codebase, agent client (illustrative; not verbatim)
import asyncio

_agent_semaphore = asyncio.Semaphore(settings.max_concurrent_agent_processes)


class AgentClient:
    async def __aenter__(self):
        await _agent_semaphore.acquire()  # held until __aexit__
        # ...

    async def __aexit__(self, *_):
        _agent_semaphore.release()
```

The pattern broke on production traffic in the incident recorded as ARC-485. The Kai client's lifetime is not the lifetime of the agent's compute work. A Kai run can transition into INPUT_NEEDED and wait for a user to type a response into the web UI. While it waits, the client is still alive and the semaphore is still held. The user, on a Tuesday afternoon, may close the laptop and forget. On the day of the incident, two such abandoned sessions held both semaphore slots for ninety-three minutes. Three new project creations queued silently behind them. There was no alert, no error log, and no metric that surfaced the condition—the system was operating within its declared capacity envelope, doing exactly what its code instructed: holding a token until its scope ended, when its scope had been silently redefined by the addition of a user-input wait.

The remediation is recorded in commit e240e77c (ARC-485) and its observability follow-up dd2c7a19. Both commits exist in the repository's git history and are verifiable via git show, but neither is an ancestor of the current main branch HEAD (bbda054f): the fixes were committed and subsequently removed or superseded. The timeout fix, as recorded in e240e77c, wrapped the subscription wait inside handle_follow_up_question and handle_plan_proposal with asyncio.wait_for(..., timeout=settings.kai_user_input_timeout_seconds). The observability fix, in dd2c7a19, introduced two Prometheus gauges, kai_semaphore_in_flight and kai_semaphore_waiting, and a thirty-second acquisition timeout on __aenter__. The current codebase HEAD does not contain either fix; the semaphore currently has no timeout on __aenter__ and no Prometheus instrumentation. The saga (postmortem, timeout patch, observability backfill) is preserved in the commit record as evidence of the work a missing scope contract generated. The principle it illustrates holds regardless of the current branch state.
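The shape of the timeout remediation can be sketched in a few lines. This is not the content of e240e77c: wait_for_answer and run_session are hypothetical names, and an asyncio.Queue stands in for the project's event subscription. Only asyncio.wait_for is taken from the description above.

```python
import asyncio
from typing import Optional


async def wait_for_answer(
    answers: "asyncio.Queue[str]", timeout_seconds: float
) -> Optional[str]:
    """Wait for a user's answer, but never longer than `timeout_seconds`."""
    try:
        return await asyncio.wait_for(answers.get(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return None  # the caller decides which terminal status to record


async def run_session(
    sem: asyncio.Semaphore, answers: "asyncio.Queue[str]", timeout_seconds: float
) -> str:
    """Hold a capacity slot for at most `timeout_seconds` of user-input wait."""
    async with sem:
        answer = await wait_for_answer(answers, timeout_seconds)
        return "answered" if answer is not None else "timed_out"


async def _demo() -> None:
    sem = asyncio.Semaphore(1)
    abandoned: "asyncio.Queue[str]" = asyncio.Queue()
    # The abandoned session gives its slot back once the timeout expires.
    print(await run_session(sem, abandoned, timeout_seconds=0.05))
    print(sem.locked())


if __name__ == "__main__":
    asyncio.run(_demo())
```

The timeout bounds how long an abandoned session can occupy a slot; it does not change the fact that the slot is held while no compute is happening.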

The deeper issue is one of declared versus enacted scope. Formal-methods treatments of distributed locks [6, 7] make the scope of a lock a first-class property of its declaration: the lock is held over a region that has been verified not to contain an unbounded wait. In ad-hoc concurrency code, the scope is whatever the lexical structure of async with happens to enclose, and adding an await for user input changes the scope silently: there is no type, no annotation, no static check that would have caught the divergence. The semaphore became a latent deadlock because its scope was extended by a code change that did not appear, to the engineer who made it, to be touching the semaphore at all.

The principle: a concurrency limiter that holds its token across unbounded I/O waits is a latent deadlock, and the duration to outage is a function of capacity and arrival rate, not of code quality. Lock scope must match resource scope. Either the lock is released before the unbounded wait and reacquired after—which requires the engineer to declare what the lock is protecting at each side of the wait—or the wait is bounded by a timeout the lock's scope can absorb. The timeout is the immediate remediation; it bounds the damage a misused semaphore can cause. The fixpoint—the encoded contract—requires type-level or structured-concurrency enforcement that makes the misuse structurally impossible: a Semaphore wrapper type that requires an explicit release_before_wait() call at the type level, or a structured-concurrency boundary that prevents holding the token across any user-input await. A comment declaring “this semaphore must not be held across user-input waits” is documentation toward the fixpoint, not the fixpoint itself. By the standard of Chapter 5, a comment-only scope declaration is the same anti-pattern in a different archetype.

Section 06

The 6-Connection Cap: SSE Handle Identity Without Ownership

The KellerAI web client subscribes to server-sent event streams for live project updates. The browser's EventSource API exposes a connection lifecycle (CONNECTING, OPEN, CLOSED) defined in the HTML living standard. [10] Each open connection consumes one of the browser's six concurrent HTTP/1.1 sockets to a given origin. When connections are not closed, they accumulate. The accumulation has a fixpoint: the connection pool exhausts, every subsequent fetch to the same origin stalls in the queue, and the user experiences the application as frozen.

The pre-fix code in the frontend API-service module (KellerAI internal codebase) held two module-scoped data structures, projectSubscriptions and workspaceSubscription, that mapped logical subscription identifiers to EventSource instances. Vite's hot module replacement re-evaluated the module on every save during development. Each re-evaluation produced fresh empty data structures, dropping the references to the previous EventSource instances without calling .close(). The garbage collector did not help; an open EventSource is held alive by the network layer until its connection is explicitly closed. After approximately ten saves, the developer would observe fetches to localhost stalling for periods that grew with each subsequent save—empirically, roughly 126 seconds—because the browser had no socket available to send the request through.

The fix arrived as a four-commit series under ARC-446, each commit closing an edge the previous had not anticipated:

| Commit | Change | Edge Closed |
| --- | --- | --- |
| 9110ddbe | Vite HMR dispose + pagehide cleanup | Sockets on module re-evaluation and full page unload |
| 30b6e70e | Identity guard before .delete() | Stale unsubscribe callback mutating map after fresh subscribe |
| c65c347f | if (event.persisted) return inside pagehide | BFCache restoration: restored page expects connections alive |
| 6d34326f | Consolidating PR; lifecycle cleanup module | Named closeEventSource, closeProjectEventSources, et al. |

The BFCache edge is the tell. When a standard browser lifecycle event—a documented part of the platform—requires special-casing to preserve correctness, the diagnosis is that handle ownership was never explicit to begin with. The map held the handles, the handlers ran on every page transition, but no one had declared who owned a given handle's lifetime, and so every transition produced a new question about whether to close. A subscription system with explicit ownership has a single answer: close if and only if the owner is being torn down. The current code expresses that ownership through the identity guard—the handle is owned by the subscription record stored under its key, and any handler that does not hold the same identity has no business closing it.

The principle: a subscription system without identity-based teardown has no resource lifecycle, only resource churn the server cannot observe. The WebSocket protocol [8, 9] and the EventSource standard [10] both encode a lifecycle precisely because the underlying transport's resource cost is real and the client must declare when it has finished with a given connection. A client that mutates handle-holding data structures without identity-checked teardown has rejected the lifecycle and adopted a fixpoint of growing connection count. That fixpoint is bounded by the browser's pool size, which is also the bound at which the bug becomes visible.
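The identity guard is easy to state in code. The sketch below models the pattern in Python rather than in the browser-side language of the original module, so every name in it (Handle, SubscriptionRegistry, open_count) is hypothetical; it shows only the ownership rule, not the KellerAI implementation.

```python
from typing import Callable, Dict


class Handle:
    """Stand-in for an EventSource-like connection (assumption for this sketch)."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class SubscriptionRegistry:
    """Owns every handle it stores; nothing is dropped without being closed."""

    def __init__(self) -> None:
        self._handles: Dict[str, Handle] = {}

    def subscribe(self, key: str) -> Callable[[], None]:
        previous = self._handles.get(key)
        if previous is not None:
            previous.close()  # the owner closes the handle it is replacing
        handle = Handle()
        self._handles[key] = handle

        def unsubscribe() -> None:
            handle.close()
            # Identity guard: remove the entry only if it is still our handle,
            # so a stale callback cannot evict a fresh subscription.
            if self._handles.get(key) is handle:
                del self._handles[key]

        return unsubscribe

    def close_all(self) -> None:
        """Teardown hook, analogous to a module-dispose or page-unload handler."""
        for handle in self._handles.values():
            handle.close()
        self._handles.clear()

    def open_count(self) -> int:
        return sum(1 for handle in self._handles.values() if not handle.closed)


if __name__ == "__main__":
    registry = SubscriptionRegistry()
    stale = registry.subscribe("project-1")
    registry.subscribe("project-1")  # replaces and closes the first handle
    stale()                          # stale callback runs late
    print(registry.open_count())
```

The guard compares object identity, not keys: two subscriptions to the same project share a key but never share a handle.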

Section 07

Write, Then Listen: The Subscribe-First TOCTOU

The function handle_follow_up_question (KellerAI internal codebase, project API route handler) is the canonical example in the codebase of a pattern that looks correct, passes review, works under low concurrency, and fails under any load that exercises the race window. The function persists a follow-up question to the database, transitions the project status to INPUT_NEEDED, and then subscribes to the project's event stream to wait for the user's answer. The persistence call at line 617, follow_up_id = await db.create_follow_up(...), fires a NOTIFY on the project's channel. The status update at line 622 fires another. The subscription opens at line 625. Between line 617 and line 625, any FollowUpAnsweredEvent that arrives is published to a channel with no subscribers and is silently dropped. The function then enters async for event in db.subscribe_to_project(project_id) and waits for an event that already fired. Its fixpoint, under the input “user answers very fast,” is to wait forever.

This is a classic time-of-check to time-of-use race, formally enumerated as CWE-367, [12] and treated in the event-driven systems literature as a fundamental ordering hazard. [11] The pattern in handle_follow_up_question is its most recognizable form: publish before subscribe, observe the loss when timing happens to favor it. Pub/sub systems that promise ordering guarantees [13] assume a subscriber is registered at or before the publication. Postgres LISTEN/NOTIFY, the underlying mechanism here, is explicit on this point: a notification is delivered only to sessions that are listening when it is sent and is not buffered for late arrivals.

Commit 5d371b18 (ARC-477) is the record of the fix and an example of a contract made explicit after the fact. Its commit message describes the corrected sequence:

TOCTOU fix in handle_follow_up_question (api/projects.py): persist resume state FIRST, then async-with open_subscription (eager queue registration before any await), then create_follow_up + set_project_status(INPUT_NEEDED), then re-check via get_follow_up_info before entering wait loop. Closes the window where a NOTIFY can arrive between create_follow_up and queue subscription.

The fix is db.open_subscription(project_id), an async context manager that registers a queue with the SubscriptionManager (KellerAI internal codebase, database subscription layer) before any database write occurs. After the context is entered, the publisher's writes are guaranteed to fire into a registered queue. A re-check via get_follow_up_info between subscription and wait closes the remaining window: if the answer has already been recorded by the time the wait loop is entered, the function returns immediately rather than blocking on an event that will never come. This open_subscription context manager was described in the commit body of 5d371b18 but is not present in the current codebase HEAD (bbda054f); the underlying SubscriptionManager infrastructure exists (line 35 of _subscriptions.py), but the pre-fix write-then-subscribe ordering at lines 604–648 of projects.py remains in place. The pre-fix code is visible as a live example of the anti-pattern the commit intended to eliminate.

The principle: in any event-driven system, subscribe before you publish. This is not a stylistic convention. It is a correctness requirement that follows directly from the semantics of any pub/sub mechanism that does not buffer for late subscribers. Lamport's analysis of distributed event ordering [22] is the formal foundation; the principle survives every concrete mechanism that descends from it. The corollary, equally important, is to re-check the predicate after subscribing: the subscription closes the future window, and the re-check closes the past window. Together they exhaust the race.
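Both orderings can be compared against a small in-memory channel that, like LISTEN/NOTIFY, drops events when nobody is registered. This is a sketch under stated assumptions: Channel, write_then_listen, and listen_then_write are hypothetical, and open_subscription only borrows its name from the commit message; its body here is not the KellerAI SubscriptionManager.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Awaitable, Callable, Dict, List


class Channel:
    """In-memory pub/sub that does not buffer for late subscribers."""

    def __init__(self) -> None:
        self._queues: List[asyncio.Queue] = []

    def publish(self, event: str) -> None:
        # With no registered queue the event is silently dropped.
        for queue in self._queues:
            queue.put_nowait(event)

    @asynccontextmanager
    async def open_subscription(self):
        queue: asyncio.Queue = asyncio.Queue()
        self._queues.append(queue)  # registered before the caller's first write
        try:
            yield queue
        finally:
            self._queues.remove(queue)


Write = Callable[[], Awaitable[None]]


async def write_then_listen(channel: Channel, write: Write) -> str:
    """Anti-pattern: an answer published during `write` is lost."""
    await write()
    async with channel.open_subscription() as events:
        return await events.get()  # may wait for an event that already fired


async def listen_then_write(
    channel: Channel, write: Write, answers: Dict[str, str], follow_up_id: str
) -> str:
    """Subscribe first, write second, re-check the predicate, then wait."""
    async with channel.open_subscription() as events:
        await write()
        while follow_up_id not in answers:  # re-check before every wait
            await events.get()
        return answers[follow_up_id]
```

Driving both functions with a `write` callback that records an answer and publishes immediately (the "user answers very fast" input) makes the difference observable: the first ordering blocks until an external timeout fires, the second returns the recorded answer.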

Section 08

The Comment Is the Contract: Invariant Encoded in Prose

Of the five anti-patterns in this paper, the fifth is the only one with no incident, no commit hash, no postmortem, and no remediation. It is the minimal failure mode—a load-bearing invariant that exists in the system but is enforced nowhere. In the project API route handler (KellerAI internal codebase), a comment describes the protocol for handling follow-up questions: “Flow: 1. Save questions to DB, 2. Wait for FollowUpAnsweredEvent, 3. Set status back to THINKING.” The implication is the invariant: when a project's status is INPUT_NEEDED, a follow-up row must exist. The database has no CHECK constraint that enforces this. No migration adds one. No Pydantic validator checks it. No assert statement in the database layer mentions it. No code path in the API layer verifies it before reading or writing. A search across the migrations directory, the database subscription layer (KellerAI internal codebase), and the API routes returns no enforcement of any kind.

The invariant is nonetheless real and load-bearing. Downstream code reads from the follow-ups table when the project status is INPUT_NEEDED and assumes the read returns a row. A future engineer who writes a new code path that transitions a project to INPUT_NEEDED without creating a corresponding follow-up row will introduce a None into a context that has no None handler. The bug will surface only on the specific code path that exercises the violation, and the diagnosis will require reading the comment to discover what the system silently expected. The cost is paid in incident time, by a future engineer, against an invariant that exists only as prose.

This is the gap design-by-contract was articulated to close. [17, 18, 19] A precondition that is not executable is, for the system, equivalent to no precondition. The reader of the code has the contract. The runtime does not. And the runtime is the only party whose opinion matters at three in the morning when a previously unreachable code path becomes reachable. Formal FSM-completeness analyses [20, 21] treat invariants of this kind as proof obligations: every reachable state must satisfy the predicate, and every transition into the state must establish it. A comment is documentation toward the proof obligation. A constraint or assertion is the discharge of it.

The principle: a comment is documentation, and a constraint is a contract. If a state invariant has no runtime enforcer, the invariant does not exist for the system, only for the reader. The remediation is mechanical: database-level enforcement of the predicate status != 'INPUT_NEEDED' OR EXISTS (SELECT 1 FROM follow_ups WHERE project_id = projects.id AND answered_at IS NULL), an assertion at the service-layer transition into INPUT_NEEDED, or a Pydantic validator on the project's serialized form. Because the predicate spans two tables and PostgreSQL does not allow subqueries in CHECK constraints, the database-level form would have to be a constraint trigger (or a denormalized column that a plain CHECK can read) rather than a literal CHECK. Any of the three would convert the comment into a constraint. Until one of them is added, the invariant lives where every other unwritten contract in this paper has lived: in the head of the engineer who wrote it, until the engineer leaves the team and the next reader has to guess.
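To make the database-level option concrete, the sketch below enforces the predicate with a trigger in an in-memory SQLite database driven from Python's standard library. SQLite is a stand-in chosen so the example is self-contained; the platform's database would use its own trigger syntax. The table and column names follow the predicate quoted above, while the trigger name, open_db, and set_status are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE projects (
    id INTEGER PRIMARY KEY,
    status TEXT NOT NULL
);
CREATE TABLE follow_ups (
    id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES projects(id),
    answered_at TEXT
);
-- Refuse any transition into INPUT_NEEDED without an unanswered follow-up row.
CREATE TRIGGER input_needed_requires_follow_up
BEFORE UPDATE OF status ON projects
FOR EACH ROW
BEGIN
    SELECT RAISE(ABORT, 'INPUT_NEEDED requires an unanswered follow-up row')
    WHERE NEW.status = 'INPUT_NEEDED'
      AND NOT EXISTS (
          SELECT 1 FROM follow_ups
          WHERE project_id = NEW.id AND answered_at IS NULL
      );
END;
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the invariant installed."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def set_status(conn: sqlite3.Connection, project_id: int, status: str) -> bool:
    """Return True if the database accepted the transition, False if it refused."""
    try:
        conn.execute(
            "UPDATE projects SET status = ? WHERE id = ?", (status, project_id)
        )
        return True
    except sqlite3.DatabaseError:
        # Trigger aborts surface as a DatabaseError subclass.
        return False
```

This sketch guards only updates to the status column; a complete version would also cover rows inserted directly in INPUT_NEEDED and follow-up rows deleted or answered while the project is still waiting.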

Section 09

The Five Principles

Each chapter ends in a rule verifiable at code-review time. Stated together they form the core prescriptive content of this paper.

  1. Background task and status share an invariant. Task exit if and only if status is terminal. If your code spawns a task via asyncio.create_task and the task’s outcome is reflected in a status field, no path through the task may exit without writing to that field. The catch-all is the encoded form; structured-concurrency boundaries are the stronger form.
  2. Lock scope equals resource scope. A concurrency token may not be held across an unbounded I/O wait. Either the wait is bounded by a timeout the token’s contention envelope absorbs, or the token is released before the wait and reacquired after. The change in scope must be visible at the call site.
  3. Subscriptions have identity-based ownership. Every subscription handle has an owner. The owner closes it. Code that mutates a handle-holding data structure without identity-checking the handle it is about to release has no ownership model and will accumulate handles until a system bound surfaces the leak.
  4. Subscribe before you publish. In any event-driven system whose pub/sub mechanism does not buffer for late subscribers, the subscriber must be registered before the writer issues the publication, and the predicate must be re-checked after registration to exhaust the prior window.
  5. Comments are documentation; constraints are contracts. A state invariant with no runtime enforcer does not exist for the runtime. If the invariant is load-bearing, encode it as a database constraint, a Pydantic validator, an assertion at the transition site, or a CI test that fails on its violation.

The principles are not theorems—they admit explicit, declared exceptions. A lock may be held across a bounded user-input wait if the operator has accepted the contention cost in writing. A subscription may be unowned if its handle's lifetime is provably shorter than the next garbage-collection cycle of the underlying transport. The principles fail correctly: they make the exception visible at code-review time, and they shift the burden of proof from “why should we encode this?” to “why are we choosing not to encode this?”
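The identity check in the third principle can be illustrated with a small registry. This is a sketch under stated assumptions: Handle and SubscriptionRegistry are hypothetical names, and Handle stands in for whatever the real transport object is (an EventSource, a socket, a queue).

```python
from typing import Dict


class Handle:
    """Stand-in for a subscription handle such as an EventSource or socket."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.closed = False

    def close(self) -> None:
        self.closed = True


class SubscriptionRegistry:
    """Owns every handle it stores and is the only code that closes them."""

    def __init__(self) -> None:
        self._by_channel: Dict[str, Handle] = {}

    def attach(self, channel: str, handle: Handle) -> None:
        # Replacing an entry closes the previous handle first,
        # so the overwrite cannot orphan it.
        previous = self._by_channel.get(channel)
        if previous is not None and previous is not handle:
            previous.close()
        self._by_channel[channel] = handle

    def detach(self, channel: str, handle: Handle) -> bool:
        # Identity check: a stale teardown callback must not
        # unregister a newer handle stored under the same channel.
        current = self._by_channel.get(channel)
        if current is not handle:
            handle.close()  # the stale handle is closed, but nothing is unregistered
            return False
        del self._by_channel[channel]
        handle.close()
        return True

    def open_count(self) -> int:
        return sum(1 for h in self._by_channel.values() if not h.closed)
```

The comparison uses `is` rather than equality on the channel key, because the failure mode is two distinct handles sharing one key.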

Section 10

The Fixpoint Taxonomy

The five anti-patterns sort into a taxonomy by failure mode.

| Archetype | Resource | Missing Contract | Fixpoint Under Failure | Surface Signal |
| --- | --- | --- | --- | --- |
| Unsupervised task | Database row | Terminal-state guarantee | Row stays in non-terminal state | Stranded rows accumulate over time |
| Unscoped lock | Concurrency token | Scope-equals-resource bound | Token held across user wait | Capacity saturation; silent queue |
| Unowned handle | Network socket / EventSource | Identity-based teardown | Handle never closed | Connection pool exhaustion |
| Unordered subscription | Event channel | Subscribe-before-publish ordering | Wait for already-fired event | Indefinite hang under load |
| Unenforced invariant | State predicate | Runtime check | Reachable state violates predicate | None in downstream consumer |

The taxonomy is useful at code-review time. For every new await that touches a resource, ask which archetype the resource belongs to. For every new state transition, ask which predicate must hold and where it is enforced. For every new subscription, ask who owns it and when they close it. For every new task creation, ask what terminal state is guaranteed on every exit path. The questions are mechanical. The answers, written down, are the contracts.

The taxonomy also predicts where future incidents will originate in any codebase that resembles ours. A new async feature that introduces a variant of the unscoped-lock archetype (say, a long-lived database connection held while waiting on a Redis lock) will fail the same way the semaphore failed, for the same reason, with the same shape of postmortem. The cost of identifying the archetype at design time is one paragraph in a design document. The cost of identifying it after production has surfaced it is the incident, the hotfix, the runbook, the observability backfill, and the four engineers who each have a different war story.

Section 11

Engineering the Named Invariant

None of these techniques are novel; most have been in the literature for decades. The claim is narrower: adopting them at the cost of incremental friction at design time reduces the length and frequency of the sagas described in §3—as the incident record above shows, though only for this codebase.

Type-state patterns

A handle whose API methods are typed by lifecycle state (Connecting, Open, Closed) forces ownership transitions to be syntactically visible. In a statically typed language, the compiler refuses to call .send() on a Closed handle. TypeScript's discriminated unions and Python's typing.Protocol give weaker variants of the same property, enforced by a static type checker rather than by the runtime. The cost is one type definition; the benefit is that the third anti-pattern (unowned handle) is rejected before the code ships.
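A minimal Python sketch of the idea uses one class per lifecycle state, so that the Closed state simply has no send method. The class names are hypothetical and mirror the three states named above.

```python
from typing import List


class ClosedHandle:
    """Terminal state: deliberately exposes no send() or close()."""


class OpenHandle:
    def __init__(self) -> None:
        self.sent: List[str] = []

    def send(self, message: str) -> None:
        self.sent.append(message)

    def close(self) -> ClosedHandle:
        # Closing hands back the terminal state instead of mutating a flag.
        return ClosedHandle()


class ConnectingHandle:
    def open(self) -> OpenHandle:
        # Only a connecting handle can produce an open one.
        return OpenHandle()


handle = ConnectingHandle().open()
handle.send("hello")
closed = handle.close()
# closed.send("late") would be flagged by a static type checker such as mypy
# and would raise AttributeError at runtime.
```

Python has no linear types, so nothing stops a caller from keeping the old `handle` variable and calling send on it after close. The sketch makes misuse of the returned state visible to a type checker; it does not make misuse impossible.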

Structured concurrency

A task group, nursery, or asyncio.TaskGroup enforces that no child task escapes its parent's lifetime. The parent observes every child's exit. A task that fails to update its status field can no longer drop the failure on the floor—the failure propagates to the parent, which is now responsible for the terminal-state update. The cost is one wrapper around asyncio.create_task; the benefit is that the first anti-pattern (unsupervised task) becomes structurally absent.

Database constraints

Every load-bearing state invariant has a representation in the schema. CHECK constraints, NOT NULL columns, foreign keys with ON DELETE RESTRICT, partial indexes that enforce uniqueness over subsets of rows—each is a runtime enforcer that runs on every write. The cost is one migration; the benefit is that the fifth anti-pattern (unenforced invariant) becomes a constraint violation at insert time, surfaced to the application as a clear error rather than a silent state divergence.
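The effect of a schema-level enforcer can be shown with the standard-library sqlite3 module standing in for Postgres. The table layout, the status values other than INITIALIZED and INPUT_NEEDED, and the completed_at column are assumptions made for illustration; the real schema may differ.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE projects (
        id           INTEGER PRIMARY KEY,
        status       TEXT NOT NULL
                     CHECK (status IN ('INITIALIZED', 'RUNNING', 'INPUT_NEEDED',
                                       'COMPLETED', 'FAILED')),
        completed_at TEXT,
        -- A COMPLETED row must carry a completion timestamp.
        CHECK (status != 'COMPLETED' OR completed_at IS NOT NULL)
    )
    """
)


def try_insert(status, completed_at):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO projects (status, completed_at) VALUES (?, ?)",
                (status, completed_at),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

A row that claims COMPLETED without a timestamp, an unknown status, or a NULL status is rejected on write, regardless of which code path attempted it.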

Timeout by default

Any await that crosses a process boundary, a user interaction, or a third-party service is wrapped in asyncio.wait_for with an explicit timeout. The default timeout is short enough that the surrounding lock's contention envelope absorbs it. A code review that finds an unbounded await is a code review that has found the second anti-pattern; the remediation is mechanical.
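A sketch of the bounded wait combined with releasing the token first. The function names, the TIMED_OUT sentinel, and the 0.05-second default are illustrative assumptions; only asyncio.wait_for and asyncio.Semaphore are real library APIs.

```python
import asyncio


async def wait_for_user_input(reply: "asyncio.Future[str]", timeout_s: float) -> str:
    """Bound a wait on an external party; on expiry return a sentinel, never hang."""
    try:
        return await asyncio.wait_for(reply, timeout=timeout_s)
    except asyncio.TimeoutError:
        return "TIMED_OUT"


async def handle_request(slots: asyncio.Semaphore, reply, timeout_s: float = 0.05) -> str:
    async with slots:
        # Work that genuinely needs the concurrency token goes here.
        await asyncio.sleep(0)
    # The token is released before the user-facing wait, and the wait is bounded.
    return await wait_for_user_input(reply, timeout_s)


async def demo():
    slots = asyncio.Semaphore(1)
    loop = asyncio.get_running_loop()
    never = loop.create_future()      # a reply that never arrives
    answered = loop.create_future()
    answered.set_result("yes")        # a reply that is already available
    return await handle_request(slots, never), await handle_request(slots, answered)
```

With a single-slot semaphore, the second request still proceeds after the first one times out, because the slot was never held across the wait.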

Eager subscription

Every event-driven coordination opens its subscription as an async context manager before issuing the publish. The pattern can be enforced by code-review checklist or, more strongly, by API design that does not expose a publish primitive separable from a subscription handle.
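The ordering can be sketched against a toy in-memory bus that, like the mechanisms discussed here, does not buffer for late subscribers. Bus, request_and_wait, the job.done topic, and the state dictionary are all hypothetical names for illustration.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, Callable, Dict, Set


class Bus:
    """Hypothetical non-buffering pub/sub: late subscribers miss earlier events."""

    def __init__(self) -> None:
        self._subs: Dict[str, Set[asyncio.Queue]] = {}

    @asynccontextmanager
    async def subscribe(self, topic: str) -> AsyncIterator[asyncio.Queue]:
        queue: asyncio.Queue = asyncio.Queue()
        self._subs.setdefault(topic, set()).add(queue)
        try:
            yield queue
        finally:
            # The context manager owns teardown of the subscription.
            self._subs[topic].discard(queue)

    def publish(self, topic: str, event: str) -> int:
        listeners = self._subs.get(topic, set())
        for queue in listeners:
            queue.put_nowait(event)
        return len(listeners)


async def request_and_wait(bus: Bus, state: Dict[str, str],
                           start_job: Callable[[], None]) -> str:
    async with bus.subscribe("job.done") as events:  # 1. register first
        if state.get("status") == "DONE":            # 2. re-check the predicate
            return "already done"
        start_job()                                  # 3. only now trigger the publish
        return await asyncio.wait_for(events.get(), timeout=1.0)


async def demo():
    bus, state = Bus(), {}

    def start_job() -> None:
        state["status"] = "DONE"
        bus.publish("job.done", "done")

    first = await request_and_wait(bus, state, start_job)
    second = await request_and_wait(bus, state, start_job)
    return first, second, bus.publish("job.done", "late")
```

Reversing steps 1 and 3 in request_and_wait would publish to zero listeners and then wait on an event that has already fired, which is the fourth archetype.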

Invariant tests

For every documented invariant that cannot be encoded as a database constraint, a CI test traverses representative state transitions and asserts the predicate. The test fails when a new code path violates the predicate. The cost is the test; the benefit is that the comment is no longer the only enforcer.
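One possible shape for such a test checks a declared transition table rather than live rows. The table below is hypothetical (in practice it would be imported from the service layer), and the two predicates, that every state can reach a terminal state and that terminal states have no outgoing edges, are examples of invariants a team might choose to assert.

```python
from typing import Dict, Set

# Hypothetical transition table; the real one would be imported from the service layer.
TRANSITIONS: Dict[str, Set[str]] = {
    "INITIALIZED": {"RUNNING", "FAILED"},
    "RUNNING": {"INPUT_NEEDED", "COMPLETED", "FAILED"},
    "INPUT_NEEDED": {"RUNNING", "FAILED"},
    "COMPLETED": set(),
    "FAILED": set(),
}
TERMINAL = {"COMPLETED", "FAILED"}


def can_reach_terminal(state: str, table: Dict[str, Set[str]]) -> bool:
    """Depth-first search: is any terminal state reachable from `state`?"""
    seen, stack = set(), [state]
    while stack:
        current = stack.pop()
        if current in TERMINAL:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(table.get(current, ()))
    return False


def test_every_state_can_terminate() -> None:
    for state in TRANSITIONS:
        assert can_reach_terminal(state, TRANSITIONS), f"{state} can strand a row"


def test_terminal_states_are_absorbing() -> None:
    for state in TERMINAL:
        assert TRANSITIONS[state] == set(), f"{state} must have no outgoing edges"
```

Adding a state with no path to COMPLETED or FAILED makes the first test fail in CI, before any row can be stranded in it.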

Together, these techniques move the cost of contract-writing from runtime—where operators, on-call engineers, and customers pay it—to design time, where the engineer making the change has the context to write the contract correctly.

Section 12

Counter-Arguments

Production systems surface bugs; engineers fix them; the codebase improves over time. Why frame ordinary corrective work as evidence of compound debt?

The answer is in the saga shapes. A bug that surfaces once, is diagnosed once, and is fixed once is ordinary debugging. A bug that surfaces in four distinct edge cases over thirty-six hours, each edge case revealed only by production behavior, is the discovery, one fragment at a time, of a contract that was never written down. The four-commit ARC-446 series is the shape that matters. The single commit that would have prevented all four was the one that declared subscription ownership at the time the subscription system was designed. That commit was not written, and so the four commits that did get written were the price of its absence.

A weaker form of the same objection is that no codebase fully encodes its invariants and demanding it is impractical. The argument of this paper is not that every invariant must be encoded. It is that every load-bearing invariant—every invariant whose violation produces a customer-visible failure, a stranded row, or a 93-minute deadlock—has a cost calculus that strongly favors encoding it at design time, and the saga record is the empirical evidence of that calculus. The principles in §9 are filters: they identify, at code-review time, which invariants are load-bearing enough to warrant the encoding cost.

A third objection is that the principles trade engineering velocity for excessive ceremony. Friction at design time delays delivery; compound interest is only a metaphor, and the real cost of the debt may be smaller than the friction. The DORA-correlated technical-debt measurement literature [14] disputes the empirical claim; the saga record in our own codebase disputes it more concretely. For any codebase you maintain, compute the cumulative hours spent on any one saga of the shape described in §3, and compare that with the friction of writing the original contract, which is one paragraph.

A fourth objection, and perhaps the most honest one, is that the principles require design-time knowledge engineers often do not have. To know that a semaphore must not be held across a user-input wait, an engineer must anticipate, at the time they write the semaphore, that a user-input wait will be added later. In our own codebase, the ARC-485 deadlock occurred not because engineers were careless but because the user-input wait was added in a separate feature from the semaphore, and the engineer adding the wait had no reason to know it fell inside the semaphore's scope. The principles are most useful not as design-time precautions but as code-review triggers: the engineer reviewing the addition of an await in a semaphore-guarded region is the one who should ask whether the scope bound is being preserved. The cost is a code-review question; the savings are the incident and its saga.

Section 13

Limitations

This is an n=1 case study. The codebase studied is a single production platform—KellerAI, a Python/TypeScript backend with asyncio, FastAPI, Postgres, and a React/Vite frontend. The anti-patterns we identify are common to the language and runtime choices we made; readers using Go, Rust, or the BEAM may find the underlying archetypes reproduce but the surface forms differ.

We have not measured the cost of adopting the principles at scale. The recommendations in §11 carry their own friction, and we offer no controlled comparison against a codebase that did not adopt them. The argument is grounded in saga shapes from a single record, and the generalization to other teams and other codebases is a hypothesis the reader is invited to test against their own incident history.

We have not addressed concurrency models outside cooperative async (asyncio). Preemptive thread-based concurrency, actor models, and channel-based concurrency each have their own variants of the same archetypes; we suspect the principles port directly but do not show it. A follow-up paper applying the taxonomy to Erlang OTP or Rust's tokio runtime would strengthen the claim.

Finally, the paper studies failure modes, not successes. A codebase that adopted these principles from inception would generate no sagas to study, and the absence of evidence would, on this paper's framing, be the evidence of the principles' utility—but it would also be invisible to the record-based methodology we have used. A prospective study of a team adopting the principles, with incident frequency measured against a control period, would close the empirical gap this paper leaves open.

References
  1. Mödersheim, S., Vieira, B., et al. “Automating the Analysis and Improvement of Dynamic Datalog Programs.” Technical Report, Johns Hopkins University, 2023. doi:10.1145/3763129.
  2. Möller, A., and Schwartzbach, M. I. “Static Program Analysis.” Lecture Notes, Aarhus University, 2023. https://cs.au.dk/~amoeller/spa/spa.pdf.
  3. Armstrong, J. “Making Reliable Distributed Systems in the Presence of Software Errors.” Ph.D. Dissertation, Royal Institute of Technology (KTH), December 2003. https://erlang.org/download/armstrong_thesis_2003.pdf.
  4. Trio Project Contributors. “Trio: Async I/O for Humans and Machines.” Trio 0.32.0 Documentation, 2024. https://trio.readthedocs.io/en/stable/reference-core.html.
  5. Beazley, D., et al. “Introducing Quattro: Simple Structured Concurrency for asyncio.” Python Async-SIG Discussion, 2024. https://discuss.python.org/t/introducing-quattro-simple-structured-concurrency-for-asyncio/13160.
  6. Desai, A., and Behera, B. “Formal Methods for Distributed Systems.” AWS re:Invent 2023.
  7. Vieira, B., et al. “An Empirical Study on the Correctness of Formally Verified Distributed Systems Implementations.” Distributed Systems Workshop, 2023. https://homes.cs.washington.edu/~arvind/papers/dsbugs.pdf.
  8. IETF Hypertext Transfer Protocol Working Group. “The WebSocket Protocol.” RFC 6455, December 2011.
  9. Fette, I., and Melnikov, A. (Eds.). “The WebSocket API.” WHATWG Living Standard, 2011–. https://websockets.spec.whatwg.org.
  10. WHATWG. “9.2 Server-Sent Events.” HTML Living Standard. https://html.spec.whatwg.org/multipage/server-sent-events.html.
  11. Zalewski, M. “What Is Time-of-Check Time-of-Use (TOCTOU)?” DeepStrike, 2024.
  12. MITRE Corporation. “CWE-367: Time-of-check Time-of-use (TOCTOU) Race Condition.” Common Weakness Enumeration, 2024. https://cwe.mitre.org/data/definitions/367.html.
  13. DEV Community. “Ensuring Message Order in Distributed Systems: Addressing Pub/Sub Ordering Key Limitations.” 2024.
  14. Bowe, J., et al. “Technical Debt Quantification and Its Impact on Software Delivery Performance.” American Impact Review, 2024.
  15. Fowler, M. “Technical Debt.” https://martinfowler.com/bliki/TechnicalDebt.html.
  16. Biswas, D., et al. “23 Shades of Self-Admitted Technical Debt: An Empirical Study on Machine Learning Software.” ESEC/FSE ’22, Singapore, November 2022. doi:10.1145/3510003.3510052.
  17. Hoare, C. A. R. “An Axiomatic Basis for Computer Programming.” Communications of the ACM 12, no. 10 (October 1969): 576–580.
  18. Meyer, B. “Applying ‘Design by Contract’.” IEEE Computer 25, no. 10 (October 1992): 40–51.
  19. Meyer, B. Object-Oriented Software Construction. 2nd ed. Prentice Hall, 1997.
  20. Petrenko, A., and Yevtushenko, N. “Testing Timed Nondeterministic Finite State Machines with the Tk-Method.” IEEE Transactions on Software Engineering, 2017.
  21. Ionescu, C., and Rotaru, M. “Using an SMT Solver for Checking the Completeness of FSM-Based Test Suites.” INRIA Preprints, 2021.
  22. Lamport, L. “Time, Clocks, and the Ordering of Events in a Distributed System.” Communications of the ACM 21, no. 7 (July 1978): 558–565.