Introduction — the bill always comes
Applying standard software-engineering practices to AI-engineering problems without a firm grasp of both will only accelerate tech debt — while the LLM praises your clean, “enterprise-grade” code. A false sense of security takes hold. Architectural mistakes the engineer would have caught in a non-AI codebase pile up in the blind spot of their new-found “productivity.” The bill always comes.
This paper is about what that bill looks like inside one real codebase, on two well-reviewed feature branches, written by competent engineers who shipped competent-looking code. The findings are not that the engineering is bad. The findings are that the engineering looks normal — recognizably idiomatic SQL, recognizably idiomatic Python, recognizably good service-layer hygiene — and that the normalcy is the problem. Standard practices were applied without a firm grasp of the second discipline in the room: how an LLM behaves as a runtime actor inside the system. The result is a class of latent defects that the LLM, asked to review its own PR diff, would not flag.
We focus on five joints in the architecture where the LLM and the rest of the system meet. At each joint, the code is locally reasonable. At each joint, the failure mode is not theoretical — it is a specific scenario, with named callers, named tables, and a named recovery path that does not exist. The joints are:
1. Durability under SIGKILL: what happens to an in-flight LLM revision when the pod dies.
2. Audit-trail completeness: what the database actually remembers about which actor moved a thread through which state.
3. Cost metering: what the system can tell you, before you get the Anthropic invoice, about how much an LLM run cost.
4. Model provenance in caches: what happens to your compliance findings when Anthropic ships a new model snapshot.
5. Transactional semantics vs. ad-hoc state machines: whether the locks and contracts you wrote actually hold the invariants you think they hold.
The fifth joint is the headline. It contains a single piece of evidence that frames the entire paper: the author of one branch correctly used the claude-agent-sdk's native interrupt primitive (can_use_tool) for one path in one file 1 , and chose not to use it for two adjacent paths in the same file, implementing instead a JSON-to-file write-and-poll loop wrapped in up to three full retries of a fifty-turn subprocess 2 . The author knew and chose the file-poll contract anyway. The PR comments record some of the deferrals openly 3 ; others are not discussed at all. This paper takes that evidence at face value.
A prior internal review of these same branches enumerated fourteen patterns and called them production wisdom 4 . The review was not wrong about what the code does. The review was over-charitable about what the code guarantees, and the gap between “does” and “guarantees” is the whole subject of this paper. Where it helps the reader, we will quote the prior review's framing alongside the corrected reading. The point is not to embarrass anyone. The point is to show how easily a clean-looking diff can hide a structural absence that a non-AI codebase would have surfaced immediately.
The praise that should have been a warning
Before we walk the five joints, a brief calibration on how this codebase was originally read.
The internal review described the cache key on the drift findings table — (workspace_id, codebase_id, pack_id, pack_version, commit_sha) 5 — as a “clean, teachable resolution” to the problem of caching a non-pure function 6 . The framing was: key on the full set of inputs, accept that the cached value is now the canonical answer for that key, and a rubric change will invalidate naturally because pack_version is in the key. The review then ranked the pattern in its top three most instructive 7 .
That framing is locally true. It is also wrong in a way that only becomes visible when the reader holds the complete set of inputs in mind. The cached value is the output of an LLM call. The cache key lists every input the codebase controls — workspace, codebase, pack, commit — and omits the single most important input nobody on the team owns: the model. When Anthropic ships a new snapshot — which they do regularly, and which they did in February — every cached verdict produced by the prior snapshot continues to be served as authoritative on the next request, indistinguishable from a fresh verdict. A compliance finding that says “your codebase fails obligation X-3.2” and that was produced by a deprecated model is now a liability — not because the verdict is necessarily wrong, but because nobody can tell.
We will return to this in Joint 4. The point here is methodological. A senior reviewer, reading carefully, saw a clean pattern and praised it. The pattern was clean for the failure modes the reviewer was scanning for. It was unclean for a failure mode the reviewer was not scanning for, because the reviewer was scanning for cache-correctness problems that an LLM-less system would have, not for cache-correctness problems that only an LLM-integrated system has. The first habit this paper is trying to break is the habit of declaring code production-ready on the strength of how it looks against the patterns of a discipline that does not yet know what it is missing.
The same diagnosis applies to most of the fourteen patterns in the original review. The output contract enforced by Pydantic 8 is a reasonable defense against malformed LLM output — and the original review correctly named it as such — but it is enforced after the subprocess has run to completion, after up to three retries of fifty turns each, with no use of the SDK's native interrupt primitive that would have made the contract enforceable mid-flight at a fraction of the cost. The conditional-UPDATE thread mutex 9 does serialize access to a thread within a single Postgres connection at read-committed — and the original review correctly named that property — but it does not serialize access across an LLM-processing window that lasts seconds to minutes, and the original review did not name that absence. Each pattern is correctly identified at one level of analysis and wrong at the next. The job of the rest of this paper is to do that next level.
Joint 1 — Durability under SIGKILL
The first joint is the simplest to state. A revision run begins when a user @-mentions Kai with a change request. The orchestrator flips the project to THINKING , snapshots every affected document, materializes them as files in a temp directory, calls Kai, reads the modified files back, writes the new content to the database, and reconciles comment-thread anchor positions against the diff 10 .
The happy path is competent. The interesting question is what happens on the unhappy path.
The unhappy path is documented in the code as asyncio.CancelledError 11 . When the task is cancelled — and only when it is cancelled — the orchestrator iterates through an in-memory dictionary called doc_contents and restores each document to its pre-revision content 12 . doc_contents is populated before the LLM run begins 13 , so the buffer always contains something to restore. The handler then calls abort_project_revision(run_id, AbortOutcome.CANCELLED) to record the terminal state.
The original review called this “a proper transactional model” and “what a production revision system requires” 14 . It is neither. It is a Python-level imitation of a transactional model that works exclusively for the failure mode the Python runtime can observe.
The cases the runtime cannot observe are the cases that matter in production. When a Kubernetes pod is OOM-killed, the kernel sends SIGKILL and the process is terminated without unwinding the asyncio event loop. The CancelledError handler does not fire. The doc_contents dictionary, which lives in the process's heap, is freed by the kernel along with the rest of the address space. There is no restore. The database is left with whatever state the orchestrator wrote before the kill — which, depending on timing, may include some documents already updated to their new content and others still pending, with the project status still pinned at THINKING .
When the replacement pod comes up, nothing reaps the orphaned run. The sibling drift branch does have a startup reaper for orphaned drift runs ( reap_orphans , called at orchestrator init 15 ). The revision branch has no equivalent. A pod crash in the middle of a Kai revision leaves the project in THINKING forever, the UI blocked, the comment threads in revising derived state, and a partial mutation in the document table with no marker indicating that the run died.
The author of the PR did not pretend otherwise. A reviewer asked about pod-crash recovery and the PR author responded, verbatim:
Added a 5-minute default timeout… Doesn't cover pod-crash recovery (thread stuck in thinking when the process dies) — that's a separate problem we've deferred. 16
This is an honest deferral. It is also the kind of deferral that, in a non-AI service, every engineer on the team would treat as a release blocker. A payment service that could leave a transaction in THINKING forever on a pod kill would not ship. A document service that does the same — but with an LLM in the loop — shipped, was reviewed favorably, and was held up as instructive. The standard practice — try/except, in-memory rollback buffer, finally block — looks the same in both services. In one service, the standard practice is sufficient. In the other, it is theater.
The production-grade version is not exotic. PostgreSQL has supported savepoints since version 8.0, released in 2005. The orchestrator could have opened a transaction at the start of the run, taken SAVEPOINT before_revision before any mutation, performed all document updates inside that transaction, and on the normal completion path issued RELEASE SAVEPOINT and COMMIT. On the cancellation path it would ROLLBACK TO SAVEPOINT before_revision and COMMIT (committing the run row as cancelled, releasing the project lock). On a pod kill, the connection would simply die. The Postgres backend would observe the disconnect and roll back the in-progress transaction automatically. The database, the durable substrate, would never see the partial mutation, because the partial mutation would never have committed.
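The savepoint flow described above can be sketched in a few lines. This is a minimal illustration, not the codebase's orchestrator: it uses the standard-library sqlite3 module as a stand-in for PostgreSQL so that it runs without a server, and the documents and runs tables, the apply_revision function, and the RevisionCancelled exception are all hypothetical names. A real implementation would also have to weigh the cost of holding a transaction open for the length of an LLM run.

```python
import sqlite3


class RevisionCancelled(Exception):
    """Stand-in for asyncio.CancelledError in this synchronous sketch."""


def make_revision_demo_db():
    """Build an in-memory database with two documents and one run (hypothetical schema)."""
    conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit BEGIN/COMMIT
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, content TEXT)")
    conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany("INSERT INTO documents VALUES (?, ?)", [(1, "a"), (2, "b")])
    conn.execute("INSERT INTO runs VALUES (1, 'RUNNING')")
    return conn


def apply_revision(conn, run_id, new_contents, cancel_before=None):
    """Apply all document updates inside one transaction guarded by a savepoint.

    cancel_before simulates a cancellation arriving before the N-th update.
    """
    conn.execute("BEGIN")
    conn.execute("SAVEPOINT before_revision")
    try:
        for index, (doc_id, content) in enumerate(new_contents.items()):
            if cancel_before is not None and index == cancel_before:
                raise RevisionCancelled()
            conn.execute(
                "UPDATE documents SET content = ? WHERE id = ?", (content, doc_id)
            )
    except RevisionCancelled:
        # Undo every document write, then commit only the terminal run state.
        conn.execute("ROLLBACK TO SAVEPOINT before_revision")
        conn.execute("UPDATE runs SET status = 'CANCELLED' WHERE id = ?", (run_id,))
        conn.execute("COMMIT")
        return "CANCELLED"
    conn.execute("RELEASE SAVEPOINT before_revision")
    conn.execute("UPDATE runs SET status = 'COMPLETED' WHERE id = ?", (run_id,))
    conn.execute("COMMIT")
    return "COMPLETED"
```

If the process dies anywhere between BEGIN and COMMIT, nothing in this function needs to run for the database to stay consistent: the uncommitted writes are discarded by the database itself, which is the property the in-memory buffer cannot provide.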
A startup reaper would then handle the project-status flip. The same pattern as reap_orphans : on orchestrator startup, scan for projects in THINKING whose associated run row is RUNNING and whose worker heartbeat is older than the kill threshold; transition them to PENDING_REVIEW and mark the run as FAILED with reason worker_died . This is not novel infrastructure. It is the same defensive pattern that exists for drift runs, in the same codebase, in the next directory over. The asymmetry is the diagnosis: the engineer who built the drift orchestrator wrote the reaper; the engineer who built the revision orchestrator did not, and nobody in PR review noticed that the equivalent failure mode existed.
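A reaper of this shape might look like the following sketch. It is not the existing reap_orphans function, whose implementation is not shown here. The projects and runs tables, the heartbeat_at and failure_reason columns, and the function name are assumptions made for illustration, and sqlite3 again stands in for PostgreSQL so the example is runnable.

```python
import sqlite3


def reap_orphaned_revisions(conn, now, kill_threshold_seconds):
    """Fail RUNNING runs with a stale heartbeat and unblock their projects.

    conn must be opened with isolation_level=None so BEGIN/COMMIT are explicit.
    Returns the ids of the runs that were reaped.
    """
    cutoff = now - kill_threshold_seconds
    conn.execute("BEGIN")
    stale = conn.execute(
        "SELECT r.id, r.project_id FROM runs r "
        "JOIN projects p ON p.id = r.project_id "
        "WHERE p.status = 'THINKING' AND r.status = 'RUNNING' "
        "AND r.heartbeat_at < ?",
        (cutoff,),
    ).fetchall()
    for run_id, project_id in stale:
        # Record why the run ended, then release the project for review.
        conn.execute(
            "UPDATE runs SET status = 'FAILED', failure_reason = 'worker_died' "
            "WHERE id = ?",
            (run_id,),
        )
        conn.execute(
            "UPDATE projects SET status = 'PENDING_REVIEW' WHERE id = ?",
            (project_id,),
        )
    conn.execute("COMMIT")
    return [run_id for run_id, _ in stale]
```

Called once at orchestrator startup with the current time and a threshold longer than the run timeout, a function like this leaves healthy runs alone and converts every stuck project into a visible, recorded failure.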
The in-memory doc_contents rollback buffer is not wrong. It is insufficient. The mistake is calling it a transactional model. A transactional model is a property of the durable substrate, not a property of a Python coroutine. When the standard practice (“wrap mutations in try/except, restore from a buffer on cancel”) is applied without an understanding of the new failure mode the LLM introduces (“subprocess so slow that a 5-minute timeout is reasonable, in a runtime that gets SIGKILL 'd routinely”), the result is code that looks transactional and is not.
Joint 2 — Audit-trail completeness
The second joint is structural. A comment thread in this system moves through states: open , thinking , resolved , plus a derived state revising that is enriched at read time when a thread's comments belong to an active changelist 17 . The transitions are driven by three SQL statements: set_thread_thinking flips open → thinking , complete_thread_thinking flips thinking → open , and add_document_comment blocks writes against any thread that is not open unless the writer is Kai 18 .
There is no thread status history table.
The current status is overwritten in place every time it changes. A reader looking at the database three minutes after a Kai mention sees status = 'open' and has no way, from the data, to know that the thread was ever thinking , when it transitioned, which Kai run handled the mention, or whether it was a Kai run at all. The information exists — for a while — in the application's structured log stream and in the SSE event stream 19 — but those are operational telemetry, not durable system-of-record. Once log retention rotates, the only evidence that an LLM ever touched the thread is the comment Kai posted at the end.
The original review called the conditional-UPDATE mutex “one of the most instructive patterns in the branch” because it “makes the concurrency problem visible as a state machine” 20 . The state machine is visible at the point of transition. It is invisible afterward. A state machine you cannot replay is not a state machine; it is a switch. A switch is fine in a system where nobody will ever ask “when did this last flip and who flipped it.” A system where one of the actors is a non-deterministic, billable, fallible subprocess — and where users will ask, in good faith, “did Kai look at this?” — needs something stronger.
The exemption pattern compounds the problem. Kai is identified throughout the schema by the absence of a value: author_user_id IS NULL means Kai 21 . This is a magic NULL check, not a type invariant. There is no actor_kind enum, no foreign key to a service_accounts table, no row in any table that declares “Kai is an entity, here is its ID.” A row whose author_user_id is NULL is a Kai row, by convention, by code. The first time someone runs a data-cleanup migration and accidentally inserts a row with a missing author_user_id , that row becomes indistinguishable from a Kai post. The first time another agent (Mira, Anton, anyone) is added to the system, the schema cannot represent the distinction. There is no version of this schema in which “show me everything Kai did last Tuesday” is a SQL query against integrity-constrained columns; it is a heuristic.
A financial system cannot trust that “Kai processed this thread” is recoverable from the data. It is recoverable only from process logs. Process logs that may have rotated. Process logs that, in any system with regulatory exposure, are not where audit trails live.
The production-grade version is mechanical. Every status mutation writes one row to a thread_status_history table with the shape:
thread_status_history: audit table schema (SQL)

    CREATE TABLE thread_status_history (
        id             BIGSERIAL PRIMARY KEY,
        thread_id      UUID NOT NULL REFERENCES document_comment_threads(id),
        prior_status   TEXT NOT NULL,
        new_status     TEXT NOT NULL,
        actor_kind     TEXT NOT NULL,   -- 'user' | 'kai' | 'system'
        actor_id       UUID,            -- user_id when actor_kind = 'user'
        run_id         UUID,            -- revision_run.id when triggered by a run
        transition_ts  TIMESTAMPTZ NOT NULL DEFAULT now(),
        CHECK (
            (actor_kind = 'user' AND actor_id IS NOT NULL) OR
            (actor_kind IN ('kai', 'system') AND actor_id IS NULL)
        )
    );

The mutation path becomes a two-statement transaction: update the row, insert the history. The audit query becomes a SELECT against thread_status_history ordered by transition_ts. The “did Kai do this” question becomes WHERE actor_kind = 'kai'. The “when did this thread last transition out of thinking” question becomes MAX(transition_ts) WHERE new_status = 'open' AND prior_status = 'thinking'.
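The two-statement mutation path can be sketched as follows. This is an illustration under stated assumptions, not the codebase's data layer: sqlite3 stands in for PostgreSQL, the threads table is a simplified stand-in for document_comment_threads, identifiers are plain text rather than UUIDs, and transition_thread is a hypothetical helper name.

```python
import sqlite3


def make_audit_demo_db():
    """In-memory schema mirroring the history table above (simplified types)."""
    conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit BEGIN/COMMIT
    conn.execute("CREATE TABLE threads (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE thread_status_history ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " thread_id TEXT NOT NULL REFERENCES threads(id),"
        " prior_status TEXT NOT NULL, new_status TEXT NOT NULL,"
        " actor_kind TEXT NOT NULL, actor_id TEXT, run_id TEXT,"
        " transition_ts TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,"
        " CHECK ((actor_kind = 'user' AND actor_id IS NOT NULL) OR"
        "        (actor_kind IN ('kai', 'system') AND actor_id IS NULL)))"
    )
    conn.execute("INSERT INTO threads VALUES ('t1', 'open')")
    return conn


def transition_thread(conn, thread_id, prior, new, actor_kind,
                      actor_id=None, run_id=None):
    """Flip a thread's status and record the transition in one transaction.

    Returns True if the transition happened, False if the thread was not in
    the expected prior state (in which case nothing is written).
    """
    conn.execute("BEGIN")
    try:
        cursor = conn.execute(
            "UPDATE threads SET status = ? WHERE id = ? AND status = ?",
            (new, thread_id, prior),
        )
        if cursor.rowcount != 1:
            conn.execute("ROLLBACK")
            return False
        conn.execute(
            "INSERT INTO thread_status_history "
            "(thread_id, prior_status, new_status, actor_kind, actor_id, run_id) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (thread_id, prior, new, actor_kind, actor_id, run_id),
        )
        conn.execute("COMMIT")
        return True
    except Exception:
        # A failed history insert must also undo the status flip.
        conn.execute("ROLLBACK")
        raise
```

Because the status flip and the history row commit together, there is no state in which a thread has changed hands without a record of who changed it, and a row that violates the actor CHECK constraint takes the status change down with it.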
This is not exotic. Every event-sourced system in the world is built this way. The Rails ecosystem ships paper_trail and audited for exactly this reason; the Django ecosystem ships django-simple-history . Adding an LLM to the system does not change whether the audit trail is necessary. It changes how immediately the audit trail is necessary — because the moment a customer asks “why did your AI flag this comment as a change request” and the answer requires explaining a model snapshot that was deprecated last week, the data layer is the only honest source of truth. The fact that this code shipped without it, and was reviewed favorably, is a measure of how blind the standard-practice playbook is to the failure modes that show up only when one of the actors is metered, non-deterministic, and serving regulated content.
Joint 3 — Cost metering
The third joint is the one the team knows about and has decided to defer. The honesty is admirable. The deferral is not safe.
Inside the drift orchestrator, after a run completes, the code finalizes the run row with cost and token fields explicitly set to zero, accompanied by a comment that the SDK does not yet expose token usage:
orchestrator.py:141-145: zero cost finalization (Python)

    # orchestrator.py:141-145
    await db.finalize_drift_run(
        run_id,
        findings_count=len(all_findings),
        cost_cents=0,  # Kai doesn't yet expose token usage
        tokens=0,
    )
The finalization call is in the drift orchestrator of KellerAI's internal codebase 22 .
A drift run is webhook-triggered: a push to a tracked branch fires _handle_push , which spawns a run against the codebase as it exists at the resolved checkout SHA 23 . The per-pack turn cap is drift_kai_max_turns: int = 200 24 . A “pack” is one governance pack; a workspace can have many. A real customer codebase — the kind this system is meant to scan — is roughly 500K lines of code across an enterprise repository. The drift master kill switch ( drift_enabled ) is binary 25 : on, or off. There is no per-workspace quota. There is no per-pack budget. There is no rate limit on webhook-triggered runs. There is no cost alert. There is no upper bound on how much a single push to a noisy branch — say, a release-train branch that accumulates 30 merges a day — can spend.
Multiply: 30 pushes × N packs × 200 turns × 500K LOC of context-eligible source × the cost of a frontier-model token, and the failure mode is not “the system runs slow.” The failure mode is “the next invoice from Anthropic arrives, and no one on the team can explain it.”
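To make the multiplication concrete, here is a small worked upper bound. Only the 30 pushes per day and the 200-turn cap come from the text above. The pack count, the input tokens per turn, and the price per million tokens are hypothetical placeholders, and the bound assumes every run misses the cache and hits the turn cap, and ignores output tokens.

```python
def worst_case_daily_cost_usd(pushes_per_day, packs, max_turns,
                              input_tokens_per_turn, usd_per_million_input_tokens):
    """Upper bound on daily input-token spend if every run uses every turn."""
    total_turns = pushes_per_day * packs * max_turns
    total_tokens = total_turns * input_tokens_per_turn
    return total_tokens * usd_per_million_input_tokens / 1_000_000


# Hypothetical workload: 5 packs, 50K input tokens per turn, $3 per million tokens.
example_cost = worst_case_daily_cost_usd(
    pushes_per_day=30,
    packs=5,
    max_turns=200,
    input_tokens_per_turn=50_000,
    usd_per_million_input_tokens=3.0,
)
```

Under these assumed numbers the bound is 4,500 dollars per day for a single noisy branch. The figure itself matters less than the fact that nothing in the system computes it.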
The original review correctly flagged the turn cap, the master switch, per-pack failure isolation, and the auto-clear scoping as a “cost and blast-radius control” pattern 26 . The review's framing was that these guardrails are a complete production-AI response. They are not. They are a partial response. A turn cap bounds a single subprocess invocation. A master switch lets you stop all spend, abruptly, after the spike has been noticed. Neither answers the question that finance will ask, which is “how much did this workspace cost us last month, and is that more than we're billing them for it.”
The deferral is also visible in the cache. A cached drift verdict 27 does not store the cost of the run that produced it; the cache row simply has the verdict. A re-evaluation against a new pack version pays the full cost again, with no offset for the cost of the cache miss that produced the cached row in the first place. Cost, in this system, is not a number that exists anywhere. It is a property of the Anthropic invoice.
The production-grade version requires three changes. First, extract usage from the SDK's response objects. The claude-agent-sdk exposes usage on the response stream — every turn has an input-token count, an output-token count, and a cache-read/cache-write breakdown. The orchestrator can aggregate these as the subprocess runs, persist them on the run row, and have a real cost number at finalization. Second, write a kai_run_usage table: one row per run, with workspace ID, codebase ID, pack ID, model snapshot, input tokens, output tokens, cache tokens, computed cost in cents, and the run's start and end timestamps. This is the data shape every billing dashboard needs and that no application can extract from Anthropic's invoice retroactively at workspace granularity. Third, enforce a monthly budget at the orchestrator layer. Before invoking Kai, sum kai_run_usage.cost_cents for the workspace in the current billing period. If the sum exceeds the workspace's quota, refuse to launch the run with a structured event ( drift.run.quota_exceeded ). A workspace that hits its quota gets a clear UI message, not a surprise invoice.
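The first and third changes reduce to a small amount of arithmetic, sketched below. The TurnUsage and Pricing shapes are hypothetical and are not the SDK's own types; the per-turn token counts are assumed to have been collected by the orchestrator already, and the prices are placeholders rather than real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnUsage:
    """Token counts for one turn (hypothetical shape)."""
    input_tokens: int
    output_tokens: int


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices, in cents per million tokens."""
    input_cents_per_mtok: int
    output_cents_per_mtok: int


def run_cost_cents(turns, pricing):
    """Sum per-turn usage and convert it to whole cents, rounded up."""
    input_total = sum(t.input_tokens for t in turns)
    output_total = sum(t.output_tokens for t in turns)
    scaled = (input_total * pricing.input_cents_per_mtok
              + output_total * pricing.output_cents_per_mtok)
    return -(-scaled // 1_000_000)  # integer ceiling division


def may_launch_run(spent_cents_this_period, quota_cents):
    """Refuse new runs once the period's spend exceeds the workspace quota."""
    return spent_cents_this_period <= quota_cents
```

In this sketch the value returned by run_cost_cents is what would be persisted as cost_cents on the run row, and may_launch_run is the check that would gate the launch and emit drift.run.quota_exceeded when it returns False. Integer arithmetic is used throughout so that sums over many runs do not drift.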
The PR author has acknowledged that the SDK does not yet expose this surface 28 . That acknowledgment was correct months ago. It is no longer accurate: the SDK surface for usage telemetry has stabilized. The shipping system continues to record zero. That is not a deferral anymore; it is an unmaintained TODO. The cost meter, like the audit trail and the SIGKILL handler, is one of those pieces of plumbing that a non-AI service would never have shipped without and that an AI service shipped, was reviewed favorably, and was held up as instructive — because the standard playbook does not have a chapter on “your subprocess invoices you per token.”
Joint 4 — Model provenance in caches
The fourth joint is the one the original review actively praised and got wrong. This section reproduces the original review's celebration so the reader can see the misstep, then explains it.
Pattern 5 — Caching non-deterministic LLM output by content key. Kai's per-pack output is cached in a drift_findings_cache table keyed on the tuple (workspace_id, codebase_id, pack_id, pack_version, commit_sha) . Before invoking Kai, the orchestrator looks up that key; a hit returns the prior findings and skips the LLM call entirely. The key is content-derived, so any real input change — a new commit, a pack version bump — invalidates the cache naturally without explicit eviction. 29 “How do you cache something that is, by definition, not a pure function?” is one of the hardest questions in production AI. The answer here — make the cache key the full set of inputs and accept that the cached value is now the canonical answer for that key — is a clean, teachable resolution. 30
The cache key is, verbatim, in migration 018_drift_findings_cache.py :
migration 018_drift_findings_cache.py:37-51: cache UniqueConstraint (Python)

    # migration 018_drift_findings_cache.py:37-51
    sa.UniqueConstraint(
        "workspace_id",
        "codebase_id",
        "pack_id",
        "pack_version",
        "commit_sha",
        name="uq_drift_findings_cache_input_tuple",
    ),
Cited at 018_drift_findings_cache.py:37-51 31 .
The original framing is correct as far as it goes: the key encodes the inputs the codebase controls, and the cached value is the canonical answer for that key. The framing is wrong because the codebase does not control the most important input. The model that produced the verdict is determined inside _run_task via ClaudeAgentOptions — the model identifier is a parameter set on the SDK call, not a column on the cache row 32 . When the deployment configuration updates that parameter to a new snapshot — which the team will do, because Anthropic deprecates old snapshots on a schedule — every existing row in drift_findings_cache continues to be served as authoritative on the next request.
The verdict says “your codebase fails compliance obligation X-3.2,” with a citation to an authentication module, with a severity of 4. The verdict was produced by a model that no longer exists. The next user who opens the compliance dashboard sees the verdict, opens a ticket against the engineering team, the engineering team fixes the code, the next webhook fires, the cache misses on the new commit SHA, the new model is invoked, and the new model — applying a different rubric, or simply being a different model — does not flag the file. What the team experiences is: “Kai flagged this. We fixed it. The flag went away.” What actually happened is: “A retired model produced a verdict; we kept serving it; we re-evaluated under a new model on a new commit; the new model produced no verdict; we mistook the absence of a flag for evidence of fix.”
The original review missed this because the original review was scanning for cache-correctness in the conventional sense — that input changes invalidate the cache — and not for cache-correctness in the LLM-specific sense — that the evaluator is also an input. This is the cleanest example in the paper of what the thesis claims: the LLM-praised, “enterprise-grade” code is correct against the patterns of a discipline that does not know what it is missing. A pure-function cache that omits the function's identity is, formally, a bug. The same bug would be obvious if the function were a versioned external service; it is invisible when the function is a model snapshot, because nobody on the team has the habit of thinking of the model as a versioned external service.
The production-grade version is, again, mechanical. Add model_version and prompt_hash to the cache key:
Corrected UniqueConstraint: model_version and prompt_hash added (Python)

    sa.UniqueConstraint(
        "workspace_id",
        "codebase_id",
        "pack_id",
        "pack_version",
        "model_version",
        "prompt_hash",
        "commit_sha",
        name="uq_drift_findings_cache_input_tuple",
    ),
model_version is the exact snapshot string Anthropic returns in the response ( claude-3-7-sonnet-20250219 , not claude-3-7-sonnet ). prompt_hash is the SHA-256 of the rendered prompt template after Jinja substitution but before model invocation — capturing the case where the prompt is updated without the pack version being bumped. Cache reads filter on the currently deployed model. A model snapshot change becomes a cache miss for every row produced by the prior snapshot; the system re-evaluates those rows lazily as users open the dashboards. Stale verdicts age out the same day the model rolls forward, not on the next commit.
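A minimal sketch of the lookup side follows, assuming the rendered prompt is available as a string before invocation. The CacheKey tuple, the compute_prompt_hash helper, and the dictionary used as a cache are illustrative stand-ins for the drift_findings_cache table and its query, not the codebase's implementation.

```python
import hashlib
from typing import NamedTuple


class CacheKey(NamedTuple):
    """Every input to the verdict, including the evaluator itself."""
    workspace_id: str
    codebase_id: str
    pack_id: str
    pack_version: str
    model_version: str
    prompt_hash: str
    commit_sha: str


def compute_prompt_hash(rendered_prompt: str) -> str:
    """SHA-256 of the prompt after template substitution, as a hex string."""
    return hashlib.sha256(rendered_prompt.encode("utf-8")).hexdigest()


def lookup_findings(cache: dict, key: CacheKey):
    """Return cached findings for this exact input tuple, or None on a miss."""
    return cache.get(key)
```

With the model snapshot in the key, rolling the deployment forward to a new snapshot turns every old row into a miss without any explicit eviction, and the old rows remain in the table for audit.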
A secondary improvement, optional but valuable: a model_version column with no constraint on the write path but a WHERE clause on the read path allows the system to keep historical rows for audit (“here is what the prior model said in February”) while serving only fresh rows to the UI (“here is what the current model says today”). The audit trail problem from Joint 2 connects here: a finding that no longer holds is not gone, it is outdated, and a regulator who asks “why was this finding marked failing on March 1” deserves an answer that points to a specific model snapshot, a specific prompt hash, and a specific run, not “the cache said so.”
The misstep in the original review is not catastrophic. It is instructive. A senior engineer, reviewing carefully, looked at a clean pattern and applauded it. The pattern was clean against the failure modes the reviewer was scanning for. It was unclean against a failure mode the reviewer was not scanning for, because the reviewer's mental model — built from a career of caching deterministic functions — did not have “the function is itself versioned” as a category. The bill from this oversight is paid by the customer the next time the customer trusts an outdated finding.
Joint 5 — Transactional semantics vs. ad-hoc state machines
This is the headline joint. It is where the standard playbook fails most visibly, and it is where the smoking gun lives.
The smoking gun is in the agent orchestration module of KellerAI's internal codebase. In the function responsible for handling tool interrupts, the code correctly wires the claude-agent-sdk's native interrupt callback, can_use_tool, for AskUserQuestion 33 . can_use_tool is the SDK's first-class primitive for structured mid-flight interaction: when the model attempts to use a registered tool, the SDK suspends the subprocess at the tool-call boundary, hands the proposed input to the host application as a typed object, and resumes the subprocess with the host's response when the host returns one. There is no JSON file on disk. There is no polling. There is no retry. The contract is enforced by the SDK at the point of the interrupt.
The same file, three function definitions later, refuses to use that primitive for classification. classify_kai_mention at lines 772–787 instead renders a Jinja template that instructs the model to write its classification result to a file called _CLASSIFICATION.json in the interaction directory, runs the subprocess to completion (up to 50 turns), reads the file from disk after the subprocess exits, and calls ClassificationResult.model_validate on the parsed JSON 34 . If the file is missing, the surrounding _run_task loop at lines 567–583 re-prompts the model and resumes the session — paying the prior turns' context tokens again — and tries again, up to max_generation_tries=3 35 . Three retries of a 50-turn subprocess. The same pattern repeats at generate_revision , lines 809–818 36 .
The author knew. The author used can_use_tool for AskUserQuestion , in the same file, 300 lines earlier. The PR comments do not record a reason for the asymmetry. The synthesis cannot pretend the choice was justified — the design rationale is silent in the record. What we can say: a structured-output contract enforceable at the tool-call boundary was available, was understood by the author, and was not used. A file-poll-retry contract was implemented in its place, with worse latency, worse cost, worse failure semantics, and worse observability. This is what it looks like when a standard practice — “validate model output with Pydantic” — is applied without a grasp of the primitive the SDK provides for doing the same job correctly.
There are three other ad-hoc state machines in the same branch that have similar pathologies.
The conditional-UPDATE mutex. The thread-locking pattern uses UPDATE document_comment_threads SET status = 'thinking' WHERE id = $1 AND status = 'open' and checks the affected-row count 37 . This is described in the original review as a “race-safe” mutex 38 . It is race-safe for the duration of the UPDATE statement. It is not a mutex across the LLM-processing window.
The reason is a property of read-committed isolation that every Postgres-using engineer learns and that almost every Postgres-using engineer forgets at exactly the wrong moment. The row-level lock taken by an UPDATE is released when the transaction commits. The UPDATE here lives in a short-lived transaction (the asyncpg call commits immediately). After commit, the row's status is thinking, but no lock is held. The Kai subprocess then runs for seconds to minutes. During that window the status column is the only guard, and it protects only callers that re-check it at write time. Under read-committed, each statement sees the data committed as of the moment that statement started, so a statement already in flight when the UPDATE committed, or a caller that read the status earlier and acts on that read later, may still see open and proceed, in particular through the codepath that allows non-Kai writes when status is open.
The PR author wrote, of this same area, verbatim:
To frame the scope of this PR: the goal here is to make sure the database never enters an invalid state and all race conditions are handled — not necessarily gracefully yet. How we present these to the user (retry logic, advisory locks, etc.) we can loop back on. 39
“Advisory locks, etc., we can loop back on” is the right escape hatch. The advisory-lock layer is what holds the invariant across the LLM window. The version that holds the invariant is pg_advisory_xact_lock(hashtext('thread:' || $thread_id)) , taken inside a transaction at the start of the Kai run and held for the duration. Concurrent callers attempting the same lock block until the LLM completes (or until a statement timeout fires). The conditional-UPDATE pattern is then a cheap optimistic check in front of the lock — it short-circuits the duplicate request before the lock contention — but the lock itself is what enforces the mutex. What shipped is the optimistic check without the lock behind it. What shipped looks correct because the original review knew the conditional-UPDATE pattern from non-AI codebases where it is correct, and did not notice that the addition of an LLM extends the critical section by three orders of magnitude.
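The lock-behind-the-check arrangement might be sketched like this. The sketch assumes conn is an asyncpg connection (only its transaction() and execute() methods are used). The run_with_thread_lock name, the work callback, and the timeout parameter are hypothetical, and a production version would still have to decide how long it is acceptable to hold a transaction open.

```python
import asyncio

# Transaction-scoped advisory lock keyed on the thread id.
LOCK_SQL = "SELECT pg_advisory_xact_lock(hashtext($1))"


async def run_with_thread_lock(conn, thread_id, work, timeout_seconds=300.0):
    """Hold an advisory lock on the thread for the whole of work(conn).

    The lock is released automatically when the transaction ends, whether by
    commit, by rollback, or because the connection died with the process.
    """
    async with conn.transaction():
        await conn.execute(LOCK_SQL, f"thread:{thread_id}")
        # Bound the critical section so a hung LLM run cannot hold the lock forever.
        return await asyncio.wait_for(work(conn), timeout=timeout_seconds)
```

The conditional UPDATE would still run first as the cheap duplicate-request check; the difference is that a second caller who gets past it now blocks on the lock instead of proceeding.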
PostgreSQL NOTIFY/LISTEN with no reconnect. The SSE event bus is built on a single asyncpg.Connection that issues LISTEN document_events 40 , with a callback that fans events out to per-subscriber asyncio queues 41 . When the underlying connection drops (and Postgres connections drop, for reasons ranging from PgBouncer recycling to network partitions to upgrade events), the callback ceases to fire. The asyncpg connection object remains an attribute on SubscriptionManager. No reconnect logic re-establishes the listener. Every SSE subscriber connected to this server instance goes silently dark: the in-memory queues stop receiving puts, the queue.get() in each subscriber's async generator blocks forever, and sse_starlette's heartbeat factory continues to ping the client every 30 seconds with comment: heartbeat 42 .
From the browser, the connection looks alive. The application looks alive. The browser will not retry the SSE connection because the SSE connection is, by every observable signal, working. And no events flow.
There is no alert on listener death. There is no health check that asserts “the LISTEN connection has fired at least one notification in the last N minutes.” There is no test in the test file that takes the connection down and asserts the system recovers; the test that exists round-trips the serialization format, not the connection lifecycle 43 . This is the failure mode where a customer reports “the UI stops updating after a while,” the on-call engineer SSHes in, sees green dashboards, sees connected clients, sees no errors in the application log, and has no way to make the problem visible without first knowing that NOTIFY/LISTEN reconnect is a thing. A non-AI service that depended on this same pattern would, eventually, also fail this way; the difference with an AI service is that the AI is the primary reason users care about real-time updates (“Kai is thinking…” status is the central UX affordance), so the surface area of the silent failure is the entire feature.
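The missing health check is small. The sketch below tracks when the LISTEN callback last fired and reports unhealthy after too long a silence. ListenerWatchdog and its method names are hypothetical, and the approach assumes something emits a periodic synthetic NOTIFY so that a quiet system is not mistaken for a dead listener.

```python
import time


class ListenerWatchdog:
    """Tracks how long it has been since the LISTEN callback last fired."""

    def __init__(self, max_silence_seconds, clock=time.monotonic):
        self._max_silence = max_silence_seconds
        self._clock = clock  # injectable so the check can be tested
        self._last_seen = clock()

    def record_notification(self):
        """Call this from the LISTEN callback on every notification."""
        self._last_seen = self._clock()

    def silence_seconds(self):
        return self._clock() - self._last_seen

    def is_healthy(self):
        """False once the listener has been silent longer than allowed."""
        return self.silence_seconds() <= self._max_silence
```

Wired into a readiness endpoint or a metrics gauge, is_healthy() gives the on-call engineer a red signal for exactly the state that currently looks green.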
The production-grade version is a durable event log: either Postgres logical replication via something like wal2json consumed by a stable reader, or an actual broker (Kafka, NATS, even Redis Streams), with explicit dead-letter handling and a reconnect-with-backoff supervisor. The cheap version, if a broker is overkill, is the same LISTEN connection wrapped in a supervisor coroutine that detects disconnection (asyncpg exposes this through Connection.add_termination_listener ), re-establishes the connection, and re-issues LISTEN , with a structured event ( sse.listener.reconnected ) on every recovery and an alert when the time between events exceeds the expected heartbeat by a wide margin. Notifications sent while the listener is down are not replayed, so the cheap version restores liveness, not the events that were missed.
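A minimal sketch of the cheap version follows, under stated assumptions: the connection factory is injected (in practice it might be something like lambda: asyncpg.connect(dsn) ), the function name supervise_listener and the sse.listener.connect_failed event name are hypothetical, and the only asyncpg calls relied on are add_listener , add_termination_listener , close , and terminate .

```python
import asyncio
import logging

log = logging.getLogger("sse.listener")


async def supervise_listener(connect, channel, on_event, stop,
                             base_delay=0.5, max_delay=30.0):
    """Keep a LISTEN registration alive until `stop` (asyncio.Event) is set.

    `connect` is an async factory returning an asyncpg-style connection.
    Returns the number of successful reconnects.
    """
    delay = base_delay
    reconnects = 0
    connected_before = False
    while not stop.is_set():
        closed = asyncio.Event()
        conn = None
        try:
            conn = await connect()
            # Fires when the connection closes for any reason.
            conn.add_termination_listener(lambda _conn, ev=closed: ev.set())
            await conn.add_listener(channel, on_event)  # issues LISTEN
        except Exception:
            log.warning("sse.listener.connect_failed", exc_info=True)
            if conn is not None:
                conn.terminate()  # drop the half-set-up connection
            await asyncio.sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
            continue

        if connected_before:
            reconnects += 1
            log.info("sse.listener.reconnected", extra={"reconnects": reconnects})
        connected_before = True
        delay = base_delay

        # Sleep until the connection dies or shutdown is requested.
        waiters = [asyncio.ensure_future(closed.wait()),
                   asyncio.ensure_future(stop.wait())]
        _, pending = await asyncio.wait(waiters, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()
        if stop.is_set() and not closed.is_set():
            await conn.close()
    return reconnects
```

The sketch reconnects immediately after a drop and backs off only when connecting fails. It does nothing about events lost during the gap; subscribers would need a resync on sse.listener.reconnected to cover that.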
difflib.SequenceMatcher for anchor reconciliation. The fifth ad-hoc state machine is the reconcile_positions function that maps comment-thread anchors across the diff 44 . It uses difflib.SequenceMatcher to produce opcodes, builds a line-to-line map, looks up each anchor's start and end line, and returns a new anchor with updated line numbers and preserved column offsets 45 .
The line mapping is approximately right for trivial edits. It is wrong for real edits in two ways. First, the replace opcode maps old-to-new positionally up to min(old_count, new_count) ; a replace block where the LLM rewrote three lines into five has the first three new lines paired with the three old lines and the last two new lines orphaned with no mapping — but the anchored text in the user's comment might point to the content of one of the old lines, not the position, and a positional mapping will silently point the anchor at semantically unrelated new content. Second, the column-preservation rule is a lie of convenience: the columns are 0-indexed character offsets into the line 46 , and a line whose content changed cannot reasonably preserve its column offsets. A comment anchored at columns 12–34 of line 47, where Kai rewrote line 47 and the new line is 18 characters long, has its anchor pointing into columns that do not exist. The UI's “position lost” affordance triggers only when the line is deleted; anchors against rewritten content silently highlight the wrong substring.
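Both failure modes are easy to reproduce. The sketch below is a reconstruction from the description above, not the actual reconcile_positions code; positional_line_map and columns_fit are hypothetical helpers, and the column range is assumed to be half-open.

```python
import difflib


def positional_line_map(old_lines, new_lines):
    """Map old line indexes to new ones by pairing lines positionally."""
    mapping = {}
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            paired = i2 - i1
        elif tag == "replace":
            paired = min(i2 - i1, j2 - j1)  # position only, content ignored
        else:
            paired = 0  # "delete" and "insert" produce no mapping
        for k in range(paired):
            mapping[i1 + k] = j1 + k
    return mapping


def columns_fit(line, start_col, end_col):
    """True if the 0-indexed range [start_col, end_col) lies inside `line`."""
    return 0 <= start_col <= end_col <= len(line)


# Three old lines rewritten into five new ones, between unchanged lines.
old = ["intro", "alpha", "beta", "gamma", "outro"]
new = ["intro", "one", "two", "three", "four", "five", "outro"]

line_map = positional_line_map(old, new)
orphaned = sorted(set(range(len(new))) - set(line_map.values()))
print(line_map)   # {0: 0, 1: 1, 2: 2, 3: 3, 4: 6}
print(orphaned)   # [4, 5]

# An anchor at columns 12-34 on a line that is now 18 characters long.
print(columns_fit("x" * 18, 12, 34))  # False
```

Old line 1 ( "alpha" ) lands on new line 1 ( "one" ) purely because both are first in their block. Nothing in the mapping records whether the content is related.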
This is not a failure of the LLM. It is a failure of the standard practice — “use difflib for diff computation, it's in the standard library” — applied without an understanding of what comment anchors actually need to survive. The right primitive is a tree-edit-aware diff over the syntactic representation, or a CRDT that maintains anchor positions through structural edits as a first-class concern, or — at minimum — a content-based matching layer that locates the anchor's original text in the new document and refuses to point the anchor anywhere if the original text is no longer present in a sufficiently similar context. difflib.SequenceMatcher , in this layer, is the equivalent of using str.replace for HTML escaping. It works for the cases the author tested. It fails for the cases the LLM produces in practice, silently, by lighting up the wrong substring on every meaningfully restructured edit.
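The minimum option, a content-based matching layer, can be sketched in a few lines. This is an illustration under assumptions rather than a proposed implementation: relocate_anchor is a hypothetical name, the anchored text is assumed to sit on a single line, and the 0.6 context threshold is an arbitrary placeholder that would need tuning against real edits.

```python
import difflib


def relocate_anchor(anchored_text, old_line, new_lines, min_context_ratio=0.6):
    """Find `anchored_text` in the new document, or refuse.

    Returns (line_index, start_col, end_col) for the occurrence whose line
    is most similar to the original line, or None when the text is gone or
    only appears in a dissimilar context.
    """
    if not anchored_text:
        return None
    best = None
    for line_no, line in enumerate(new_lines):
        col = line.find(anchored_text)
        if col == -1:
            continue  # the anchored text is not on this line
        ratio = difflib.SequenceMatcher(None, old_line, line).ratio()
        if ratio >= min_context_ratio and (best is None or ratio > best[0]):
            best = (ratio, line_no, col)
    if best is None:
        return None  # caller should show "position lost", not guess
    _, line_no, col = best
    return (line_no, col, col + len(anchored_text))
```

The important behavior is the None branch: when the original text cannot be found in a similar enough line, the anchor is reported as lost instead of being pointed at whatever now occupies the old position.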
Closing — what the bill looks like
The five joints are not a list of bugs. They are five instances of a single pattern.
At each joint, an engineer encountered a problem that the standard software-engineering playbook has a chapter on. At each joint, the engineer applied the chapter. At each joint, the chapter was insufficient — not wrong, insufficient — because the LLM in the loop has a property that the chapter was not written for.
The in-memory rollback buffer (Joint 1) is the standard exception-handling chapter, applied without the chapter on durable substrates for processes that get SIGKILL 'd. The status-overwrite (Joint 2) is the standard CRUD chapter, applied without the chapter on audit trails for systems whose actors include non-deterministic services. The zero-cost meter (Joint 3) is the standard deferral chapter, applied without the chapter on metered subprocesses that bill per token. The content-key cache (Joint 4) is the standard caching chapter, applied without the chapter on versioned external evaluators. The conditional-UPDATE mutex (Joint 5) is the standard Postgres-concurrency chapter, applied without the chapter on critical sections whose duration is determined by an LLM.
In every case, the code looked normal. In every case, the LLM, asked to review the diff, praised the code. In every case, the prior review of the entire branch identified the pattern, named the trade-off the author had made, and called it instructive. In every case, the failure mode is real and is not theoretical and will surface — in production, under load, from a customer report, from an Anthropic invoice, from a regulator's inquiry — at the moment when the team can least afford to be reading the source code with fresh eyes.
The thesis is uncomfortable on purpose. The engineering on these branches is not bad. The engineers are not negligent. The reviews were not careless. The product works in the demo. And every one of the failure modes documented above will, with high probability, occur in production under conditions the team has not yet rehearsed.
The bill always comes. What the bill looks like is not the spike on a graph; it is the slow accumulation of small structural absences that, individually, look like reasonable deferrals, and collectively form a system whose load-bearing joints are imitations of the load-bearing joints a non-AI version of the same system would have. The LLM, asked to read the PR diff, will praise the imitations. A senior engineer, scanning for the failure modes of the prior discipline, will applaud the patterns. Neither will catch the gaps. The gaps will be caught by production, in the customer's pocket.
This paper has a sibling: a brief, also titled The Subprocess Contract, that frames the same thesis without the engineering detail. Two further siblings are still to be written. The first, Trust-but-Verify , takes up the question this paper raises in Joint 2 and Joint 4: what does it mean to trust an LLM verdict, and what verification infrastructure makes that trust auditable? The second, Citations and Decision Traces , addresses the surface this paper barely touched: an LLM finding without a trace from input to evidence to verdict is not a finding, it is a rumor, and the standard practice for rendering rumors as findings is the practice that will collapse first under regulatory scrutiny.
The series is one long argument: standard practices are necessary and insufficient. The “and insufficient” is where the work lives. The bill arrives whether or not the work gets done. Done early, it is engineering. Done late, it is incident response. Done not at all, it is the next customer-reported outage that the team cannot explain — because the LLM, asked what went wrong, will praise the code.