kellerai.blog

The Bill Always Comes

When engineers apply standard software practices to AI problems without a firm grasp of both, the architectural mistakes pile up in the blind spot of their productivity.

KellerAI White Paper · Code Quality & Architecture · May 2026

Context

When engineers apply standard software practices to AI problems without understanding the second discipline (how an LLM behaves as a runtime actor), the LLM in their editor grades the code and praises it, while architectural mistakes pile up in the blind spot of their newfound productivity. The code looks defensive, clean, and production-grade; the failures are real and will surface when the team can least afford to diagnose them.

The Finding

Standard practices are necessary but insufficient. The in-memory rollback buffer, the Pydantic schema validation, and the database mutex are each a pattern from a non-AI codebase, applied without the chapter on processes that get SIGKILL'd, non-deterministic actors that cost money per token, and cache keys that omit their own evaluator. The remedy is not a new pattern; it is requiring that a human who understands both halves is in the review loop before the diff lands.

Tags: Code Quality & AI Architecture · Technical Debt in AI Systems · Production Engineering Discipline

Paper Details
Category: Code Quality & Architecture
Audience: Engineering teams and technical leaders building AI-integrated systems
Method: Source code analysis of production branches + architectural pattern audit + failure-mode walkthrough
Length: ~1,180 · 5 min
Sections: 6
Date: May 2026
Authors: KellerAI
Section 01

The praise that should have been a warning

A developer sits down to add a small AI feature to a real product. They write the kind of code their training tells them to write: a state machine for the AI's processing state, a transactional snapshot before the model runs, a try/finally that restores the snapshot on cancellation, a Pydantic schema enforcing the model's structured output, and a database lock so two users cannot summon the model into the same thread at once.

The LLM in their editor calls it elegant. It calls it defensive. It calls it production-grade. The reviewer agrees, the PR lands, and the honeymoon begins.

The code is competent. The patterns are recognizable from any senior engineer's toolkit. And in two real branches we recently reviewed on a production platform, those patterns sit directly next to architectural mistakes the same engineer would have caught instantly in a non-AI codebase — because in a non-AI codebase, another engineer would have asked the obvious question. With an LLM in the loop, no one asked. The LLM admired.

Applying standard software-engineering practices to AI-engineering problems without a firm grasp of both does not stabilize the system. It accelerates the tech debt while convincing everyone the codebase is clean. The bill always comes.

Section 02

What the bill looks like

Consider a feature that classifies a user's chat mention and decides whether it is a question or a change request. The framework being used, Anthropic's claude-agent-sdk, provides a native primitive for exactly this kind of interrupt-driven typed exchange: can_use_tool, a callback the SDK invokes before a tool call executes, where the host application can validate or modify arguments at the interrupt boundary. The engineer is aware of it. In the same file, a few hundred lines away, they have already wired can_use_tool correctly for an AskUserQuestion flow.

For classification, they choose differently. The classifier writes a JSON file to disk; a polling loop watches for the file; a Pydantic model parses it after the fact; and if the file is missing or malformed, the entire fifty-turn subprocess is restarted, up to three times. A classification that lands on the third attempt has burned up to a hundred and fifty model turns (three runs of fifty turns each) and produced no record of which attempt was actually accepted. The native primitive would have enforced the schema at the interrupt boundary before a single turn was spent.
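The contrast can be sketched with the standard library alone. The names below (Allow, Deny, check_classification, worst_case_turns, and the "kind" field) are hypothetical stand-ins and are not the claude-agent-sdk API; a real integration would presumably run an equivalent check inside the SDK's can_use_tool callback, likely with a Pydantic model. The sketch only illustrates the idea of rejecting a malformed payload when the arguments arrive, rather than discovering it after a full subprocess run.

```python
from dataclasses import dataclass
from typing import Any, Union

# Assumption: the two labels named in the text, spelled as hypothetical constants.
VALID_KINDS = {"question", "change_request"}


@dataclass(frozen=True)
class Allow:
    """Arguments passed validation; the call may proceed."""
    kind: str


@dataclass(frozen=True)
class Deny:
    """Arguments failed validation; the message goes back to the model."""
    message: str


def check_classification(tool_input: dict[str, Any]) -> Union[Allow, Deny]:
    """Validate classifier arguments at the boundary (hypothetical hook)."""
    kind = tool_input.get("kind")
    if not isinstance(kind, str) or kind not in VALID_KINDS:
        return Deny(f"kind must be one of {sorted(VALID_KINDS)}, got {kind!r}")
    return Allow(kind=kind)


def worst_case_turns(turns_per_attempt: int, attempts: int) -> int:
    """Upper bound on turns when every failure restarts the whole subprocess."""
    return turns_per_attempt * attempts
```

With the figures from the text, worst_case_turns(50, 3) gives the 150-turn ceiling, whereas a rejected payload at the boundary costs a single denied call that the model can correct in place.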

Now consider a separate feature: a compliance auditor that runs an LLM-as-judge over a customer's codebase and caches the verdict so a re-run on the same commit is free. The cache key is (workspace, codebase, pack_id, pack_version, resolved_sha). It is missing one field: the model version. The first time the underlying model is upgraded, whether by a snapshot bump or a provider's silent rollout, every cached verdict from the prior model continues to be served as authoritative compliance evidence. There is no eviction, because nothing in the key changed. The new model is never asked. The customer's audit log shows clean verdicts produced by a model that no longer exists.
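One way to close this gap, sketched under assumptions: add the evaluator's identity to the key so that a model change produces a cache miss. The VerdictKey class, its digest method, and the model_version field are illustrative names, not taken from the audited code.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VerdictKey:
    """Cache key for a judge verdict (hypothetical shape)."""
    workspace: str
    codebase: str
    pack_id: str
    pack_version: str
    resolved_sha: str
    model_version: str  # assumption: the pinned model snapshot identifier

    def digest(self) -> str:
        """Stable hash over every field, including the evaluator's identity."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

This only helps if model_version records the concrete snapshot that produced the verdict rather than a floating alias; a silent rollout behind an alias would leave the key unchanged.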

Neither of these is bad engineering by the conventional definition. Both files are well-structured, well-tested, well-commented. Both passed review.

Section 03

Why the bill compounds

Each gap, taken alone, looks like reasonable triage. When a reviewer on the first feature asked about a race condition in the get-or-create path, the engineer answered:

“To frame the scope of this PR: the goal here is to make sure the database never enters an invalid state and all race conditions are handled — not necessarily gracefully yet. How we present these to the user (retry logic, advisory locks, etc.) we can loop back on.”

When a different reviewer asked about a SIGKILL gap in the in-memory rollback buffer, the same engineer wrote:

“Added a 5-minute default timeout… Doesn't cover pod-crash recovery (thread stuck in thinking when the process dies) — that's a separate problem we've deferred.”
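The deferred pod-crash case is the kind of failure a lease on the row is often used for. The following is a minimal sketch, not the platform's code: it uses an in-memory SQLite database so it runs anywhere, and the threads table, its columns, and the try_acquire and release helpers are all hypothetical. The idea is that a claim carries an expiry, so a thread left in "thinking" by a dead process can be reclaimed once the lease lapses.

```python
import sqlite3

LEASE_SECONDS = 300  # assumption: mirrors the 5-minute timeout quoted above


def make_db() -> sqlite3.Connection:
    """Create a one-row demo table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE threads ("
        " id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'open',"
        " owner TEXT,"
        " lease_expires_at REAL)"
    )
    conn.execute("INSERT INTO threads (id) VALUES ('t1')")
    conn.commit()
    return conn


def try_acquire(conn: sqlite3.Connection, thread_id: str, owner: str, now: float) -> bool:
    """Claim the thread if it is open or the previous owner's lease has lapsed."""
    cur = conn.execute(
        "UPDATE threads SET status = 'thinking', owner = ?, lease_expires_at = ? "
        "WHERE id = ? AND (status = 'open' OR lease_expires_at < ?)",
        (owner, now + LEASE_SECONDS, thread_id, now),
    )
    conn.commit()
    return cur.rowcount == 1


def release(conn: sqlite3.Connection, thread_id: str, owner: str) -> bool:
    """Reopen the thread, but only if the caller still owns it."""
    cur = conn.execute(
        "UPDATE threads SET status = 'open', owner = NULL, lease_expires_at = NULL "
        "WHERE id = ? AND owner = ?",
        (thread_id, owner),
    )
    conn.commit()
    return cur.rowcount == 1
```

A worker whose model call can outlast the lease would also need to renew it periodically; that heartbeat is omitted here. The owner check in release keeps a slow worker from reopening a thread that another worker has since reclaimed.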

Read in isolation, every one of these deferrals is responsible. Stacked, they describe a different system: one that survives the happy path beautifully and silently corrupts itself under any failure the team has not yet found time to model. The conditional UPDATE … WHERE status = 'open' mutex holds the row lock only for the UPDATE itself, not for the seconds to minutes of LLM processing that follow. The PostgreSQL NOTIFY/LISTEN bus runs on a single connection that never reconnects; a connection drop means every browser subscriber goes silently dark while still receiving heartbeats. The difflib-based position reconciler preserves column offsets unchanged after a line is rewritten, leaving anchors pointed at characters that no longer mean what they meant. Cost meters are hard-coded to zero, so there is no per-request observability into spend.
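The column-anchor problem is easy to reproduce with difflib itself. The sketch below is one possible mitigation under stated assumptions, not the reconciler from the audited branch: the reanchor function and its (line, col, length) anchor shape are hypothetical. For an unchanged line it keeps the column; for a rewritten line it searches for the anchored text again, and it drops the anchor when the text is gone or ambiguous instead of trusting the old offset.

```python
import difflib
from typing import Optional


def reanchor(
    old: list[str], new: list[str], line: int, col: int, length: int
) -> Optional[tuple[int, int]]:
    """Map an anchor from old text to new text, or return None to drop it."""
    anchored = old[line][col:col + length]
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if not (i1 <= line < i2):
            continue
        new_line = j1 + (line - i1)
        if tag == "equal":
            # The line is unchanged, so the column offset is still valid.
            return (new_line, col)
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # The line was rewritten: find the anchored text again
            # instead of reusing the old column.
            candidate = new[new_line]
            if anchored and candidate.count(anchored) == 1:
                return (new_line, candidate.index(anchored))
        return None  # deleted, ambiguous, or no longer present
    return None
```

For example, with old line "total = price * qty" rewritten to "total = compute_total(price, qty)", an anchor on "price" at column 8 would land on "compu" if the column were preserved; re-searching moves it to column 22.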

None of this is hidden. All of it is shipped.

Section 04

The blind spot the LLM creates

The non-AI version of any of these files would have been read by another engineer with the same training as the author. That engineer would have asked the obvious questions (why are you not using the native primitive? what happens to the cache when the model changes? what holds the lock while the model is thinking?) because those questions are obvious in an environment where the reader and writer share the same expectations.

The LLM does not share expectations. It pattern-matches on surface shape. A file with a try/finally, a Pydantic model, and a typed enum looks defensive to it, regardless of whether the defenses point at the actual failure modes of the system being defended. Worse, the LLM tells the engineer the code is defensive, in plain prose and with confidence, and the engineer feels the warm validation of a senior peer review. It is not a senior peer review. It is a stylistic match against a corpus that does not contain this system.

The engineer feels twice as productive. They are not twice as productive. They are producing code at twice the speed, with the review function silently broken.

Section 05

What this means for teams building AI products

Three things to require of yourself before you trust your own production AI code:

  • Read the SDK and the library before you reinvent its primitives.

    If the framework you depend on offers a native interrupt, a native validator, a native event, the burden is on you to justify why you wrote a polling loop instead.

  • Treat every “we can loop back on that” as a permanent commitment.

    The deferral is not a delay. It is the design, until proven otherwise.

  • Get a human who understands both halves in the review loop.

    The LLM cannot tell you that your AI-shaped code is missing the AI-shaped defenses. Another engineer with both the SWE and the AI-engineering background can. Until that person reads the diff, the diff is not reviewed.

Section 06

The companion technical paper

For the file-and-line walkthrough of every gap named here (the file-poll retry loop, the cache key missing model_version, the conditional-UPDATE mutex's true scope, the NOTIFY/LISTEN reconnect bug, the difflib column-anchor failure, and the zeroed cost meter), read the companion technical whitepaper, The Subprocess Contract: Production Patterns for LLM-Integrated Systems.