The Bill Always Comes: Why "Enterprise-Grade" AI Code Often Isn't

KellerAI

Section 01

The praise that should have been a warning

A developer sits down to add a small AI feature to a real product. They write the kind of code their training tells them to write: a state machine for the AI's processing state, a transactional snapshot before the model runs, a try/finally that restores the snapshot on cancellation, a Pydantic schema enforcing the model's structured output, a database lock so two users cannot summon the model into the same thread at once.

The LLM in their editor calls it elegant. It calls it defensive. It calls it production-grade. The reviewer agrees, the PR lands, and the honeymoon begins.

The code is competent. The patterns are recognizable from any senior engineer's toolkit. And in two real branches we recently reviewed on a production platform, those patterns sit directly next to architectural mistakes the same engineer would have caught instantly in a non-AI codebase — because in a non-AI codebase, another engineer would have asked the obvious question. With an LLM in the loop, no one asked. The LLM admired.

Applying standard software-engineering practices to AI-engineering problems without a firm grasp of both does not stabilize the system. It accelerates the tech debt while convincing everyone the codebase is clean. The bill always comes.

Section 02

What the bill looks like

Consider a feature that classifies a user's chat mention and decides whether it is a question or a change request. The framework being used, Anthropic's claude-agent-sdk, provides a native primitive for exactly this kind of interrupt-driven typed exchange: can_use_tool, a callback the SDK fires before any tool call, where the host application can validate or modify arguments at the interrupt boundary. The engineer is aware of it. In the same file, a few hundred lines away, they have already wired can_use_tool correctly for an AskUserQuestion flow.

For classification, they choose differently. The classifier writes a JSON file to disk; a polling loop watches for the file; a Pydantic model parses it after the fact; and if the file is missing or malformed the entire fifty-turn subprocess is restarted, up to three times. A classification that lands on the third retry has burned up to a hundred and fifty model turns and produced no record of which attempt was actually accepted. The native primitive would have enforced the schema at the interrupt boundary before a single turn was spent.
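The contrast can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the reviewed code and not the claude-agent-sdk's actual types: Decision, validate_classification, worst_case_turns, and the kind field name are all hypothetical stand-ins for whatever the SDK's callback receives and returns. It shows only that a validator run at the interrupt boundary rejects a malformed payload on the spot, while a restart-on-failure loop has a worst case of attempts times turns per attempt.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical names throughout; these are NOT the claude-agent-sdk's types.
ALLOWED_KINDS = {"question", "change_request"}


@dataclass(frozen=True)
class Decision:
    allow: bool
    message: Optional[str] = None  # feedback the host could hand back to the model


def validate_classification(tool_input: dict[str, Any]) -> Decision:
    """Check a classification payload before the tool call is accepted."""
    kind = tool_input.get("kind")
    if kind not in ALLOWED_KINDS:
        return Decision(False, f"kind must be one of {sorted(ALLOWED_KINDS)}, got {kind!r}")
    return Decision(True)


def worst_case_turns(max_attempts: int, turns_per_attempt: int) -> int:
    """Upper bound on turns spent when every failed attempt restarts the subprocess."""
    return max_attempts * turns_per_attempt


print(validate_classification({"kind": "question"}))
print(validate_classification({"kind": "maybe"}))
print(worst_case_turns(3, 50))
```

In a real integration the rejection would be returned through whatever result type the SDK's callback defines, so the model can correct the call within the same run instead of triggering a restart.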

Now consider a separate feature: a compliance auditor that runs an LLM-as-judge over a customer's codebase and caches the verdict so a re-run on the same commit is free. The cache key is (workspace, codebase, pack_id, pack_version, resolved_sha). It is missing one field: the model version. The first time the underlying model is upgraded (a snapshot bump, a provider's silent rollout), every cached verdict from the prior model continues to be served as authoritative compliance evidence. There is no eviction, because nothing in the key changed. The new model is never asked. The customer's audit log shows clean verdicts produced by a model that no longer exists.
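A minimal sketch of the missing field, assuming an in-memory cache. VerdictKey, VerdictCache, and the example values are hypothetical; only the five original key fields and the idea of a model version come from the text above.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class VerdictKey:
    workspace: str
    codebase: str
    pack_id: str
    pack_version: str
    resolved_sha: str
    model_version: str  # the field the key described above leaves out


class VerdictCache:
    """Toy in-memory cache keyed on the full VerdictKey (illustrative only)."""

    def __init__(self) -> None:
        self._store: dict[VerdictKey, str] = {}

    def get(self, key: VerdictKey) -> Optional[str]:
        return self._store.get(key)

    def put(self, key: VerdictKey, verdict: str) -> None:
        self._store[key] = verdict


cache = VerdictCache()
old_key = VerdictKey("ws", "repo", "pack-a", "1.0", "abc123", "model-v1")
cache.put(old_key, "pass")

# Same workspace, pack, and commit, but a newer model: the lookup misses,
# so the new model is asked instead of inheriting the old verdict.
new_key = replace(old_key, model_version="model-v2")
print(cache.get(old_key))
print(cache.get(new_key))
```

This only helps if model_version holds a pinned snapshot identifier. An alias that the provider repoints would leave the key unchanged and reproduce the same stale-verdict problem.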

Neither of these is bad engineering by the conventional definition. Both files are well-structured, well-tested, well-commented. Both passed review.

Section 03

Why the bill compounds

Each gap, taken alone, looks like reasonable triage. When a reviewer on the first feature asked about a race condition in the get-or-create path, the engineer answered:

“To frame the scope of this PR: the goal here is to make sure the database never enters an invalid state and all race conditions are handled — not necessarily gracefully yet. How we present these to the user (retry logic, advisory locks, etc.) we can loop back on.”

When a different reviewer asked about a SIGKILL gap in the in-memory rollback buffer, the same engineer wrote:

“Added a 5-minute default timeout… Doesn't cover pod-crash recovery (thread stuck in thinking when the process dies) — that's a separate problem we've deferred.”

Read in isolation, every one of these deferrals is responsible. Stacked, they describe a different system: one that survives the happy path beautifully and silently corrupts itself under any failure the team has not yet found time to model. The conditional UPDATE … WHERE status = 'open' mutex holds the row lock only for the UPDATE itself, not for the seconds to minutes of LLM processing that follow. The PostgreSQL NOTIFY/LISTEN bus runs on a single connection that never reconnects; a connection drop means every browser subscriber goes silently dark while still receiving heartbeats. The difflib-based position reconciler preserves column offsets unchanged after a line is rewritten, leaving anchors pointed at characters that no longer mean what they meant. Cost meters are hard-coded to zero, so there is no per-request observability into spend.
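The column-offset failure is easy to reproduce with the standard library. The sketch below is a hypothetical reconstruction of the failure mode, not the reviewed reconciler: remap_anchor and anchor_still_valid are illustrative names, and anchors are assumed to be zero-based (line, column) pairs. It maps the line index through difflib's opcodes and carries the column over untouched, which is the behavior described above.

```python
import difflib
from typing import Optional


def remap_anchor(old_lines: list[str], new_lines: list[str],
                 line: int, col: int) -> Optional[tuple[int, int]]:
    """Naive remap: move the line index via difflib, keep the column as-is."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if i1 <= line < i2:
            same_size = (i2 - i1) == (j2 - j1)
            if tag == "equal" or (tag == "replace" and same_size):
                return (j1 + (line - i1), col)  # column carried over blindly
            return None  # line deleted or block reshaped: no mapping
    return None


def anchor_still_valid(old_lines: list[str], new_lines: list[str],
                       old: tuple[int, int], new: tuple[int, int],
                       length: int) -> bool:
    """Check that the anchored text is the same before and after the remap."""
    (ol, oc), (nl, nc) = old, new
    return old_lines[ol][oc:oc + length] == new_lines[nl][nc:nc + length]


old_src = ["x = 1", "    return sum(items)"]
new_src = ["x = 1", "    value = sum(items); return value"]

anchor = (1, 11)  # points at "sum" in the old text
moved = remap_anchor(old_src, new_src, *anchor)
print(moved)
print(repr(old_src[1][11:14]), repr(new_src[1][11:14]))
print(anchor_still_valid(old_src, new_src, anchor, moved, 3))
```

A check like anchor_still_valid is one possible guard: when the text under the anchor changes, the anchor can be re-resolved or dropped instead of silently pointing at the wrong characters.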

None of this is hidden. All of it is shipped.

Section 05

What this means for teams building AI products

Three things to require of yourself before you trust your own production AI code:

Read the SDK and the library before you reinvent its primitives.

If the framework you depend on offers a native interrupt, a native validator, a native event, the burden is on you to justify why you wrote a polling loop instead.

Treat every “we can loop back on that” as a permanent commitment.

The deferral is not a delay. It is the design, until proven otherwise.

Get a human who understands both halves in the review loop.

The LLM cannot tell you that your AI-shaped code is missing the AI-shaped defenses. Another engineer with both the SWE and the AI-engineering background can. Until that person reads the diff, the diff is not reviewed.

Section 06

The companion technical paper

For the file-and-line walkthrough of every gap named here (the file-poll retry loop, the cache key missing model_version, the conditional-UPDATE mutex's true scope, the NOTIFY/LISTEN reconnect bug, the difflib column-anchor failure, the zeroed cost meter), read the companion technical whitepaper, The Subprocess Contract: Production Patterns for LLM-Integrated Systems.
