THE KELLERAI BLOG · ESTABLISHED 2026

Writing for engineers who ship, and still benchmark.

Six papers on the failure modes that matter most — observability, robustness, autonomy, and the assumptions teams never write down.


Papers

The Audit Trail Nobody Acts On

Logging what your agentic system did is necessary. Wiring that log to a gate that can block a bad output is governance.

~1,500 words · ~8 min read → 10 § · 27 refs · ~30 min read

Read →

The Fable 5 Token Economy

Token spend and usage limits are not two problems. They are one budget, governed by one rule, and a frontier-tier model makes every term in it bigger.

~800 words · ~4 min read → 12 § · 30 refs · ~21 min read

Read →

The Application You Can Audit

What we send is what every party evaluates: the same artifact for the human reviewer and the machine that reads it first.

~1,100 words · ~6 min read → 10 § · 21 refs · ~24 min read

Read →

When the Model Changes Mid-Request

Fable 5 can hand your request to Opus 4.8, and tells you so. The question is what your team does with a model change that happens inside a single request.

~1,350 words · ~7 min read → 8 § · 24 refs · ~20 min read

Read →

When Access Is the Safeguard

Who gets the unrestricted model?

~1,400 words · ~7 min read → 8 § · 28 refs · ~20 min read

Read →

When the Vendor Grades Itself: The Safety Number You Cannot Check

Anthropic attached a safety number to its biggest launch. You cannot compute it, check it, or appeal it, and your workload's number is the one that matters.

~1,400 words · ~7 min read → 8 § · 29 refs · ~20 min read

Read →

When the Rate Card Has Tiers

Anthropic priced the frontier at exactly double the model you already run. Your bill is now a function of which tier you bought, what your workload triggers, and how many turns your agents burn.

~1,450 words · ~8 min read → 8 § · 28 refs · ~20 min read

Read →

The MCP Supply Chain You Forgot to Govern

Connecting an agent to a public MCP server inherited a by-design execution surface the protocol's maintainer declined to patch.

~1,550 words · ~8 min read → 9 § · 26 refs · ~32 min read

Read →

The Eval That Doesn't Follow the Model to Production

Your eval is a measurement. Frontier models can recognize when they are being measured, and adjust their behavior accordingly.

~1,650 words · ~9 min read → 8 § · 27 refs · ~37 min read

Read →

The Audit Field Was Always Empty

On 2 August 2026 the EU AI Act stops being a paper and starts assigning penalties, to the same audit-trail gaps your engineering already has.

~1,650 words · ~9 min read → 8 § · 21 refs · ~35 min read

Read →

The Human Override That Isn't

A wave of state laws requires a human to decide alongside clinical AI, but not to record that the human did.

~1,500 words · ~8 min read → 9 § · 27 refs · ~35 min read

Read →

The Token You Didn't Count

A model upgrade can keep the same rate card while delivering up to 35% more billing, with no pre-migration signal.

~1,500 words · ~8 min read → 9 § · 26 refs · ~35 min read

Read →

The Drift You Cannot See Until It Costs You

The model version did not change. The behavior did.

~1,500 words · ~8 min read → 9 § · 22 refs · ~35 min read

Read →

The Protocol Stack Nobody Audited

Four agent protocols, four threat models, and the audit gap in the seams between them.

~1,500 words · ~8 min read → 9 § · 24 refs · ~35 min read

Read →

Capability Convergence and the Vendor Dependency Trap

When quality stops separating vendors, switching cost moves to where you stopped looking.

~1,500 words · ~8 min read → 9 § · 26 refs · ~35 min read

Read →

What Changes When the Model Changes

A model upgrade is a controlled change, not a drop-in, and the vendor's benchmark is not your validation.

~1,500 words · ~8 min read → 8 § · 24 refs · ~33 min read

Read →

When a "Modest" Model Release Isn't

A vendor can call a release modest and be right about the model, and you can still be wrong to treat it as a drop-in.

~1,250 words · ~7 min read → 8 § · 61 refs · ~20 min read

Read →

The Bill Always Comes: Why "Enterprise-Grade" AI Code Often Isn't

When engineers apply standard software practices to AI problems without a firm grasp of both, the architectural mistakes pile up in the blind spot of their productivity.

~1,180 words · ~6 min read → 8 § · 46 refs · ~40 min read

Read →

Trust but Verify: The Aviation Standard for Engineering with AI

Aviation didn't get safer by trusting pilots more; it built verification structures around them. AI-assisted engineering needs the same discipline.

~1,170 words · ~6 min read → 8 § · 46 refs · ~39 min read

Read →

Citations or Guesses: The Five-Pass Rule and the Standard Behind It

One charitable LLM read of a document is a stylistic match against a corpus that does not contain the system in front of it, and KellerAI rejects that posture at the boundary.

~1,180 words · ~6 min read → 12 § · 114 refs · ~39 min read

Read →

The Robustness Illusion

"does not crash" is not "works correctly."

~990 words · ~5 min read → 14 § · 18 refs · ~36 min read

Read →

Observability Theater

Empty telemetry fields train operators to trust signals that carry no information.

~960 words · ~5 min read → 13 § · 20 refs · ~35 min read

Read →

Detecting Over-Simplification in Prompts, Skills, Agents & Plugins

When you compress an artifact, aggregate pass rates lie.

~1,000 words · ~5 min read → 12 § · 17 refs · ~35 min read

Read →

Fixpoints, Not Fixes

Resource-lifecycle bugs don't have a fix; they have a fixpoint.

~980 words · ~5 min read → 13 § · 22 refs · ~32 min read

Read →

The Assumption You Didn't Write Down

How invisible SDK behavioral contracts become production incidents.

~1,000 words · ~5 min read → 14 § · 23 refs · ~35 min read

Read →

KOTH and Oracle: Selection and Improvement at Scale

When an ecosystem grows to hundreds of agents, picking the right one and making it better require infrastructure, not intuition.

~1,600 words · ~8 min read → 13 § · 47 refs · ~38 min read

Read →

The Thinking Moat

Why machine-enforced reasoning chains are a durable competitive advantage.

~950 words · ~5 min read → 14 § · 25 refs · ~36 min read

Read →

The Audit You Can Audit

Why most codebase health checks change nothing, and what it takes to trust one.

~2,100 words · ~11 min read → 10 § · 32 refs · ~49 min read

Read →

Why Self-Improving AI Needs a Trust Dial

AI systems are starting to improve themselves. Here is why that is dangerous, and what a fix looks like.

~960 words · ~5 min read → 14 § · 21 refs · ~40 min read

Read →

Why Your AI Skills Need Evidence

An AI skill that demos well can still be quietly useless, because a stochastic model run once proves nothing.

~1,050 words · ~6 min read → 11 § · 32 refs · ~38 min read

Read →

Access Engineering · Forthcoming

The thesis that the highest-leverage AI work is upstream of the model.

Super RAG v2.1 · Forthcoming

Retrieval-augmented generation beyond the naive pipeline.

Multi-Agent Patterns · Forthcoming

Coordination, trust, and failure isolation in agent networks.