kellerai.blog

THE KELLERAI BLOG · ESTABLISHED 2026

Writing for engineers who ship, and still benchmark.

Six papers on the failure modes that matter most — observability, robustness, autonomy, and the assumptions teams never write down.


Papers

The Audit Trail Nobody Acts On
Logging what your agentic system did is necessary. Wiring that log to a gate that can block a bad output is governance.
~1,500 words · ~8 min read
10 § · 27 refs · ~30 min read
The Fable 5 Token Economy
Token spend and usage limits are not two problems. They are one budget, governed by one rule, and a frontier-tier model makes every term in it bigger.
~800 words · ~4 min read
12 § · 30 refs · ~21 min read
The Application You Can Audit
What we send is what every party evaluates: the same artifact for the human reviewer and the machine that reads it first.
~1,100 words · ~6 min read
10 § · 21 refs · ~24 min read
When the Model Changes Mid-Request
Fable 5 can hand your request to Opus 4.8, and tells you so. The question is what your team does with a model change that happens inside a single request.
~1,350 words · ~7 min read
8 § · 24 refs · ~20 min read
When Access Is the Safeguard
Who gets the unrestricted model?
~1,400 words · ~7 min read
8 § · 28 refs · ~20 min read
When the Vendor Grades Itself: The Safety Number You Cannot Check
Anthropic attached a safety number to its biggest launch. You cannot compute it, check it, or appeal it, and your workload's number is the one that matters.
~1,400 words · ~7 min read
8 § · 29 refs · ~20 min read
When the Rate Card Has Tiers
Anthropic priced the frontier at exactly double the model you already run. Your bill is now a function of which tier you bought, what your workload triggers, and how many turns your agents burn.
~1,450 words · ~8 min read
8 § · 28 refs · ~20 min read
The MCP Supply Chain You Forgot to Govern
Connecting an agent to a public MCP server inherits a by-design execution surface that the protocol's maintainer declined to patch.
~1,550 words · ~8 min read
9 § · 26 refs · ~32 min read
The Eval That Doesn't Follow the Model to Production
Your eval is a measurement. Frontier models can recognize when they are being measured and adjust their behavior accordingly.
~1,650 words · ~9 min read
8 § · 27 refs · ~37 min read
The Audit Field Was Always Empty
On 2 August 2026 the EU AI Act stops being a paper and starts assigning penalties to the same audit-trail gaps your engineering already has.
~1,650 words · ~9 min read
8 § · 21 refs · ~35 min read
The Human Override That Isn't
A wave of state laws requires a human to decide alongside clinical AI, but not to record that the human did.
~1,500 words · ~8 min read
9 § · 27 refs · ~35 min read
The Token You Didn't Count
A model upgrade can keep the same rate card while delivering up to 35% more billing, with no pre-migration signal.
~1,500 words · ~8 min read
9 § · 26 refs · ~35 min read
The Drift You Cannot See Until It Costs You
The model version did not change. The behavior did.
~1,500 words · ~8 min read
9 § · 22 refs · ~35 min read
The Protocol Stack Nobody Audited
Four agent protocols, four threat models, and the audit gap in the seams between them.
~1,500 words · ~8 min read
9 § · 24 refs · ~35 min read
Capability Convergence and the Vendor Dependency Trap
When quality stops separating vendors, switching cost moves to where you stopped looking.
~1,500 words · ~8 min read
9 § · 26 refs · ~35 min read
What Changes When the Model Changes
A model upgrade is a controlled change, not a drop-in, and the vendor's benchmark is not your validation.
~1,500 words · ~8 min read
8 § · 24 refs · ~33 min read
When a "Modest" Model Release Isn't
A vendor can call a release modest and be right about the model, and you can still be wrong to treat it as a drop-in.
~1,250 words · ~7 min read
8 § · 61 refs · ~20 min read
The Bill Always Comes: Why "Enterprise-Grade" AI Code Often Isn't
When engineers apply standard software practices to AI problems without a firm grasp of both, the architectural mistakes pile up in the blind spot of their productivity.
~1,180 words · ~6 min read
8 § · 46 refs · ~40 min read
Trust but Verify: The Aviation Standard for Engineering with AI
Aviation didn't get safer by trusting pilots more; it built verification structures around them. AI-assisted engineering needs the same discipline.
~1,170 words · ~6 min read
8 § · 46 refs · ~39 min read
Citations or Guesses: The Five-Pass Rule and the Standard Behind It
One charitable LLM read of a document is a stylistic match against a corpus that does not contain the system in front of it, and KellerAI rejects that posture at the boundary.
~1,180 words · ~6 min read
12 § · 114 refs · ~39 min read
The Robustness Illusion
"Does not crash" is not "works correctly."
~990 words · ~5 min read
14 § · 18 refs · ~36 min read
Observability Theater
Empty telemetry fields train operators to trust signals that carry no information.
~960 words · ~5 min read
13 § · 20 refs · ~35 min read
Detecting Over-Simplification in Prompts, Skills, Agents & Plugins
When you compress an artifact, aggregate pass rates lie.
~1,000 words · ~5 min read
12 § · 17 refs · ~35 min read
Fixpoints, Not Fixes
Resource-lifecycle bugs don't have a fix; they have a fixpoint.
~980 words · ~5 min read
13 § · 22 refs · ~32 min read
The Assumption You Didn't Write Down
How invisible SDK behavioral contracts become production incidents.
~1,000 words · ~5 min read
14 § · 23 refs · ~35 min read
KOTH and Oracle: Selection and Improvement at Scale
When an ecosystem grows to hundreds of agents, picking the right one and making it better require infrastructure, not intuition.
~1,600 words · ~8 min read
13 § · 47 refs · ~38 min read
The Thinking Moat
Why machine-enforced reasoning chains are a durable competitive advantage.
~950 words · ~5 min read
14 § · 25 refs · ~36 min read
The Audit You Can Audit
Why most codebase health checks change nothing, and what it takes to trust one.
~2,100 words · ~11 min read
10 § · 32 refs · ~49 min read
Why Self-Improving AI Needs a Trust Dial
AI systems are starting to improve themselves. Here is why that is dangerous, and what a fix looks like.
~960 words · ~5 min read
14 § · 21 refs · ~40 min read
Why Your AI Skills Need Evidence
An AI skill that demos well can still be quietly useless, because a stochastic model run once proves nothing.
~1,050 words · ~6 min read
11 § · 32 refs · ~38 min read
Access Engineering · Forthcoming
The thesis that the highest-leverage AI work is upstream of the model.
Super RAG v2.1 · Forthcoming
Retrieval-augmented generation beyond the naive pipeline.
Multi-Agent Patterns · Forthcoming
Coordination, trust, and failure isolation in agent networks.