THE KELLERAI BLOG · ESTABLISHED 2026
Writing for engineers who ship, and still benchmark.
Six papers on the failure modes that matter most — observability, robustness, autonomy, and the assumptions teams never write down.
Papers
Logging what your agentic system did is necessary. Wiring that log to a gate that can block a bad output is governance.
~1,500 words · ~8 min read → 10 § · 27 refs · ~30 min read
Token spend and usage limits are not two problems. They are one budget, governed by one rule, and a frontier-tier model makes every term in it bigger.
~800 words · ~4 min read → 12 § · 30 refs · ~21 min read
What we send is what every party evaluates: the same artifact for the human reviewer and the machine that reads it first.
~1,100 words · ~6 min read → 10 § · 21 refs · ~24 min read
Fable 5 can hand your request to Opus 4.8 — and tells you so. The question is what your team does with a model change that happens inside a single request.
~1,350 words · ~7 min read → 8 § · 24 refs · ~20 min read
Who gets the unrestricted model?
~1,400 words · ~7 min read → 8 § · 28 refs · ~20 min read
Anthropic attached a safety number to its biggest launch. You cannot compute it, check it, or appeal it — and your workload's number is the one that matters.
~1,400 words · ~7 min read → 8 § · 29 refs · ~20 min read
Anthropic priced the frontier at exactly double the model you already run. Your bill is now a function of which tier you bought, what your workload triggers, and how many turns your agents burn.
~1,450 words · ~8 min read → 8 § · 28 refs · ~20 min read
Connecting an agent to a public MCP server inherited a by-design execution surface the protocol's maintainer declined to patch.
~1,550 words · ~8 min read → 9 § · 26 refs · ~32 min read
Your eval is a measurement. Frontier models can recognize when they are being measured — and adjust their behavior accordingly.
~1,650 words · ~9 min read → 8 § · 27 refs · ~37 min read
On 2 August 2026 the EU AI Act stops being a paper and starts assigning penalties — to the same audit-trail gaps your engineering already has.
~1,650 words · ~9 min read → 8 § · 21 refs · ~35 min read
A wave of state laws requires a human to decide alongside clinical AI — but not to record that the human did.
~1,500 words · ~8 min read → 9 § · 27 refs · ~35 min read
A model upgrade can keep the same rate card while delivering up to 35% more billing, with no pre-migration signal.
~1,500 words · ~8 min read → 9 § · 26 refs · ~35 min read
The model version did not change. The behavior did.
~1,500 words · ~8 min read → 9 § · 22 refs · ~35 min read
Four agent protocols, four threat models, and the audit gap in the seams between them.
~1,500 words · ~8 min read → 9 § · 24 refs · ~35 min read
When quality stops separating vendors, switching cost moves to where you stopped looking.
~1,500 words · ~8 min read → 9 § · 26 refs · ~35 min read
A model upgrade is a controlled change, not a drop-in — and the vendor's benchmark is not your validation.
~1,500 words · ~8 min read → 8 § · 24 refs · ~33 min read
A vendor can call a release modest and be right about the model — and you can still be wrong to treat it as a drop-in.
~1,250 words · ~7 min read → 8 § · 61 refs · ~20 min read
When engineers apply standard software practices to AI problems without a firm grasp of both, the architectural mistakes pile up in the blind spot of their productivity.
~1,180 words · ~6 min read → 8 § · 46 refs · ~40 min read
Aviation didn't get safer by trusting pilots more — it built verification structures around them. AI-assisted engineering needs the same discipline.
~1,170 words · ~6 min read → 8 § · 46 refs · ~39 min read
One charitable LLM read of a document is a stylistic match against a corpus that does not contain the system in front of it — and KellerAI rejects that posture at the boundary.
~1,180 words · ~6 min read → 12 § · 114 refs · ~39 min read
"does not crash" is not "works correctly."
~990 words · ~5 min read → 14 § · 18 refs · ~36 min read
Empty telemetry fields train operators to trust signals that carry no information.
~960 words · ~5 min read → 13 § · 20 refs · ~35 min read
When you compress an artifact, aggregate pass rates lie.
~1,000 words · ~5 min read → 12 § · 17 refs · ~35 min read
Resource-lifecycle bugs don't have a fix — they have a fixpoint.
~980 words · ~5 min read → 13 § · 22 refs · ~32 min read
How invisible SDK behavioral contracts become production incidents.
~1,000 words · ~5 min read → 14 § · 23 refs · ~35 min read
When an ecosystem grows to hundreds of agents, picking the right one and making it better require infrastructure, not intuition.
~1,600 words · ~8 min read → 13 § · 47 refs · ~38 min read
Why machine-enforced reasoning chains are a durable competitive advantage.
~950 words · ~5 min read → 14 § · 25 refs · ~36 min read
Why most codebase health checks change nothing — and what it takes to trust one.
~2,100 words · ~11 min read → 10 § · 32 refs · ~49 min read
AI systems are starting to improve themselves. Here is why that is dangerous, and what a fix looks like.
~960 words · ~5 min read → 14 § · 21 refs · ~40 min read
An AI skill that demos well can still be quietly useless — because a stochastic model run once proves nothing.
~1,050 words · ~6 min read → 11 § · 32 refs · ~38 min read
Read →Access EngineeringForthcomingThe thesis that the highest-leverage AI work is upstream of the model.Coming soon
Super RAG v2.1ForthcomingRetrieval-augmented generation beyond the naive pipeline.Coming soon
Multi-Agent PatternsForthcomingCoordination, trust, and failure isolation in agent networks.Coming soon