Aviation stopped asking whether a twin-engine jet could cross an ocean and started asking how far it had earned the right to fly. AI agents need the same envelope.
Latest
Every KellerAI paper, newest first.
40 papers • 38 releases
The ETOPS rule is not "fly farther." It is "never fly past a reachable safe harbour" — and the same rule should govern every autonomous agent.
Why a wider autonomy budget is something you earn from failure-rate data — the ETOPS lesson for AI agents.
Assurance for an AI agent must attach to what it does, not to the model it runs.
Agent-to-agent capital allocation has arrived. The supervision layer it needs has not.
The Federal Reserve published the model-risk inventory schema. A bank can just use it.
The AI field has been asking the wrong question. Aviation and banking each solved the underlying engineering problem decades ago — under regulatory compulsion, at enormous cost, and with a precision the AI industry has yet to borrow.
Logging what your agentic system did is necessary. Wiring that log to a gate that can block a bad output is governance.
Token spend and usage limits are not two problems. They are one budget, governed by one rule, and a frontier-tier model makes every term in it bigger.
What we send is what every party evaluates: the same artifact for the human reviewer and the machine that reads it first.
Fable 5 can hand your request to Opus 4.8 — and tells you so. The question is what your team does with a model change that happens inside a single request.
Who gets the unrestricted model?
Anthropic attached a safety number to its biggest launch. You cannot compute it, check it, or appeal it — and your workload's number is the one that matters.
Anthropic priced the frontier at exactly double the model you already run. Your bill is now a function of which tier you bought, what your workload triggers, and how many turns your agents burn.
Connecting an agent to a public MCP server means inheriting a by-design execution surface that the protocol's maintainer declined to patch.
Your eval is a measurement. Frontier models can recognize when they are being measured — and adjust their behavior accordingly.
On 2 August 2026 the EU AI Act stops being a paper and starts assigning penalties — to the same audit-trail gaps your engineering already has.
A wave of state laws requires a human to decide alongside clinical AI — but not to record that the human did.
A model upgrade can keep the same rate card and still raise your bill by up to 35%, with no pre-migration signal.
The model version did not change. The behavior did.
Four agent protocols, four threat models, and the audit gap in the seams between them.
When quality stops separating vendors, switching cost moves to where you stopped looking.
A model upgrade is a controlled change, not a drop-in — and the vendor's benchmark is not your validation.
A vendor can call a release modest and be right about the model — and you can still be wrong to treat it as a drop-in.
When engineers apply standard software practices to AI problems without a firm grasp of both, the architectural mistakes pile up in the blind spot of their productivity.
Aviation didn't get safer by trusting pilots more — it built verification structures around them. AI-assisted engineering needs the same discipline.
One charitable LLM read of a document is a stylistic match against a corpus that does not contain the system in front of it — and KellerAI rejects that posture at the boundary.
"Does not crash" is not "works correctly."
Empty telemetry fields train operators to trust signals that carry no information.
When you compress an artifact, aggregate pass rates lie.
Resource-lifecycle bugs don't have a fix — they have a fixpoint.
How invisible SDK behavioral contracts become production incidents.
The thesis that the highest-leverage AI work is upstream of the model.
Retrieval-augmented generation beyond the naive pipeline.
Coordination, trust, and failure isolation in agent networks.
When an ecosystem grows to hundreds of agents, picking the right one and making it better require infrastructure, not intuition.
Why machine-enforced reasoning chains are a durable competitive advantage.
Why most codebase health checks change nothing — and what it takes to trust one.
AI systems are starting to improve themselves. Here is why that is dangerous, and what a fix looks like.
An AI skill that demos well can still be quietly useless — because a stochastic model run once proves nothing.