kellerai.blog

A Model Cannot Classify the Risk of Its Own Actions

The thesis: the highest-leverage AI work is upstream of the model.

KellerAI White Paper · Earned Autonomy & Agents · May 2026

Context

When an agent gets access to an external action (a payment, a deployment, a deletion), the model cannot reliably self-classify the blast radius of what it is about to do. An out-of-process gate must do that classification instead. Permission boundaries, workspace scoping, capability tiers, and action traces compose into a working access-control regime for production agents, and these patterns are already live in deployed systems rather than confined to theory.

The Finding

Access engineering is a load-bearing safety primitive upstream of model capability: three independent enforcement layers (identity, permission, resource), a conjunctive permission lattice, and an audit trail turn silent scope expansion into a detectable event. The honest frame: today's deployments still lack escape-rate measurement and backtested verifiers.

Tags: Authorization, Agent Governance, Access Control, Workspace Scoping, Audit & Trace
Paper Details
Category: Earned Autonomy & Agents
Audience: Architects, security engineers, and platform teams deploying multi-agent systems or planning agent IAM at scale.
Method: Analytical · evidence-based
Date: May 2026
Authors: KellerAI
Section 01

The agent cannot gate itself.

Every production agent operates inside an authorization gap. The model knows what it was asked to do. It does not know what it is permitted to do — not with the rigor a financial or infrastructure control requires. Prompt instructions are not access policies. A system prompt that says "only process refunds under $500" is a suggestion the model can misread, misapply, or be manipulated into ignoring.

The gap matters most at the edges: high-value transactions, irreversible operations, cross-tenant actions. The moments when an authorization error is most costly are the same moments when a reasoning model, facing adversarial or ambiguous input, is least reliable as its own enforcer.

This is not an indictment of the model. It is an architectural observation. The model is the wrong layer for authorization. The right layer is an out-of-process gate that the model cannot override — one that evaluates identity, permission scope, and resource context before any external action is executed.
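As a minimal sketch of the difference between a prompt instruction and an enforced policy, the Python below treats the model's output as an untrusted proposal and checks the refund ceiling in gate code the model never touches. The names `ProposedAction`, `REFUND_CEILING_CENTS`, and `gate_refund` are hypothetical illustrations, not taken from the deployed system, and a real gate would run in a separate process or service rather than as an in-process function.

```python
from dataclasses import dataclass

# Assumption: the $500 ceiling from the refund example, held in gate-side
# configuration instead of in the system prompt.
REFUND_CEILING_CENTS = 500_00


@dataclass(frozen=True)
class ProposedAction:
    """What the model asks for. Treated as untrusted input."""
    kind: str
    amount_cents: int
    model_says_allowed: bool  # the model's own claim; the gate ignores it


def gate_refund(action: ProposedAction) -> bool:
    """Return True only if the refund is strictly under the enforced ceiling."""
    if action.kind != "refund":
        return False
    if action.amount_cents <= 0:
        return False
    # The decision uses gate-side policy only, never model_says_allowed.
    return action.amount_cents < REFUND_CEILING_CENTS
```

The design point is that `model_says_allowed` has no effect on the result: a manipulated or mistaken model can claim anything, and the outcome is still decided by the ceiling the gate holds.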

Section 02

Three enforcement layers, composed.

A working access-control regime for agents does not look like a single permission flag. It looks like three enforcement layers that must all pass before an action reaches the external world.

Identity. Every agent call carries a verifiable identity — not a session token the model generated, but a cryptographically bound credential issued at provisioning time. The identity determines which capability tier the agent operates in and which workspaces it can touch.

Permission. Capabilities are not granted individually. They compose through a conjunctive lattice: an agent that holds payments:write and accounts:read can initiate a refund, but only if the target account is in its assigned workspace. No single permission grants cross-workspace reach. The conjunction is enforced by the gate, not asserted by the model.

Resource. Even a permitted action on a permitted resource is gated by context — transaction ceiling, rate limit, reversibility class. A DELETE on a soft-deletable record clears. The same call on a hard-delete path escalates to a human confirmation queue before execution.
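The three layers above can be sketched as three checks that must all pass in sequence. This is an illustrative composition under stated assumptions, not the deployed implementation: the HMAC credential stands in for whatever cryptographic binding a real system uses, the 500_00 ceiling is borrowed from the earlier refund example, and `Agent`, `Action`, `issue_credential`, and `gate` are hypothetical names.

```python
import hashlib
import hmac
from dataclasses import dataclass

# Assumption: a provisioning secret held by the gate, never by the model.
PROVISIONING_KEY = b"demo-key-not-for-production"


@dataclass(frozen=True)
class Agent:
    agent_id: str
    credential: str          # hex HMAC issued at provisioning time
    permissions: frozenset   # e.g. {"payments:write", "accounts:read"}
    workspaces: frozenset    # workspaces the agent is assigned to


@dataclass(frozen=True)
class Action:
    required: frozenset      # permissions that must ALL be held
    workspace: str           # workspace that owns the target resource
    amount_cents: int
    hard_delete: bool        # reversibility class of the operation


def issue_credential(agent_id: str) -> str:
    return hmac.new(PROVISIONING_KEY, agent_id.encode(), hashlib.sha256).hexdigest()


def check_identity(agent: Agent) -> bool:
    expected = issue_credential(agent.agent_id)
    # Constant-time comparison of the presented and expected credentials.
    return hmac.compare_digest(expected.encode(), agent.credential.encode())


def check_permission(agent: Agent, action: Action) -> bool:
    # Conjunction: every required permission AND workspace membership.
    return action.required <= agent.permissions and action.workspace in agent.workspaces


def check_resource(action: Action, ceiling_cents: int = 500_00) -> str:
    if action.hard_delete:
        return "escalate"    # route to a human confirmation queue
    if action.amount_cents > ceiling_cents:
        return "deny"
    return "allow"


def gate(agent: Agent, action: Action) -> str:
    if not check_identity(agent):
        return "deny"
    if not check_permission(agent, action):
        return "deny"
    return check_resource(action)
```

Under these assumptions, an agent holding payments:write and accounts:read in one workspace gets "allow" for a small refund there, "deny" for the same refund in another workspace, and "escalate" for any hard-delete path.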

Every action that clears all three layers is written to an immutable trace. Not a log the agent writes — a trace the gate writes, independently, before returning the response. The trace is the audit artifact SR 26-2 governance requires.
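One common way to make a gate-written trace tamper-evident is to chain entries by hash, so that each entry commits to the one before it. The sketch below assumes that approach; the source does not say how the deployed trace achieves immutability, and `GateTrace` and its record fields are hypothetical. Hash chaining only makes edits detectable, so a production system would still need write-once storage or external anchoring.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: Dict) -> str:
    # Canonical JSON so the same record always hashes the same way.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()


class GateTrace:
    """Append-only trace where each entry commits to the one before it."""

    def __init__(self) -> None:
        self._entries: List[Dict] = []

    def append(self, record: Dict) -> str:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        entry_hash = _digest(prev_hash, record)
        self._entries.append(
            {"record": dict(record), "prev": prev_hash, "hash": entry_hash}
        )
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain and report whether every link still matches."""
        prev_hash = GENESIS
        for entry in self._entries:
            if entry["prev"] != prev_hash:
                return False
            if _digest(prev_hash, entry["record"]) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True
```

In this sketch the gate would call `append` before returning its response to the agent, and an auditor would call `verify` to confirm that no recorded action was altered after the fact.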

Section 03

Silent scope expansion is the quiet failure mode.

The failure mode that access engineering is specifically designed to prevent is not dramatic. It does not announce itself. It looks like this: a stronger model replaces a weaker one, inheriting the same capability tier. The new model is better at reasoning, better at tool use, better at recovering from ambiguous instructions. It begins reaching further — not because its permissions changed, but because it is more capable of using the permissions it already has.

Scope expansion through model upgrade is not a security breach. No access control was violated. Every action the new model took was permitted. The problem is that the risk profile of the tier shifted without any deliberate human decision to expand it. The access regime was designed for a less capable actor and was never recalibrated.

This is the gap that today's deployments have not closed. There is no mechanism to measure the effective escape rate of a capability tier across model versions, and no backtest harness to simulate what a new model would have done with the same permissions against historical inputs. The in-depth companion documents both where these controls are implemented and where they are not.
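To make the missing backtest concrete, the sketch below replays historical inputs through two stand-in models and counts how often the newer one proposes a permitted action the older one did not. Everything here is an assumption for illustration: the source does not define how escape rate would be measured, so `expansion_rate` is only a hypothetical proxy, and `Policy` models a model version as a plain function from input text to a set of proposed action labels.

```python
from typing import Callable, FrozenSet, Iterable

# Assumption: a model version is represented as a function from an input
# to the set of action labels it would propose for that input.
Policy = Callable[[str], FrozenSet[str]]


def expansion_rate(
    inputs: Iterable[str],
    baseline: Policy,
    candidate: Policy,
    permitted: FrozenSet[str],
) -> float:
    """Fraction of inputs where the candidate proposes a permitted action
    that the baseline did not propose for the same input."""
    total = 0
    expanded = 0
    for text in inputs:
        total += 1
        new_actions = (candidate(text) - baseline(text)) & permitted
        if new_actions:
            expanded += 1
    return expanded / total if total else 0.0
```

A nonzero result under unchanged permissions would be one signal that the tier's risk profile has shifted with the model upgrade, which is the recalibration trigger the current deployments lack.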

Section 04

The in-depth: what the architecture actually looks like.

The patterns described here are not theoretical. They are drawn from a deployed system operating across payment, deployment, and data-deletion surfaces. The in-depth companion works through the full architecture: the 16-permission lattice, workspace isolation mechanics, action-trace schema, capability tier assignment, and the honest accounting of what is and is not yet instrumented.

It also covers the regulatory frame — where SR 26-2 model-risk governance intersects with agentic authorization, what a model-risk examiner would ask to see, and which of those artifacts the current architecture produces versus which remain gaps. If your organization is evaluating agent deployment in a supervised financial or healthcare context, the in-depth is the operative document.

