The Inversion: The Vendor's Eval Is Not Your Governance
There is a comfortable reading of the agent procurement decision, and it goes like this. A frontier lab spends nine figures training a model, runs an exhaustive battery of capability and safety evaluations, publishes a model card, submits to red-team exercises, and ships the result behind an enterprise contract with indemnities and a service-level agreement. The deploying institution, on this reading, inherits all that diligence the moment it signs. The hard part of governance has been done upstream by people with more compute and more expertise than any single deployer could muster, and what remains is plumbing. 1
That description is true in every particular, and it misses the point entirely. The vendor's eval is evidence about the vendor's artifact under the vendor's conditions, measured against the vendor's notion of what the model is for. It is not the deployer's governance, because governance is not a property of the model at all. It is a property of the institution that lets the model commit actions in its own name, against its own customers, with its own money and its own regulatory exposure. 1
Buying the model does not buy the accountability. When a deployed agent moves funds, sends correspondence, amends a record, files a regulatory return, or grants an entitlement, the consequence lands on the deployer — not on the laboratory that trained the weights. The deployer owns the residual risk of every model it runs, including the models it did not build and cannot inspect. This is not a moral claim. It is the settled posture of model-risk management, of safety engineering, and of the LLM-Agent Assurance Standard that synthesizes them, and it is the load-bearing rule of this paper. 23
Accountability is non-delegable. You cannot push it downward to a vendor any more than you can push it inward to the builder. The party that lets a model act owns what the model does.
The confusion this paper has to clear is a category error, not a factual one. The vendor's eval and the deployer's governance are both real, both valuable, and aimed at different objects. The eval is a measurement of the artifact: how the weights behave on a benchmark, under a threat model, against a specification the lab chose. Governance is a measurement of an institution's relationship to actions: which effects an agent is permitted to cause, under what verification, with what evidence, answerable to whom. A perfect eval tells you the model is good at what the lab tested; it tells you nothing about whether your institution can reconstruct, defend, and bound what the model does in your name. The two never substitute for each other, because no quantity of artifact-level assurance can constitute a deployment-level relationship the deployer alone is party to.
There is a further reason the substitution fails, and it is structural rather than rhetorical. The vendor optimizes against the vendor's objective and measures against the vendor's distribution; the deployer's task class, customer base, regulatory exposure, and consequence surface are not in that distribution and cannot be, because they are particular to the deployer. The model that scores well on the lab's coding benchmark has never been measured posting your payment instructions to your counterparties under your reconciliation rules. That measurement does not exist until the deployer makes it — which is the whole of the argument that follows.
The Banking Doctrine, by Name and Number
Model-risk management resolved this question long before agents existed, and the resolution is explicit in the supervisory text. SR 26-2 is the interagency US model-risk standard that in April 2026 superseded SR 11-7 after fifteen years as the canon. It is built on effective challenge, meaning credible, independent review with the authority to change the model. That challenge is implemented through a validation lifecycle of conceptual soundness, ongoing monitoring, and outcomes analysis by backtesting, with risk-tiering by materiality and explicit vendor coverage that makes the deploying institution own the risk of models it buys. All five concepts are carried forward from SR 11-7 under its Develop / Validate / Govern pillars. 45 The vendor-coverage clause is the one this paper turns on. Vendor and third-party models are not outside the perimeter; they are inside it, and the institution that deploys them carries the model risk regardless of who built them. 4
The mechanism by which that coverage bites is worth tracing, because it is the same mechanism the agent case needs. SR 11-7's definition of a model is deliberately broad — any quantitative method that turns input data into estimates — and it covers vendor-supplied and third-party models without exception, which is precisely why it applies to deployed language models without amendment. 5 A bank that buys a model it cannot fully validate, explain, or audit does not thereby buy out of the obligation; it inherits a heavier one, because effective challenge now requires it to compensate for the inputs it cannot see. The supervisory text is blunt about the consequence: where a vendor will not disclose enough to permit independent validation, the institution must constrain the model's use, build compensating controls, or not deploy it. The duty does not soften because the supplier is large or the contract carries an indemnity. 5
Two structural commitments follow from that coverage, and both predate SR 26-2 — they are carried forward from SR 11-7 under its Develop / Validate / Govern pillars. 4 The first is the audit trail. A bank cannot defend a model decision by gesturing at a supplier's certificate; it must be able to reconstruct, from its own records, what the model did and why the institution permitted it. The standard requires documentation sufficient for an independent party — an examiner, an internal-audit function, a successor risk team — to retrace the validation and the decision without relying on the original modelers' memory. 5 Documentation is not a courtesy in this regime; it is a control. A decision that cannot be reconstructed is, for governance purposes, a decision that was never validated, however sound it may have been in the moment.
The second commitment is risk-tiering by materiality: the depth of validation, the frequency of review, and the seniority of sign-off all scale with how much the model's failure would cost. Capital is proportional to consequence. A model that prices a trillion-dollar book is not governed like one that ranks marketing leads, and the difference is not a matter of taste but of allocated scrutiny — more validation cycles, more frequent revalidation, a higher approval authority, and a tighter tolerance for outcomes drift. 5 SR 26-2 makes this explicitly risk-based: the intensity of oversight is a function of the model's materiality, not a flat standard applied uniformly. 4 The principle is the one that becomes OBL-TIER-001 below — scrutiny priced to blast radius — stated for static models a decade before anyone had to state it for actions.
It is worth naming the boundary the supervisor itself drew, because the boundary is where this paper begins rather than ends. SR 26-2 places generative and agentic AI expressly outside its formal model-risk scope as “novel and rapidly evolving,” deferring them to broader risk management while the agencies prepare further guidance — so the supervisor itself leaves the autonomous-agent-action gap open. 4 That exclusion is not a permission slip. The obligation to measure and govern an agentic system's behavior does not disappear because the supervisory letter declined to name it; it falls to the deploying institution to build the accounting the guidance left open. The doctrine that you own the vendor's model risk is fully in force; what is missing is the runtime mechanism that stops a wrong action at the moment a bought agent commits it — exactly the gap the LLM-Agent Assurance Standard (LAAS) was written to fill. 23
The Ratings That Were AAA Until They Weren't
The canonical case of outsourced model judgment that the deployer still owned is not an enforcement order. It is the 2007–08 reliance on external credit-rating-agency models. 6 In the run-up to the financial crisis, banks and investors treated AAA ratings from Moody's and Standard & Poor's on structured products — collateralized debt obligations and mortgage-backed securities — as a substitute for their own model-risk assessment. The rating was the eval. It came from a specialized third party with proprietary models, deep data, and a reputation built over a century, and it carried the highest grade the scale allowed. For many institutional buyers the rating was not merely persuasive; it was load-bearing in a way agent vendors' evals now threaten to become — it determined regulatory capital treatment, satisfied investment-mandate eligibility, and stood in, contractually and operationally, for the buyer's own analysis of the collateral. 6
It is worth tracing the sequence of what failed, because the structure recurs exactly in the agent case. First, the rating was an artifact-level assessment that the deployer treated as a deployment-level guarantee. The agency rated the instrument under the agency's assumptions about the housing market, the correlation between the underlying loans, and the behavior of the tranche structure — the vendor's notion of what the model was for, measured against the vendor's conditions. 6 Second, the deploying institution skipped its own validation on the strength of that artifact, because the rating carried the highest grade an external authority could assign and re-deriving it looked redundant. Third, the agency's model assumptions were wrong in a correlated way: it underweighted the probability that geographically dispersed mortgages would default together, so the diversification the tranche structure relied on evaporated precisely when it was needed. 67 Fourth — and this is the part that maps to a vendor-side model change — the agencies revised their methodologies and downgraded en masse, converting a paper safety into a realized loss the deployers had no independent reading to soften.
The ratings were catastrophically wrong, and the failure was not gradual. Tranches rated AAA — nominally as safe as sovereign debt — were downgraded by many notches, sometimes to junk, in a matter of months, and a large share of the highest-rated structured-finance issuance from the peak vintages ultimately impaired. 7 The official post-mortems were unsparing: the Financial Crisis Inquiry Commission concluded that the failures of the rating agencies were “essential cogs in the wheel of financial destruction,” and the US Senate Permanent Subcommittee on Investigations documented how the ratings were produced under conditions that made the highest grade routine rather than earned. 6 The losses did not land on the agencies. They landed on the deploying institutions that had treated the external assessment as a discharge of their own duty to understand the risk on their books. The investor who held the paper owned the loss; the rater who graded it did not. 7
The rating agencies forecast near-zero default on instruments that defaulted in droves. The deployers who substituted that forecast for their own judgment discovered that an external assessment never transferred the obligation — it only deferred the moment they learned they still owned it.
This is the forecast-versus-observed lesson in its purest form. A specialized vendor's prediction of safety, accepted in lieu of independent validation, is not governance — it is a bet that the vendor's model is right, placed by a party who will eat the loss if it is wrong. The deployers held no trace of their own that could have flagged the divergence early, no independent estimate of the residual to debit, and no measured experience base against which the agencies' numbers could have been challenged before the downgrade wave. They had outsourced not only the judgment but the instruments of governance that would have let them notice the judgment was failing. SR 11-7's explicit vendor-model clause was, in significant part, a supervisory response to exactly this failure: the rule that the deploying institution owns vendor model risk exists because a generation of institutions tried to outsource that risk and could not. 58 The reforms that followed reduced the regulatory hard-wiring of external ratings and pushed institutions back toward their own analysis — the codification, in law, of the lesson that a bought assessment is an input to governance, never a substitute for it. 8
Mapping to Agent Autonomy, in Concrete Units
The doctrine becomes operational only in units, and the LLM-Agent Assurance Standard supplies them. 3 The governance unit is the (agent, task-class) pair, and every action that pair attempts is assigned a Consequence Tier from CT0 to CT4 by an out-of-process gate that reads the action's observed effect surface — its actual reversibility, scope, and consequence — never the agent's self-report. 10 CT0 is read-only and merely traced. CT1 is a reversible internal write with a self-check. CT2 is a reversible or low-value external effect requiring an independent automated check or rehearsed rollback. CT3 is a hard-to-reverse or material external effect requiring independent, qualified, pre-commit verification with no self-grading. CT4 is irreversible or high-consequence, requiring all of CT3 plus human approval and an abstention default. 11
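The tier ladder above can be written down as a small lookup table. What follows is a minimal sketch rather than the standard's own schema: the names `ConsequenceTier`, `TierControls`, and the field names are hypothetical, and only the tier descriptions are taken from the text.

```python
from dataclasses import dataclass
from enum import IntEnum


class ConsequenceTier(IntEnum):
    CT0 = 0  # read-only
    CT1 = 1  # reversible internal write
    CT2 = 2  # reversible or low-value external effect
    CT3 = 3  # hard-to-reverse or material external effect
    CT4 = 4  # irreversible or high-consequence


@dataclass(frozen=True)
class TierControls:
    verification: str        # hypothetical label for the check the tier requires
    human_approval: bool     # must a human approve before commit?
    abstain_by_default: bool # does the agent abstain unless approved?


# Controls per tier, as described in the text; every tier is also traced.
CONTROLS = {
    ConsequenceTier.CT0: TierControls("trace_only", False, False),
    ConsequenceTier.CT1: TierControls("self_check", False, False),
    ConsequenceTier.CT2: TierControls(
        "independent_automated_check_or_rehearsed_rollback", False, False
    ),
    ConsequenceTier.CT3: TierControls(
        "independent_qualified_pre_commit_verification", False, False
    ),
    # CT4 is everything CT3 requires, plus human approval and an abstention default.
    ConsequenceTier.CT4: TierControls(
        "independent_qualified_pre_commit_verification", True, True
    ),
}
```

Using an ordered enum keeps comparisons such as "at or above CT3" straightforward, which matters because each obligation later in the paper carries a tier floor.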
The escape rate is the controlled quantity: the rate at which a wrong result survives every applicable check and is acted upon. Where a higher-is-better figure reads more naturally, it is reported as integrity = 1 − escape_rate. 12 The per-tier tolerances are fixed in the bundle: 2% at CT2, 0.5% at CT3, and zero at CT4, where backtesting can never pass and a human gate plus abstention is mandatory. 11 A tolerance is only demonstrable if a backtest over enough trials can show it, so the standard enforces sample-size floors derived from the rule-of-three at 95% confidence with zero observed escapes: at least 149 cases at CT2 and 598 at CT3, below which the result is indeterminate, never a pass. 12
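The sample-size floors can be reproduced with a few lines of arithmetic. The sketch below assumes the floors come from the zero-failure binomial bound, that is, the smallest n with (1 − tolerance)^n ≤ 0.05. That assumption reproduces the quoted 149 and 598, whereas the familiar 3/tolerance shortcut of the rule of three would give the slightly larger 150 and 600. The function names are hypothetical.

```python
import math
from typing import Optional


def sample_floor(tolerance: float, confidence: float = 0.95) -> Optional[int]:
    """Smallest n such that zero escapes in n trials bounds the rate at or below tolerance."""
    if tolerance <= 0.0:
        # A zero tolerance cannot be demonstrated by any finite backtest.
        return None
    alpha = 1.0 - confidence
    # Solve (1 - tolerance)^n <= alpha for n.
    return math.ceil(math.log(alpha) / math.log(1.0 - tolerance))


def zero_escape_upper_bound(trials: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the escape rate after `trials` trials with zero escapes."""
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)


print(sample_floor(0.02))   # CT2 floor
print(sample_floor(0.005))  # CT3 floor
print(sample_floor(0.0))    # CT4: no finite floor, so a human gate is required
```

The `None` result for a zero tolerance is the arithmetic behind the statement that backtesting can never pass at CT4.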
The independence of the check is measured, not asserted. The actor and its verifier must show an error correlation of ρ ≤ 0.2 — the phi coefficient on the joint-error table — with the upper bound of the 95% confidence interval at or below 0.2, computed on a frozen, versioned held-out adversarial eval set of at least 1,000 labeled hard cases. 13 This is the unit that defeats the most seductive shortcut available to a vendor-model deployer: pairing the bought model with a second instance of the same bought model as its own checker. Two siblings of one lineage share blind spots; their errors correlate; and a high-ρ verifier adds no independence no matter how the org chart is drawn. The deployer who wants a real CT3 control must source a verifier whose errors are demonstrably uncorrelated with the vendor's actor — a different lineage, a deterministic checker on the checkable subset, or human review — and prove the ρ bound on its own eval set. 13
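The independence test can be sketched from a joint-error table. The phi coefficient below is the standard formula for a 2x2 table. The confidence bound uses the Fisher z approximation, which is an assumption of this sketch: the text does not say which interval method the standard prescribes. All function names are hypothetical.

```python
import math


def phi_coefficient(both_wrong: int, actor_only: int,
                    verifier_only: int, both_right: int) -> float:
    """Phi coefficient of the actor/verifier joint-error table."""
    a, b, c, d = both_wrong, actor_only, verifier_only, both_right
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("phi is undefined when a row or column of the table is empty")
    return (a * d - b * c) / denom


def phi_upper_bound(phi: float, n: int, z_crit: float = 1.96) -> float:
    """Upper end of an approximate 95% interval (Fisher z; an assumed method)."""
    if abs(phi) >= 1.0:
        return phi  # atanh is undefined at the extremes
    return math.tanh(math.atanh(phi) + z_crit / math.sqrt(n - 3))


def independence_holds(both_wrong: int, actor_only: int, verifier_only: int,
                       both_right: int, rho_max: float = 0.2,
                       min_cases: int = 1000) -> bool:
    """True only if the eval set is large enough and the upper bound clears rho_max."""
    n = both_wrong + actor_only + verifier_only + both_right
    if n < min_cases:
        return False
    phi = phi_coefficient(both_wrong, actor_only, verifier_only, both_right)
    return phi_upper_bound(phi, n) <= rho_max
```

With the illustrative table (10, 40, 40, 910), the point estimate is about 0.158, which sits under 0.2, yet the approximate upper bound is about 0.218, so the pair does not qualify. This is why the requirement is stated on the interval rather than on the point estimate.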
Two further units bound the action in time and scale. The rollback horizon is the bounded time within which a committed action can still be undone; the blast radius is its consequence scale, aggregated cumulatively across actions and agents within a governance unit. 11 And authority is a revocable envelope — an earned range granted per tier on measured evidence that auto-contracts on a backtest breach, the same ratchet the ETOPS earned-range paper describes. 9 For a bought agent, every one of these units is the deployer's to measure, set, and enforce. The vendor supplies none of them.
A worked illustration makes the units concrete. Suppose a bank licenses a frontier model and wants its agent to post payment instructions to a counterparty — an external, hard-to-reverse, material effect. The gate reads that effect surface and derives CT3; the agent's own opinion that the task is “routine” is logged as a warning, not honored. 10 To grant the agent standing authority at CT3, the bank must demonstrate a backtested escape rate at or below 0.5% over at least 598 trials of the payment-posting task class, with zero observed escapes in that floor sample being the only way the rule-of-three interval clears the tolerance. 12 The verifier that pre-commits each instruction — say, a deterministic reconciliation check against the source-of-truth ledger, plus an independent-lineage model on the free-text fields — must show ρ ≤ 0.2 against the actor on a frozen 1,000-case adversarial set. 13 Every posting emits a hash-chained trace recording the tier derivation, the verifier selected, the trust status of the inputs, and the outcome. 15 The vendor's residual — the share of wrong postings the bought model produces that survive the check — is debited to the bank's CT3 escape budget, not the vendor's reputation. 14 If the cumulative blast radius over the rolling window pushes the unit past tolerance, the envelope auto-contracts and the agent drops to proposing-with-human-approval until the rate recovers. 9 None of these numbers exist in the vendor's model card. They are the bank's to produce, because the postings are the bank's to answer for.
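The auto-contracting envelope in that illustration can be sketched as a rolling window over outcomes. This is a deliberately simplified model under stated assumptions: it uses the observed point rate rather than a confidence interval, the window size and the recovery rule (a full window back within tolerance) are hypothetical, and the class name `Envelope` is not from the standard.

```python
from collections import deque


class Envelope:
    """Revocable authority for one (agent, task-class) unit; a hypothetical sketch."""

    def __init__(self, tolerance: float, window: int) -> None:
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # True marks an escape
        self.autonomous = True  # assumes authority was already earned by backtest

    def record(self, escaped: bool) -> None:
        self.outcomes.append(escaped)
        rate = sum(self.outcomes) / len(self.outcomes)
        if rate > self.tolerance:
            # Breach: contract immediately.
            self.autonomous = False
        elif len(self.outcomes) == self.outcomes.maxlen:
            # Recover only once a full window is back within tolerance.
            self.autonomous = True

    def mode(self) -> str:
        return "autonomous" if self.autonomous else "propose_with_human_approval"


env = Envelope(tolerance=0.2, window=4)
for escaped in (False, False, False, True):
    env.record(escaped)
print(env.mode())  # contracted after the escape pushes the window past tolerance
```

Contraction is immediate while recovery waits for a full clean window, so authority is easier to lose than to regain.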
The LAAS Obligations This Paper Owns
Three cross-cutting obligations from the standard carry the weight of this argument. They do not yet exist as stable identifiers in the supervisory corpus; the LAAS in-depth defines the obligation framework — the no-self-grading invariant, vendor attribution, untrusted-input tier-raising, lifecycle, and deterministic precedence — without enumerating them, so this paper introduces the IDs and binds them to that framework. 14
OBL-VEN-001 — Vendor Attribution
When an agent uses a vendor model or a third-party tool, the trace carries provenance and scope limits, the residual error is attributed to the deploying operator's escape-rate budget, and untrusted dependencies fail closed. The vendor's eval does not discharge the obligation. Tier floor: CT3. 14
This is the inversion made mechanical. A vendor model's mistake is not the vendor's reputational problem; it is a debit against the deployer's escape budget at the relevant tier, counted exactly as if the deployer's own model had erred. 14 The accounting symmetry is the whole point: there is no separate, more forgiving ledger for bought models, because the consequence of a wrong action is identical regardless of who trained the weights that produced it. The CT3 floor on this obligation is not arbitrary — it reflects that a vendor model crossing into a hard-to-reverse external effect is exactly the situation where an undebited residual does the most damage, and where the AAA-rating analogy bites hardest. Untrusted dependencies fail closed — when provenance or scope cannot be established, the action does not proceed on optimism. A tool whose output cannot be traced to a known source, or whose authority cannot be bounded, is treated as a tier-raising input rather than a trusted one, and the gate escalates accordingly. 14
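The accounting symmetry and the fail-closed rule can be sketched together. The types below are hypothetical and illustrate only two points from the text: an escape is debited to one ledger regardless of who trained the model, and a dependency without established provenance and scope is not admitted as trusted.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Dependency:
    name: str
    provenance: Optional[str]   # None when the source cannot be established
    scope_limit: Optional[str]  # None when authority cannot be bounded


def admit(dep: Dependency) -> bool:
    """Fail closed: trust a dependency only when provenance and scope are both known."""
    return dep.provenance is not None and dep.scope_limit is not None


@dataclass
class EscapeBudget:
    """One ledger per governance unit; origin is recorded but never changes the debit."""
    trials: int = 0
    escapes: int = 0
    escapes_by_origin: Counter = field(default_factory=Counter)

    def debit(self, escaped: bool, origin: str) -> None:
        self.trials += 1
        if escaped:
            self.escapes += 1
            self.escapes_by_origin[origin] += 1  # attribution for the trace only


budget = EscapeBudget()
budget.debit(escaped=True, origin="vendor_model")
budget.debit(escaped=False, origin="in_house_model")
print(budget.escapes, budget.trials)
```

The per-origin counter exists for reporting and for the trace; the tolerance is still evaluated against the single combined total.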
OBL-TRC-001 — Reconstructable Trace
Every gated action emits an append-only, hash-chained trace sufficient to reconstruct the action, its tier derivation, the verifier selected, the inputs' trust status, and the outcome. Absent a reconstructable trace, the action is non-conformant. Tier floor: CT1. 14
The trace is the SR 26-2 audit trail rendered for runtime actions, and the standard makes its integrity categorical: it must be written to an append-only sink the actor cannot rewrite, with each actor appending to its own hash-chain periodically anchored into a shared Merkle root, so that a verdict from a gate the actor could have tampered with is no verdict at all. 15 A trace asserting an in-process or unsigned gate is non-conformant regardless of its verdict. 15
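A hash chain of the kind described can be sketched with the standard library. This is an illustration of the chaining idea only: the record layout and function names are hypothetical, an in-memory list is not an append-only sink, and the periodic Merkle-root anchoring is omitted.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder predecessor for the first record


def _digest(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same payload always hashes to the same value.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(chain: list, payload: dict) -> dict:
    """Append a trace record whose hash commits to the previous record."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    record = {"prev": prev_hash, "payload": payload, "hash": _digest(prev_hash, payload)}
    chain.append(record)
    return record


def verify_chain(chain: list) -> bool:
    """Recompute every link; any edited payload or reordered record breaks the chain."""
    prev_hash = GENESIS
    for record in chain:
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["payload"]):
            return False
        prev_hash = record["hash"]
    return True


trace = []
append_record(trace, {"action": "post_payment", "tier": "CT3",
                      "verifier": "ledger_reconciliation", "inputs_trusted": True,
                      "outcome": "committed"})
append_record(trace, {"action": "read_balance", "tier": "CT0", "outcome": "ok"})
print(verify_chain(trace))
```

Because each hash covers its predecessor, altering an early record invalidates everything after it. A production trace would additionally need a sink the actor cannot rewrite and the external anchoring the standard describes.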
OBL-TIER-001 — Consequence-Scaled Rigor
Verification rigor scales with the consequence tier the gate derives from the action — the actor only proposes; it never sets its own tier or declares no verifier exists. Capital ∝ consequence: scrutiny is priced to blast radius. Tier floor: CT0. 14
This is materiality-tiering carried from SR 26-2 into the gate. 4 The derivation is ct = max(reversibility_rank, scope_rank, consequence_rank) for any action with an external effect, defaulting to CT4 whenever any axis is undetermined — an unknown reversibility treated as irreversible, an unknown scope as public, an unknown consequence as high. 10 A self-reported tier below the gate's is not a discount; it is a logged warning on a blocked action. 10 The agent proposes; the gate classifies.
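The derivation rule can be sketched directly. The code assumes each axis is ranked on the same 0 to 4 scale as the tiers and that an undetermined axis is represented as `None`; both are assumptions of this sketch, as are the function names.

```python
from typing import List, Optional, Tuple

CT4 = 4


def derive_tier(reversibility_rank: Optional[int],
                scope_rank: Optional[int],
                consequence_rank: Optional[int]) -> int:
    """ct = max of the three axis ranks; any undetermined axis defaults the action to CT4."""
    ranks = (reversibility_rank, scope_rank, consequence_rank)
    if any(rank is None for rank in ranks):
        return CT4
    return max(ranks)


def classify(reversibility_rank: Optional[int],
             scope_rank: Optional[int],
             consequence_rank: Optional[int],
             self_reported: Optional[int] = None) -> Tuple[int, List[str]]:
    """The gate's tier always wins; a lower self-report is logged, never honored."""
    derived = derive_tier(reversibility_rank, scope_rank, consequence_rank)
    warnings = []
    if self_reported is not None and self_reported < derived:
        warnings.append(
            f"self-reported CT{self_reported} is below gate-derived CT{derived}"
        )
    return derived, warnings


print(classify(3, 1, 2, self_reported=1))
print(classify(None, 0, 0))
```

Taking the maximum means one severe axis is enough to raise the tier, and the `None` branch encodes the rule that uncertainty is resolved toward the most conservative classification.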
The Economic Inversion: Discipline Is the Asset
The reflexive objection is that all of this is overhead — a compliance tax that slows the bought agent down and erodes the efficiency that justified buying it. It is the wrong axis. The discipline is not the tax; it is the asset that lets the deployer operate the agent at all, and operate it widely. 1
Consider the alternative posture, where a deployer accepts the vendor's eval as sufficient and runs the agent on trust. That deployer cannot extend the agent's authority safely, because it has no measured evidence of its own to extend it on; it cannot reconstruct an incident, because it kept no trace; and it cannot defend a single CT3 action to an examiner, because it outsourced the judgment that the examiner holds it accountable for. 5 Its agent is stuck at the narrow, ungoverned envelope — useful for low-consequence work and untrustable for anything that matters. 9
The disciplined deployer is in the opposite position. Because it attributes the vendor's residual to its own escape budget and measures it, it can prove the agent's reliability on a task class and earn a wider revocable envelope — longer autonomous runs, higher-consequence actions, fewer human doglegs — exactly the way an ETOPS operator earns range and exactly where the throughput lives. 9 The dogleg an under-trusted agent flies is the chain of human approvals inserted around every consequential step because no one can quantify how often the agent is silently wrong; each gate buys safety, but most of the time it is idle — the reviewer who rubber-stamps a thousand correct actions and stops reading carefully before the one that mattered. 16 The deployer who cannot measure its agent's escape rate has no principled way to remove any of those gates, so it keeps all of them, and the agent's nominal autonomy is fictional — every step still routes through a human.
Governance is what lets the gates be removed selectively and defensibly: keep the human gate where the measured tail risk warrants it, remove it where the demonstrated rate makes it dead weight. 9 The trace that satisfies the auditor is the same trace that lets the deployer extend authority with confidence — there is no separate “compliance artifact” and “operating artifact,” because the reconstructable record that defends a CT3 action to an examiner is the identical record that proves the rate is stable enough to widen the envelope. Governance is what converts a bought model from a liability you tolerate into a capability you can scale. The discipline earns the autonomy; it does not cap it.
The discipline is not the tax on a bought agent — it is the asset that lets you operate it widely. The trace that defends the action to the examiner is the same trace that earns the wider envelope.
Honest Limits
The posture this paper argues for bounds a specific quantity, and intellectual honesty requires naming what it does not bound. 2 Attributing the vendor's residual to your escape budget and backtesting it tells you the rate at which an undetected error survives your checks and is acted upon — relative to a measured experience base and a written specification. It does not tell you the rate of error relative to ground truth, because in the open-world class no exact oracle for ground truth exists; that is precisely why the standard separates the deterministically-checkable bucket, where the escape rate is driven toward zero by an exact verifier, from the open-world bucket, where it is bounded by backtest with a confidence interval. 12
Three boundary conditions are explicit, and each maps to a way the accounting can quietly fail. A backtest's binomial model assumes independent trials, so a correlated burst — a prompt-injection family or a vendor-side model change that fails many cases the same way — can make the interval optimistic until the next re-measurement catches it, which is why the standard re-measures on any model, prompt, tool, or policy change. 12 This boundary is the rating-agency failure stated in the agent's own units: a vendor-side model change that revises behavior across the whole task class is the methodology downgrade that turned AAA paper to junk overnight, and the only defense is the deployer's own continuous re-measurement, not the vendor's assurance that the change is benign. 7 The ground-truth labels must be produced independently of the verifier under test, or the backtest measures agreement rather than escape — the same effective-challenge independence that the companion paper on self-grading develops, applied to the labels rather than the checker. 1217
And the trace proves that the right checks ran, by the right party, with evidence. It is a standard of care, not a correctness guarantee, and conformance never asserts that no error can occur. 2 What the discipline buys is a bounded, attributed, reconstructable residual that is yours to own and defend, not the impossible promise that a bought model will never be wrong. That is the honest scope of the posture: it does not make a bought agent infallible; it makes the agent's reliability a measured, attributed, auditable quantity that can be earned, defended, and, when it drifts, taken back.
The Engineering Posture
This is the third paper of the KellerAI Banking Stack, and it closes a loop the other two open. “You cannot outsource governance” (this paper, The Obligation to Govern) and “you cannot self-grade” — the effective-challenge discipline of independent, low-correlation verification developed in Effective Challenge — are the same refusal stated in two directions: accountability is non-delegable downward to a vendor and inward to the builder alike. 1417 The traffic-light backtesting regime of Backtested, Not Demoed — authority priced in measured failure over enough trials to exclude luck, contracting automatically on a breach — is meaningless if the trace this paper requires cannot reconstruct which actions counted, and a vendor model's errors must enter that same escape budget through OBL-VEN-001 and OBL-AGG-001's cumulative aggregation. 1118 The three articles are one argument with three entry points: independence makes the backtest honest, the backtest prices the authority, and the trace plus vendor attribution make both reconstructable and complete. Together with the LAAS standard's CT lattice and the ETOPS-for-AI stack's earned, revocable envelope, they describe one regime under three vocabularies. 39
The operator's duty is therefore prior to action, not subsequent to incident. Before you let a bought model commit anything: own the trace; attribute the vendor to your own budget; scale scrutiny to blast radius.
Own the trace; attribute the vendor to your own budget; scale scrutiny to blast radius — before you let a bought model commit anything.
The brief companion to this paper — You Cannot Outsource the Obligation to Govern — introduces the core argument in a shorter form. Related papers: Effective Challenge develops the no-self-grading discipline that closes accountability inward; and Backtested, Not Demoed provides the traffic-light regime for pricing authority in measured failure.