Effective Challenge: Independent Validation for Autonomous Agents

KellerAI

Section 01

A Second Model Grading the First Is Not Independence

Two reviewers who fail together are one reviewer. A verifier of the same lineage as the actor routinely shares its training data, its inductive biases, and its susceptibility to the same adversarial inputs and the same prompt injection; so it fails where the actor fails, and a check that is wrong in the same way as the thing it checks is not a check at all. 2 This is not a refinement of the popular practice. It is a refutation of it. The question is never "did a second model look at this?" It is "is the reviewer's residual error statistically independent of the actor's — and was the reviewer competent to grade this in the first place?" Everything in this paper follows from refusing to let "independent" mean merely "a different forward pass."

A second model grading the first is not independence. Independence is structural — separate lineage, separate context, a separate owning function, and demonstrably uncorrelated errors — never a second pass by a sibling of the thing under test.

The load-bearing inversion

Section 02

Effective Challenge, by Name and by Number

Banking did not arrive at this conclusion through theory. It arrived through losses, and it wrote the conclusion into supervisory guidance. The United States model-risk regime rests on a single demand: effective challenge, meaning credible, independent review with the authority to change the model. SR 26-2, the interagency US model-risk standard that in April 2026 superseded SR 11-7 after fifteen years as the canon, implements that demand through a validation lifecycle of conceptual soundness, ongoing monitoring, and outcomes analysis by backtesting, with risk-tiering by materiality and explicit vendor coverage that makes the deploying institution own the risk of models it buys. All five concepts are carried forward from SR 11-7 under its Develop / Validate / Govern pillars. 3 4

Read the definition slowly, because every clause is load-bearing. The review must be credible — performed by someone competent to find the flaw, not a rubber stamp. It must be independent — performed by a validation function organizationally separate from the model's developers, with conflict-of-interest controls so that the reviewer's incentives do not align with the builder's. 3 And it must carry the authority to change the model — the standing to alter or reject it, not merely to file a comment. A challenger who reviews but cannot stop the action is not a control. A challenger who reports to the desk it polices is not independent. A challenger who shares the builder's blind spots is not credible on the cases that matter.

The structure SR 11-7 codified, and SR 26-2 carried forward, is severance. The party that validates a model's outputs must be structurally separate from the party that built it. 4 A model cannot mark its own homework, and neither can a reviewer who is, in every respect that matters, the same model wearing a different label. This is the banking sentence the AI field has not yet internalized: independence is not a property of the review event. It is a property of the relationship between the reviewer and the thing reviewed — and that relationship can be measured.

There is one more clause worth surfacing, because it defines the gap this stack exists to close. SR 26-2 places generative and agentic AI expressly outside its formal model-risk scope as "novel and rapidly evolving," deferring them to broader risk management — so the supervisor itself leaves the autonomous-agent-action gap open. 5 The doctrine of effective challenge is fifteen years mature and battle-tested. The supervisor declined to extend it to agents. The deployer inherits the obligation anyway.

Section 03

The London Whale: When Independence Is Nominal

The cautionary case for effective challenge is one of the largest trading losses in banking history, and it is, at bottom, a story about a model reviewed without a real challenger.

In 2012, JPMorgan Chase's Chief Investment Office held a large, complex position in credit derivatives — the Synthetic Credit Portfolio. As that portfolio's risk profile grew uncomfortable against its Value-at-Risk limits, the CIO changed the VaR model that governed it. The new model produced a substantially lower risk figure for the same positions — it understated the risk — and on that flattering number the desk continued to build. 6 The model change was reviewed. That is the trap. It was not reviewed by a genuinely independent, empowered, credible challenger; the review was nominal — conducted close to the desk that benefited from the answer, without the structural separation and conflict-of-interest controls that effective challenge demands. 7 The position deteriorated, the losses mounted as the market moved against a now-visible whale, and the trading loss reached roughly US$6.2 billion. 6

The US Senate Permanent Subcommittee on Investigations, in its 2013 report, faulted the model-risk governance directly: a VaR model change that understated risk, approved through a process that lacked genuine independent validation, on a portfolio whose size made the consequences of a wrong number enormous. 7

The new model was not defeated by bad math. It was defeated by a review that was independent on the org chart and captured in the errors — a challenger too close to the desk to challenge it.

The failure was structural, not arithmetic

The lesson is exact, and it is the lesson this entire paper turns on. The model's own judgment of its risk — or a captured, sibling reviewer's, which is functionally the same judgment in a different chair — was the failure. Independence was nominal: it existed on paper, in the reporting lines and the review steps. It did not exist where it had to, which is in the errors. A reviewer whose mistakes correlate with the builder's, or whose incentives align with the builder's, will pass exactly the model the builder most wants passed. That is the failure mode an AI ensemble reproduces precisely when a critic model of the actor's own lineage signs off on the actor's own output: a review that looks independent and is not, on the one case where it had to be.

Section 04

The Shared Move: Severance Plus Qualification

Banking is not alone in reaching this structure, and the convergence is the evidence that the structure is forced by the problem rather than chosen by a community's conventions. Aviation software certification reached the same place under DO-178C, which requires certain verification objectives to be performed with independence — by someone other than the developer of the artifact under verification — with the rigor of that independence scaling to the consequence of failure. 8 The existing ETOPS-for-AI stack already develops the aviation analogue in full — the diversion-airport doctrine of Always a Runway and the reliability accounting of Reliability You Can Bank — and rather than re-derive it, this paper points to it: two regulated safety regimes, banking and aviation, independently rediscovered that the checker must be severed from the builder. 1

But severance alone is not the whole move, and the second half is the one AI practice most often omits. DO-178C's companion standard, DO-330, requires that any tool whose output you rely upon be itself qualified (its competence demonstrated, documented, and change-controlled) before its verdict counts as assurance. 9 An unqualified checker is not a control even if it is perfectly independent. Banking encodes the same requirement inside "credible": the validation function must be competent to find the flaw, not merely separate from the people who could not. So the shared move has two parts, not one. Severance: the verifier is structurally separate from the actor, with separate lineage, separate context, and a separate owning function. Qualification: the verifier is itself proven competent on the task class it gates, before it is allowed to gate. A second model that is independent of the actor but unqualified for the task is theater; a model that is qualified but correlated with the actor is theater; only the conjunction is a control.

Section 05

Mapping to Agent Autonomy: Independence in Concrete Units

Translate this into the units the LAAS corpus uses to govern agent actions, and the abstraction becomes measurable. The governance unit is the (agent, task-class) pair, and each is assigned a consequence tier from CT0 to CT4 by an out-of-process gate that reads the action's observed effect surface — its actual reversibility, scope, and consequence — never the agent's self-report. 10 11

Tier	Action class	Verifier requirement
CT0	Read-only	Trace only
CT1	Reversible internal write	Self-check against an exact verifier where one exists
CT2	Reversible or low-value external effect	Independent automated check or rehearsed rollback; bounded residual; trace
CT3	Hard-to-reverse or material external effect	Mandatory independent, qualified, pre-commit verification; no self-grading; backtested escape rate ≤ tolerance; rollback plan; trace
CT4	Irreversible or high-consequence	Everything CT3 requires plus human approval, abstention default, and full evidence
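
The tier ladder above can be sketched as a small classifier. This is an illustrative reading of the table, not the LAAS gate itself: the `EffectSurface` fields, the `classify` function, and the precedence order (most severe condition wins) are assumptions introduced here, and the table does not say how a hard-to-reverse internal write should be tiered, so the sketch conservatively sends it to CT3.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    CT0 = 0
    CT1 = 1
    CT2 = 2
    CT3 = 3
    CT4 = 4


@dataclass(frozen=True)
class EffectSurface:
    """Observed (never self-reported) properties of an action.

    All field names are hypothetical; the source only names reversibility,
    scope, and consequence as the things the gate reads.
    """
    writes: bool            # does the action change any state?
    external: bool          # does the effect leave the system boundary?
    reversibility: str      # "reversible", "hard", or "irreversible"
    material: bool          # is the external effect material?
    high_consequence: bool  # is the consequence high regardless of reversibility?


def classify(effect: EffectSurface) -> Tier:
    """Assign a consequence tier, checking the most severe conditions first."""
    if effect.reversibility == "irreversible" or effect.high_consequence:
        return Tier.CT4
    if effect.reversibility == "hard" or (effect.external and effect.material):
        return Tier.CT3
    if effect.external:
        return Tier.CT2
    if effect.writes:
        return Tier.CT1
    return Tier.CT0


# Verifier requirement per tier, abbreviated from the table.
REQUIREMENT = {
    Tier.CT0: "trace only",
    Tier.CT1: "self-check against an exact verifier where one exists",
    Tier.CT2: "independent automated check or rehearsed rollback",
    Tier.CT3: "independent, qualified, pre-commit verification; no self-grading",
    Tier.CT4: "everything CT3 requires plus human approval and abstention default",
}
```

Because the classifier takes only observed fields, nothing the agent says about its own action can lower the tier, which is the property the out-of-process gate exists to protect.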

This is where effective challenge becomes operational rather than aspirational. At CT3 and above, "independent" is not a posture; it has three measurable conditions, each of which maps to one obligation defined below.

Separate lineage. A model verifier is independent of the actor only if it is of a distinct model lineage. A different kind of mechanism altogether, such as a deterministic checker, also counts, at any tier, for the deterministically checkable class. 12 A verifier sharing the actor's lineage is presumed non-independent, full stop, because the two will share training data, biases, and prompt-injection susceptibility and will therefore fail in common. 2

Verifier qualification. The challenger must be qualified for the tier it gates — its competence on the task class measured and documented before it may grade, with a negative-test suite and a change-controlled, trace-recorded verifier version — the direct DO-330 tool-qualification analogue. 9 11 An unqualified challenger is not a control regardless of its independence.

Measured low error correlation. The actor and the verifier must fail independently on the task class, and this is the condition almost no AI system measures. The ceiling is ρ ≤ 0.2, where ρ is the phi coefficient of the two error indicators. It is measured on a frozen, versioned, held-out adversarial eval set of at least 1,000 labeled hard cases, by running the actor and the candidate verifier independently over the set and computing phi on the two-by-two contingency table of actor-wrong against verifier-wrong. 13 11 Phi is chosen over chance-corrected agreement statistics deliberately, because the quantity of interest is correlated failure, not agreement, and phi on the joint-error table is exactly the linear correlation of the two error indicators. 13 Admission requires the upper bound of the 95% confidence interval, not the point estimate, to be at or below 0.2, because admitting on the point estimate would ignore sampling error. A verifier whose interval crosses 0.2 is not admitted, and where correlation cannot be bounded, any CT3-or-above action falls back to a deterministic or human check. 13
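
A minimal sketch of that measurement follows, using only the Python standard library. The phi formula is the standard one for a two-by-two table. The interval method is an assumption: the text does not name one, so the sketch uses a percentile bootstrap with a fixed seed. The function names, the resample count, and the choice to treat an undefined phi as "cannot be bounded" are likewise illustrative.

```python
import math
import random
from collections import Counter


def phi(pairs):
    """Phi coefficient over (actor_wrong, verifier_wrong) boolean pairs.

    Returns None when either error indicator is constant, since phi is
    undefined when a margin of the 2x2 table is zero.
    """
    counts = Counter(pairs)
    n11 = counts[(True, True)]    # both wrong
    n10 = counts[(True, False)]   # actor wrong only
    n01 = counts[(False, True)]   # verifier wrong only
    n00 = counts[(False, False)]  # both right
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return None
    return (n11 * n00 - n10 * n01) / denom


def phi_upper_bound(pairs, resamples=1000, seed=0):
    """Upper end of a two-sided 95% percentile-bootstrap interval for phi.

    Assumption: bootstrap is one reasonable interval method, not the one
    the source prescribes. Returns None if any resample is undefined.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(resamples):
        value = phi(rng.choices(pairs, k=len(pairs)))
        if value is None:
            return None
        values.append(value)
    values.sort()
    return values[math.ceil(0.975 * resamples) - 1]


def admit(pairs, ceiling=0.2, min_cases=1000):
    """Admit the verifier only if the set is large enough and the upper
    bound, not the point estimate, is at or below the ceiling."""
    if len(pairs) < min_cases:
        return False
    upper = phi_upper_bound(pairs)
    return upper is not None and upper <= ceiling
```

A caller would build `pairs` by scoring the actor and the candidate verifier separately against the labeled set, then pass the list to `admit`. A `False` result, including the case where the bound could not be computed, routes the action to the deterministic or human fallback.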

These conditions do not stand alone; they compose with the rest of the envelope. The verifier sits pre-commit — before the action crosses its rollback horizon, the bounded time within which a committed action can still be undone — because a challenger that fires after the irreversible step is a comment on a fait accompli. 11 Authority granted at a tier lives inside a revocable envelope that auto-contracts on a backtest breach, the same earned-and-revocable range the ETOPS stack develops in Earned Range. 1 And the escape rate the verifier is held to is the backtested rate at which a wrong result survives every applicable check and is acted upon, controlled to tier tolerance — which is exactly the subject of the second Banking Stack article. 1 14

Section 06

The LAAS Obligations This Article Defines

The LAAS in-depth, Section 11, defines the obligation framework — the no-self-grading invariant, vendor attribution, lifecycle, deterministic precedence — but does not enumerate stable identifiers for the independence controls. 15 This article introduces three, defines them inline, and maps them by ID. Each cites the LAAS in-depth as the framework source.

Obligation ID	Definition	Tier floor
OBL-IND-001	A CT3+ action is verified by a challenger structurally separate from the actor — separate model lineage, separate prompt and context, separate owning function — that did not build the thing it grades. A second pass by a sibling of the actor does not satisfy this obligation.	CT3
OBL-VQ-001	The verifier is itself qualified for the tier it gates: its competence on the task class is measured and documented before it may grade — a DO-330 tool-qualification analogue, with a negative-test suite and a change-controlled verifier version recorded in the trace. An unqualified challenger is not a control.	CT3
OBL-IRR-001	Actor and verifier errors must be statistically independent on the task class: ρ ≤ 0.2 (phi, 95%-CI upper bound) on a frozen ≥ 1,000-case adversarial set. A verifier whose errors correlate with the actor's adds no independence regardless of org chart.	CT3

OBL-IND-001 is the direct translation of effective challenge: the party that validates is severed from the party that built. 3 It is the refusal to self-grade, stated as a hard precondition on hard-to-reverse action.

OBL-VQ-001 closes the gap a naive reading of OBL-IND-001 would leave open: a separate-but-incompetent reviewer. Severance without qualification is independence theater. 9

OBL-IRR-001 is the obligation that makes "demonstrably uncorrelated errors" a number rather than a hope, and it is the one that the London Whale review failed in spirit and that an AI critic of the actor's own lineage fails in fact. 7 2

The three obligations interlock. OBL-IND-001 demands structural severance; OBL-VQ-001 demands the severed reviewer be competent; OBL-IRR-001 demands that competence and severance actually produce independent failures, measured. Strip any one and the other two collapse into the comfortable story this paper opened by refusing.
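
The interlock can be sketched as a single admission check that reports which obligations a candidate model verifier fails. The obligation IDs, the 0.2 ceiling, and the 1,000-case floor come from the table above; the `VerifierEvidence` record and every field name in it are hypothetical, and the sketch covers model verifiers only, leaving deterministic checkers aside.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VerifierEvidence:
    """Evidence recorded for one (agent, task-class) pair. Fields are hypothetical."""
    actor_lineage: str
    verifier_lineage: str
    shares_prompt_or_context: bool
    same_owning_function: bool
    competence_measured: bool          # measured and documented on the task class
    negative_tests_passed: bool
    version_recorded_in_trace: bool
    phi_upper_95: Optional[float]      # None when correlation could not be bounded
    eval_set_size: int


def unmet_obligations(e: VerifierEvidence) -> List[str]:
    """Return the IDs of obligations the evidence fails to satisfy."""
    unmet = []
    severed = (
        e.verifier_lineage != e.actor_lineage
        and not e.shares_prompt_or_context
        and not e.same_owning_function
    )
    if not severed:
        unmet.append("OBL-IND-001")
    qualified = (
        e.competence_measured
        and e.negative_tests_passed
        and e.version_recorded_in_trace
    )
    if not qualified:
        unmet.append("OBL-VQ-001")
    bounded = (
        e.phi_upper_95 is not None
        and e.phi_upper_95 <= 0.2
        and e.eval_set_size >= 1000
    )
    if not bounded:
        unmet.append("OBL-IRR-001")
    return unmet


def is_control(e: VerifierEvidence) -> bool:
    """Only the conjunction of all three obligations counts as a control."""
    return not unmet_obligations(e)
```

Returning the list of failed IDs rather than a bare boolean keeps the trace useful: a rejected verifier shows which leg of the conjunction is missing.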

Section 07

The Economic Inversion: Independence Is the Asset

The instinct is to read all of this as overhead — a tax on agent throughput, a set of gates that slow the system down. The instinct is backwards, and the banking and aviation records prove it backwards.

Effective challenge is not what banks pay to be allowed to use models. It is what lets banks use models at scale at all. A validation function that can credibly, independently certify a model's risk is the precondition for putting that model on a large book — because the institution, the board, and the supervisor can trust the number. Without effective challenge, every consequential model is a liability nobody can size; with it, the same model becomes an asset the institution can lean on. The discipline does not cap the model's use. It is what makes wide use defensible.

The same inversion governs the existing ETOPS stack, which this paper deliberately reuses rather than re-derives: rigorous reliability accounting is what unlocks the wider operating envelope — the twin flies the direct over-water route because its reliability is provable, not despite the discipline. 1

For agents the corollary is direct and is the load-bearing economic argument. A proven-independent challenger — separate lineage, qualified, ρ ≤ 0.2 — is what lets you grant an agent a wider autonomy budget without a human in the loop on every action. The organization that cannot prove its verifier is independent has no principled way to remove the human approval gate, so it keeps the gate on every CT3 action, and the agent's nominal autonomy is fictional — every consequential step still routes through a person. 11 The organization that can prove independence can remove the gate exactly where the data says the challenger is doing the job a human would have done, and keep it only where the tail risk warrants it. The dividend is not the absence of checking. It is supervision allocated to where the evidence says it is needed — and proven independence is what reallocates it. 1 Independence is not the tax on autonomy. It is the asset that earns it.

Section 08

Honest Limits: What a Verified Challenger Does and Does Not Buy

This discipline must be stated with its limits, or it becomes the very over-claim it is built to prevent.

An independent, qualified, low-correlation challenger does not make an agent correct. It bounds the rate of undetected failure relative to a measured experience base and a specification — not relative to ground truth. 16 The ρ ≤ 0.2 guarantee is a guarantee about a measured eval set: a frozen, versioned, adversarially stressed set of at least 1,000 hard cases. 13 If the deployment distribution drifts away from that set, the measured correlation can decay, and the only sound response is to treat drift as a monitored, re-measured risk — re-running the correlation protocol on any model, prompt, tool, or policy change — rather than asserting the bound still holds. 11 Backtesting, here as everywhere in this stack, bounds undetected failure against a base and a spec. It does not certify the spec, and it cannot see the failure mode the adversarial set never sampled.
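
One way to make "re-measure on any model, prompt, tool, or policy change" mechanical is to fingerprint the configuration the correlation was measured under and compare it to the live configuration before trusting the stored bound. The sketch below is an assumption about how that could look: the function names and the four inputs are illustrative, and it detects configuration change only. Drift in the deployment distribution itself still has to be monitored separately.

```python
import hashlib
import json
from typing import Iterable


def config_fingerprint(model: str, prompt: str, tools: Iterable[str], policy: str) -> str:
    """Stable SHA-256 digest of the configuration a measurement was taken under.

    Tool names are sorted so that reordering the same tools does not
    register as a change.
    """
    payload = json.dumps(
        {"model": model, "prompt": prompt, "tools": sorted(tools), "policy": policy},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_remeasurement(measured_fingerprint: str, current_fingerprint: str) -> bool:
    """True when the live configuration differs from the measured one,
    meaning the stored correlation bound can no longer be asserted."""
    return measured_fingerprint != current_fingerprint
```

Storing the fingerprint next to the measured bound in the trace ties each admission decision to the exact configuration it was earned under.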

There is a sharper limit at the top of the tier ladder. At CT4 the escape-rate tolerance is zero, and a binomial upper bound over any finite sample is strictly greater than zero — so no amount of backtesting can ever pass a CT4 action on measured residual alone. 14 That is not a flaw in the accounting; it is the accounting telling the truth. CT4 is exactly the tier that requires a human gate and an abstention default, because the irreversible high-consequence action is the one where measured independence is necessary but not sufficient. 11
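
A worked instance makes the point concrete. Assume the best possible backtest: zero observed escapes in n independent trials. The one-sided exact binomial upper bound at confidence 1 − α is then the largest escape probability p still consistent with seeing no escapes. The choice of the exact bound, α = 0.05, and n = 1,000 are illustrative assumptions; the text does not fix them.

```latex
\[
(1 - p_{\mathrm{upper}})^{n} = \alpha
\quad\Longrightarrow\quad
p_{\mathrm{upper}} = 1 - \alpha^{1/n}
\]
\[
\alpha = 0.05,\; n = 1000:
\qquad
p_{\mathrm{upper}} = 1 - 0.05^{1/1000} \approx 0.0030
\]
\[
0 < \alpha < 1 \ \text{and}\ n < \infty
\quad\Longrightarrow\quad
\alpha^{1/n} < 1
\quad\Longrightarrow\quad
p_{\mathrm{upper}} > 0
\]
```

Even a perfect record over a thousand cases leaves an upper bound of about 0.3%, and enlarging n shrinks the bound without ever reaching the zero that CT4 demands.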

And the correlation measure itself has a floor of honesty. A verifier of genuinely distinct lineage can still share a systematic blind spot with the actor that the adversarial set failed to probe — common-mode failure is bounded by the set, not abolished by it. 2 The discipline reduces the probability that the actor and verifier are wrong together; it does not reduce it to zero, and claiming otherwise would reproduce the London Whale's error of mistaking a flattering number for a safe one. 7

The honest scope statement is this. A proven-independent challenger turns "we hope the reviewer caught it" into "we measured how strongly the reviewer's failures correlate with the actor's, the 95% upper bound on that correlation is at or below 0.2 on a frozen adversarial set, and we re-measure on every change." 13 That is a smaller claim than infallibility and a far larger one than a second forward pass. It is the difference between a control and a comment.

Section 09

The Posture: Name Your Challenger Before the Agent Commits

The deployable discipline collapses into one demand made before an agent is allowed to commit a hard-to-reverse action, not after it has.

Name your challenger. Not "a reviewer model" in the abstract — the specific verifier, by version, that will gate this (agent, task-class) pair at CT3 and above. Then prove its independence in the three measurable dimensions, and record the proof in the trace. Prove separate lineage — that the challenger is not a sibling of the actor, sharing its blind spots (OBL-IND-001). 15 Prove it is qualified — competent on the task class, measured and documented before it may grade, a DO-330 analogue (OBL-VQ-001). 9 Prove ρ ≤ 0.2 — actor and verifier errors uncorrelated to the bounded ceiling, phi on a frozen ≥ 1,000-case adversarial set, 95%-CI upper bound, re-measured on every change (OBL-IRR-001). 13

This is the same refusal the third Banking Stack article makes in the other direction. You cannot self-grade (this article) and you cannot outsource the obligation to govern (article three) are one principle: accountability is non-delegable — inward to the builder or downward to a vendor. 1 And it is what makes the second article's backtest trustworthy: a verifier that is not independent cannot honestly count its own exceptions, so the escape-rate measurement of Backtested, Not Demoed presupposes the independence this article requires. 1

The London Whale had a reviewer. JPMorgan had a process. Both were independent on the org chart and captured in the errors, and US$6.2 billion was the price of mistaking the first for the second. 6 7 The AI ensemble that lets a critic model of the actor's own lineage sign off on the actor's own output is making the same mistake, at machine speed, on every action.

Name your challenger and prove its independence — separate lineage, qualified, ρ ≤ 0.2 — before the agent commits.

The brief companion to this paper — Effective Challenge — introduces the core argument in a shorter form. Related papers: Backtested, Not Demoed prices authority in measured failure data; and You Cannot Outsource the Obligation to Govern holds the deployer accountable for every model it runs.

