Governed Like Aviation, Audited Like Banking: The Integrity Discipline AI Is Rediscovering

KellerAI

Section 01

The Reframe: Hallucination Is the Absence of a Certificate

The standard framing of hallucination is generative: a language model produces a token sequence that, taken as a factual assertion, is false, yet the model delivers it without caveat and with apparent confidence. Research has documented the phenomenon extensively, and recent theoretical work has established structural lower bounds on how frequently it must occur. Kalai, Nachum, Vempala, and Zhang (2025) prove that generation is at least as hard as validity classification: a model that never produced false outputs would, by that very capability, also solve the problem of detecting false inputs, so the error rate of the classification problem sets a floor on the error rate of generation. 1 Banerjee, Agarwal, and Singla (2024) establish a complementary undecidability result: for sufficiently rich output classes, no finite process can determine whether a given output is true, so some hallucination rate is structurally irreducible at the single-pass level. 2

These are genuine, important results. They are also widely misread. Neither result says that a system cannot be built to guarantee the truthfulness of its emitted assertions. Both results are scoped to a single generation step. The user does not experience a generation step; the user experiences a delivered output. A system that governs what it chooses to deliver—that abstains when it cannot certify, and emits only what it can stand behind—is not refuted by single-pass inevitability theorems. The theorems locate the impossibility correctly: inside a generation function. They say nothing about a system that wraps generation in a certification layer and withholds the output when the certificate is absent.

The reframe, then, is this. Drop the question “can a model stop hallucinating?” and ask instead: “what can a system guarantee about the claims it chooses to emit?” The answer to that question is not bounded by the generation-step theorems. It is bounded by what can be verified at delivery time, and by how aggressively the system is willing to abstain when verification fails. Certification of an emitted claim is an architectural property, not a model property. The system that builds it does not need to solve generation-time truthfulness—it needs only to detect when generation-time truthfulness cannot be confirmed, which is the strictly easier problem.

Hallucination is not a model pathology to be tolerated. It is the absence of a certificate—an engineering gap, not a cognitive one. The unanswerable question is whether a model can be made never to hallucinate. The answerable question is what a system can guarantee about what it emits.

Core reframe

This reframe has a precedent in both industries we will examine. Aviation does not ask whether a navigation system can be made never to produce a wrong position estimate. It asks what integrity guarantee—what bound on the rate of wrong estimates that escape detection—the system can provide. Banking does not ask whether a risk model can be made never to produce a wrong loss estimate. It asks whether the model’s stated confidence interval is validated, measured, and accountable to a supervisor. In both cases, the practical achievement was not the elimination of error; it was the elimination of undetected error, at a tolerance scaled to the consequence of a miss. That distinction—between error and undetected error—is the load-bearing concept this paper develops.

Section 02

Two Buckets: What You Can Check (D) and What You Must Bound (P)

The system’s output decomposes into atomic claims. Some of those claims belong to what we will call the deterministic bucket—call it D. A claim is in D if an almost-surely-correct, inference-time-computable indicator for its truth exists: a verifier, an oracle. Schema validity is in D: a JSON object either conforms to a schema or it does not, and a validator decides this exactly. Arithmetic is in D. Citation resolution is in D—a cited URL either resolves to the stated content or it does not. Policy conformance is in D—a proposed action either satisfies a formally stated policy or it violates it. The common characteristic of D-claims is that no probabilistic machinery is needed to assess them; the truth-value can be computed without relying on the model’s own estimation.

The complement of D is P—the probabilistic bucket. A claim is in P when no inference-time oracle exists. Open-world factual claims fall here: whether a particular event occurred, whether a particular person holds a particular view, whether a prediction will prove accurate. The model’s training distribution is the only evidence available at delivery time, and the training distribution cannot certify individual facts the way a schema validator can certify schema conformance. The statistical distance between the evidence distributions under “claim true” and “claim false” may be nonzero, but it is not one; so a monitor working from that evidence will have a bounded-but-nonzero miss rate by information-theoretic necessity.

On D, the correct strategy is elimination. Build a gate that is sound—it never emits a D-claim unless the verifier confirms it—and default-deny—it abstains whenever bucket membership is itself uncertain. Under those conditions, the false-assertion rate on D is zero. This is not an approximation; it follows from the definitions. If only the verifier’s verdict is emitted, and the verifier is correct, then D-claims are always true when emitted. The generation step is rendered irrelevant on D: the model’s own propensity to err does not enter the guarantee, because the model’s assertion is never trusted. This is correctness relative to specification—Hoare logic applied to token streams.
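A minimal sketch of such a gate, under stated assumptions: the names `Claim`, `VERIFIERS`, and `gate` are hypothetical, and the two toy verifiers (integer addition and JSON well-formedness, the latter standing in for schema validation) are illustrative only. The point is the control flow: the model's assertion is never trusted, and any uncertainty about bucket membership or verifier outcome results in abstention.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Claim:
    kind: str     # hypothetical claim-type label, e.g. "arithmetic"
    payload: str  # the content to be checked


def verify_arithmetic(payload: str) -> bool:
    """Exactly check a claim of the form 'a + b = c' over the integers."""
    left, right = payload.split("=")
    a, b = left.split("+")
    return int(a) + int(b) == int(right)


def verify_json(payload: str) -> bool:
    """Check that the payload parses as JSON (stand-in for schema validation)."""
    try:
        json.loads(payload)
        return True
    except json.JSONDecodeError:
        return False


# Registry of D-bucket verifiers. Anything not listed here is not in D.
VERIFIERS: Dict[str, Callable[[str], bool]] = {
    "arithmetic": verify_arithmetic,
    "json": verify_json,
}


def gate(claim: Claim) -> Optional[str]:
    """Emit the claim only if its verifier confirms it; otherwise abstain (None)."""
    verifier = VERIFIERS.get(claim.kind)
    if verifier is None:
        return None  # default-deny: bucket membership is unknown
    try:
        confirmed = verifier(claim.payload)
    except Exception:
        return None  # default-deny: the verifier could not reach a verdict
    return claim.payload if confirmed else None


print(gate(Claim("arithmetic", "2 + 2 = 4")))   # emitted
print(gate(Claim("arithmetic", "2 + 2 = 5")))   # None (verifier rejects)
print(gate(Claim("history", "X happened")))     # None (no oracle registered)
```

In this sketch a wrong emitted claim can only come from a wrong verifier, which is the reattribution the text describes.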

The reattribution clause that follows is load-bearing and deserves explicit statement. If a sound, default-deny gate emits a D-claim that turns out to be wrong in the real world, the wrongness is necessarily a defect in the specification of the verifier, not a generative error. The model ran correctly on D; the oracle was incorrect or incomplete. This relocates the error to where it has always belonged, the requirements layer, where it is auditable, named, and fixable, rather than diffused into the opaque interior of a generation process. The error has not disappeared; it has been made accountable.

On P, elimination is impossible. Selective prediction is the correct frame. 3 4 The gate chooses, for each P-claim, whether to assert it or to abstain. The objective is to minimize the rate of false assertions among the claims that are asserted—the selective risk—subject to a constraint that the gate asserts often enough to be useful—the coverage constraint. The tolerance on selective risk is set by consequence: how bad is a false assertion of this type? A claim about a patient’s medication dose demands a tighter tolerance than a claim about a historical date in a consumer application. The coverage constraint is set by utility: how often must the system answer rather than abstain? The two parameters are in tension, and the appropriate operating point is a governance decision, not a technical one.
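The tension between the two parameters can be made concrete with a small sketch. It assumes a hypothetical list of `(confidence, is_true)` pairs from a labeled evaluation set and a gate that asserts whenever confidence meets a threshold; the function name and the sample data are illustrative, not taken from any cited paper.

```python
from typing import List, Tuple


def selective_metrics(
    scored: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Return (coverage, selective_risk) when asserting iff confidence >= threshold.

    coverage       = fraction of claims the gate asserts
    selective_risk = fraction of asserted claims that are false
    """
    asserted = [is_true for confidence, is_true in scored if confidence >= threshold]
    coverage = len(asserted) / len(scored)
    if not asserted:
        return 0.0, 0.0  # the gate abstained on everything
    risk = sum(1 for is_true in asserted if not is_true) / len(asserted)
    return coverage, risk


# Hypothetical evaluation data: (model confidence, ground-truth label).
data = [(0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.40, False)]

print(selective_metrics(data, 0.5))   # permissive gate: more coverage, more risk
print(selective_metrics(data, 0.85))  # strict gate: less coverage, less risk
```

Raising the threshold lowers selective risk at the cost of coverage; which point on that curve to operate at is the governance decision described above.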

Section 03

Aviation Solved the Systematic/Random Split

Civil aviation’s safety record—catastrophic failure rates measured in the range of one per billion flight hours for the most critical systems—was not achieved by trusting any single component to be infallible. It was achieved by a discipline that begins with a principled decomposition of failure modes and then applies different engineering strategies to each. The decomposition is formal and legally enforced.

DO-178C, the governing standard for airborne software, draws the first line. 5 Systematic failures, those caused by design decisions, software logic, or requirements defects, are deterministic: given the same inputs, they reproduce. They are not assigned a probability. Instead, they are subjected to design assurance: requirements-based verification (tests derived from requirements, never from code), bidirectional traceability (every requirement maps to implementation and test, every test maps back to a requirement), and structural coverage analysis (at the highest assurance level, modified condition/decision coverage requires showing that each condition in a decision independently affects that decision’s outcome). The goal is not to reduce the probability of systematic error to an acceptable level; it is to drive it to zero through evidence that the requirements are correctly implemented. Any residual wrongness traces to a named, auditable requirements defect, never to the software itself, which is considered to have executed its specification faithfully.

ARP4754A and ARP4761 govern the system-level safety assessment that determines what assurance level each function requires. 7 8 The functional hazard assessment classifies failure conditions by severity: catastrophic (loss of aircraft or lives), hazardous, major, minor, no safety effect. Each severity maps to a Design Assurance Level, DAL A through E, with DAL A demanding the most rigorous process. AC 25.1309 then sets the allowable probability for random failure at each severity level: catastrophic conditions must have a probability not exceeding 10⁻⁹ per flight hour; hazardous conditions must not exceed 10⁻⁷; major conditions must not exceed 10⁻⁵. 6 These are not targets; they are ceilings, legally binding, validated through fault trees and safety assessment processes.
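The severity-to-assurance mapping is essentially a lookup table, and a sketch makes its shape explicit. The names `Assurance`, `SEVERITY_TABLE`, and `meets_ceiling` are hypothetical. The three numerical ceilings are the ones quoted in the paragraph above; no figure is quoted there for minor or no-safety-effect conditions, so the sketch leaves them as `None` rather than supplying one.

```python
from typing import Dict, NamedTuple, Optional


class Assurance(NamedTuple):
    dal: str                                   # Design Assurance Level, A (strictest) to E
    max_prob_per_flight_hour: Optional[float]  # None = no ceiling quoted in the text


SEVERITY_TABLE: Dict[str, Assurance] = {
    "catastrophic":     Assurance("A", 1e-9),
    "hazardous":        Assurance("B", 1e-7),
    "major":            Assurance("C", 1e-5),
    "minor":            Assurance("D", None),
    "no safety effect": Assurance("E", None),
}


def meets_ceiling(severity: str, estimated_prob: float) -> bool:
    """True if the estimated random-failure probability is within the ceiling."""
    ceiling = SEVERITY_TABLE[severity].max_prob_per_flight_hour
    if ceiling is None:
        # Assumption of this sketch: with no quoted ceiling, nothing is checked.
        return True
    return estimated_prob <= ceiling


print(SEVERITY_TABLE["catastrophic"])
print(meets_ceiling("hazardous", 1e-6))  # False: exceeds the 1e-7 ceiling
```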

The concept of Hazardously Misleading Information (HMI) deserves particular attention because it maps directly to hallucination at the system level. HMI is not simply wrong information; it is wrong information delivered without a warning flag—information the pilot has no reason to distrust. A navigation system that reports the aircraft’s position as 0.1 nautical miles left of the approach path when it is actually 1.1 nautical miles left, without triggering an integrity alert, has produced HMI. The pilot, trusting the display, descends toward terrain. The failure is not that the system was wrong. Systems are sometimes wrong. The failure is that the system was wrong and did not say so.

Required Navigation Performance (RNP) makes the integrity requirement concrete and numerical. 9 10 An RNP approach requires that the aircraft stay within a defined containment value of the intended path—approximately twice the 95% accuracy value—and that the probability of being outside that containment without a timely alert be below 10⁻⁵ to 10⁻⁷ per approach. The Receiver Autonomous Integrity Monitoring (RAIM) system cross-checks redundant satellite measurements via an independent consistency computation to detect whether the position estimate can be trusted. If integrity cannot be confirmed—if there are insufficient satellites or if self-consistency fails—the system issues an alert and the pilot must execute a missed approach. The system does not guess and hope. It abstains.

Envelope protection, as implemented in fly-by-wire aircraft, is the deterministic bucket made physical. The control laws contain hard limits—maximum pitch, minimum speed, maximum bank angle—that the control system will not allow the pilot to exceed regardless of sidestick input. These are not warnings. They are structural prohibitions. The envelope protection layer does not evaluate the probability that exceeding the limit would be harmful and decide whether the expected cost justifies refusal. It simply refuses. This is default-deny implemented in certified flight-control hardware, at the highest applicable assurance level.

The Swiss-cheese model completes the picture. No single layer is perfect; each has its characteristic holes. The system’s safety depends on the holes in independent layers not aligning. RAIM catches what the primary GPS channel misses; the independent vertical situation indicator catches what RAIM misses; terrain awareness catches what the vertical situation indicator misses. Defense in depth, not single-point perfection. The independence of each layer is load-bearing—a correlated failure in two layers is no better than a single-layer failure.

Section 04

Banking Solved Model Risk

The 2008 financial crisis made explicit what practitioners had long understood in quieter forms: a model output that is not independently challenged can, if widely trusted, become systematically dangerous at scale. The regulatory response, SR 11-7 issued jointly by the Federal Reserve and the Office of the Comptroller of the Currency in 2011, codified a governance framework that makes no conceptual distinction between a risk model and any other system that produces consequential estimates. 11 Its language is applicable to language models without modification: a model is “a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates.” A language model is a quantitative system that applies mathematical techniques to process input data into token-sequence estimates. SR 11-7 covers it.

The framework’s central demand is effective challenge: independent review and testing of model development, assumptions, outputs, and use. The reviewer must be structurally independent of the model developer—no self-grading. This mirrors aviation’s independent verification and validation requirement, and it reflects the same epistemic insight: a system cannot detect its own blind spots. If the monitor is built from the same assumptions as the generator, it will fail precisely where the generator fails. Four-eyes and maker-checker are not bureaucratic formalities; they are the engineering response to the mathematics of correlated failure. SR 11-7 also establishes that a bank cannot transfer its model risk obligation by purchasing a third-party system. Vendor models fall within the same framework. Governance is not dischargeable by procurement.

The Basel framework for market risk backtesting provides the mechanism for validating that a model’s stated confidence interval is actually calibrated. Value-at-Risk is a quantile estimate: the model states that losses will not exceed a threshold with a stated probability—typically 99%—over a one-day horizon. Kupiec’s proportion-of-failures test checks, unconditionally, whether the empirical exception rate matches the model’s stated tolerance. 12 Under a 99% VaR, roughly 2.5 exceptions per year are expected; substantially more exceptions mean the model is understating risk. Christoffersen’s conditional coverage test adds a clustering check: exceptions should not cluster in time, because real portfolio losses cluster when regimes shift, and a model that misses clustering is miscalibrated precisely when it matters most. 13
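As an illustration of the unconditional check, the following sketch computes Kupiec's proportion-of-failures likelihood ratio from an exception count. The function names are hypothetical; the comparison value 3.84 is the 5% critical value of a chi-squared distribution with one degree of freedom.

```python
import math


def _xlogy(x: float, y: float) -> float:
    """Return x * ln(y), using the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def kupiec_pof(n_obs: int, n_exceptions: int, p: float) -> float:
    """Kupiec proportion-of-failures likelihood-ratio statistic.

    n_obs        : number of backtesting days
    n_exceptions : days on which the loss exceeded VaR
    p            : exception probability the model claims (0.01 for 99% VaR)
    """
    x = n_exceptions
    p_hat = x / n_obs  # observed exception rate
    log_null = _xlogy(n_obs - x, 1 - p) + _xlogy(x, p)
    log_alt = _xlogy(n_obs - x, 1 - p_hat) + _xlogy(x, p_hat)
    return -2.0 * (log_null - log_alt)


CRITICAL_5PCT = 3.84  # chi-squared, 1 degree of freedom, 5% level

for exceptions in (2, 10):
    lr = kupiec_pof(250, exceptions, 0.01)
    verdict = "reject calibration" if lr > CRITICAL_5PCT else "no evidence against"
    print(exceptions, round(lr, 2), verdict)
```

With 250 days and a 99% VaR, two exceptions are consistent with the stated tolerance while ten are not, which matches the expectation of roughly 2.5 exceptions per year.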

The Basel traffic-light system translates these statistical tests into a supervisory action framework. 14 A bank running a 99% VaR model over 250 trading days expects zero to four exceptions if the model is well-calibrated. Five to nine exceptions (yellow zone) trigger a capital multiplier increase. Ten or more exceptions (red zone) trigger immediate recalibration requirements. The regulator does not accept the model’s assertion that it is well-calibrated; it demands evidence in the form of measured exception rates, and it imposes costs when the evidence fails. This is governance-by-evidence, not governance-by-assurance.
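The zone boundaries described above reduce to a few comparisons. This sketch encodes only the thresholds stated in the paragraph (for a 99% VaR model over 250 trading days); the function name is hypothetical, and the supervisory consequences are left as comments rather than modeled.

```python
def basel_zone(exceptions: int) -> str:
    """Classify a 250-day exception count for a 99% VaR model."""
    if exceptions < 0:
        raise ValueError("exception count cannot be negative")
    if exceptions <= 4:
        return "green"   # consistent with a well-calibrated model
    if exceptions <= 9:
        return "yellow"  # capital multiplier increase
    return "red"         # recalibration required


for count in (3, 7, 12):
    print(count, basel_zone(count))
```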

The Fundamental Review of the Trading Book (FRTB) replaced VaR with Expected Shortfall at the 97.5th percentile as the primary risk metric for internal models. 15 The technical motivation is directly relevant to AI: VaR is a single quantile, a point at the threshold. It characterizes the boundary of acceptable losses but says nothing about how bad losses can be beyond that boundary. Expected Shortfall is the average of losses that exceed the VaR threshold, so it directly measures the tail. A model that is well-calibrated on average can be systematically miscalibrated in the tail, and VaR will not catch this. FRTB caught it. The AI parallel is exact: a model with 98% average accuracy can be systematically wrong in the 2% of cases that matter most. The appropriate measure is not mean accuracy; it is the rate and character of the worst failures.
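A small numerical sketch shows why the two measures can disagree. It assumes a simple empirical estimator (VaR as an order statistic of the loss sample, Expected Shortfall as the mean of losses strictly beyond it) and two made-up loss samples; real implementations use more careful estimators, so treat this as an illustration of the concept only.

```python
import math
from typing import List, Tuple


def var_and_es(losses: List[float], level: float) -> Tuple[float, float]:
    """Empirical VaR (the `level` quantile of losses) and ES (mean loss beyond VaR)."""
    ordered = sorted(losses)
    # Rank of the empirical quantile; round() guards against float noise in level * n.
    k = math.ceil(round(level * len(ordered), 9)) - 1
    var = ordered[k]
    tail = [loss for loss in ordered if loss > var]
    es = sum(tail) / len(tail) if tail else var
    return var, es


# Two hypothetical loss samples that differ only in their worst observation.
calm = [1.0] * 8 + [2.0, 3.0]
wild = [1.0] * 8 + [2.0, 30.0]

print(var_and_es(calm, 0.9))  # same VaR ...
print(var_and_es(wild, 0.9))  # ... very different Expected Shortfall
```

Both samples report the same 90% VaR, yet the second has a tail ten times worse; only Expected Shortfall registers the difference.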

Abstention in AI corresponds to conservative reserving in banking: the discipline of declining to book a risk you cannot price, routing it to a suspense queue rather than forcing it through a model that was not designed to handle it. A loan officer who says “this counterparty falls outside our underwriting criteria—refer to credit committee” is performing the same function as an AI system that says “this query falls outside my calibration domain—I cannot give you a reliable answer.” Both are treating epistemic limits as a governance signal, not as a failure to be suppressed.

Section 05

The Integrity Axis: Bound the Undetected Error, Not Average Accuracy

Both industries independently identified the same axis as the critical one, and gave it the same name: integrity. Accuracy—the raw rate of correct outputs—is a useful summary statistic. Integrity is something harder to achieve and more important to guarantee. Integrity bounds the rate of errors that escape detection: the rate of assertions that are false and that the monitoring system failed to catch and flag. The relationship between the two is not fixed; it depends on the monitoring architecture. A system can have high accuracy and low integrity if its monitor is poorly calibrated. A system can have modest accuracy and high integrity if its monitor reliably flags the cases it gets wrong.
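The distinction can be sketched with two hypothetical systems. Each record is an `(is_correct, was_flagged)` pair, where a flag means the monitor warned about or withheld that output; the function name and the counts are illustrative assumptions.

```python
from typing import List, Tuple


def accuracy_and_undetected_rate(
    records: List[Tuple[bool, bool]]
) -> Tuple[float, float]:
    """Return (accuracy, undetected_error_rate) over (is_correct, was_flagged) records.

    accuracy              = fraction of outputs that are correct
    undetected_error_rate = fraction of outputs that are wrong AND were not flagged
    """
    n = len(records)
    correct = sum(1 for is_correct, _ in records if is_correct)
    undetected = sum(
        1 for is_correct, flagged in records if not is_correct and not flagged
    )
    return correct / n, undetected / n


# System A: rarely wrong, but its monitor never flags the errors.
system_a = [(True, False)] * 98 + [(False, False)] * 2
# System B: wrong more often, but every error is flagged.
system_b = [(True, False)] * 90 + [(False, True)] * 10

print(accuracy_and_undetected_rate(system_a))  # high accuracy, poor integrity
print(accuracy_and_undetected_rate(system_b))  # modest accuracy, high integrity
```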

Aviation’s distinction makes this concrete. RNP accuracy is the fraction of flight time during which the aircraft’s actual position is within the Total System Error bound—typically the 95th percentile. 9 RNP integrity is the probability that the aircraft is outside the containment value—a distance approximately twice the 95% accuracy value—without the system having issued an alert. The first is a coverage property; the second is a miss-rate property. Note the two-dimensional tightening: the containment distance is ~2× the accuracy distance, while the undetected-breach probability tolerance is orders of magnitude tighter still. A 95% accuracy requirement alongside a 10⁻⁷ integrity requirement means that the system must be right 95% of the time but, when it is wrong in a way that matters, it must almost never fail to say so. 6

The consequence-scaling logic follows directly. A failure condition that is catastrophic justifies a tighter integrity tolerance than one that is major. DAL A, applied to catastrophic failure conditions, demands the most rigorous verification process and the tightest probability bounds. DAL E, applied to conditions with no safety effect, demands nothing beyond good engineering practice. The tolerance is not a fixed number; it is a function of the cost of an undetected miss. This is risk-proportionate governance, not one-size-fits-all compliance.

Banking’s FRTB Expected Shortfall captures the same insight from the quantitative side. 15 The ordinary VaR exceedance rate is an accuracy measure—how often does the loss exceed the stated threshold? Expected Shortfall characterizes the tail: when the threshold is exceeded, how bad is it? A model that systematically underestimates tail severity has poor “tail integrity” even if its exceedance rate is technically acceptable. The regulatory move from VaR to Expected Shortfall was, in substance, a move from accuracy governance to integrity governance.

For AI systems, conformal abstention operationalizes the integrity target directly. Abbasi-Yadkori et al. (2024) show that a conformally calibrated abstention policy can bound the rate of hallucinations among answered queries to a pre-specified tolerance—not as an average claim, but as a finite-sample, distribution-free guarantee. 16 Yu et al. (2024) cast factuality testing as a hypothesis test with controlled Type-I error—the rate of falsely certified incorrect claims—again with distribution-free finite-sample guarantees and a covariate-shift extension that applies when the deployment distribution differs from the calibration distribution. 17 Both papers are, in the vocabulary of aviation, implementing an integrity requirement: bound the rate of undetected false assertions, not the aggregate error rate.

The consequence-scaled tolerance applies here as well. An AI system operating in a context where a false assertion could lead to a medication error should be calibrated to a tighter integrity tolerance than one operating in a context where false assertions are editorially embarrassing. The architecture permits this: the tolerance parameter is set by the governance layer, and the gate enforces it by abstaining more aggressively as the tolerance tightens. Blast-radius tiering is DAL assignment by another name.

Section 06

The Mathematics Is Shared: Neyman–Pearson, Conformal, RAIM, Kupiec

The cross-discipline correspondence described in the previous sections is not merely analogical. Three fields, working independently on what they perceived as distinct engineering problems, converged on the same decision-theoretic form: a Neyman–Pearson threshold on a nonconformity statistic. The accuracy/integrity and conditional-coverage notions are related strengthenings of marginal coverage, not literally one object—but the shared decision rule is real and forced by the problem’s structure, not by any particular community’s conventions. This convergence is evidence that the solution is correct.

The selective prediction problem on the probabilistic bucket asks for a gate that minimizes false assertion rate among answered queries subject to a coverage constraint. Its Lagrangian relaxation reduces, under the Neyman–Pearson lemma, to a threshold test on a nonconformity statistic: assert when the statistic falls below a threshold, abstain when it exceeds it. The optimal threshold is the one that places the probability of correct assertion at the target level while maximizing coverage. 3 4

Split conformal prediction, introduced in the regression setting by Papadopoulos et al. (2002) and given its general treatment by Vovk, Gammerman, and Shafer (2005), gives this threshold a distribution-free finite-sample guarantee. 18 19 Calibrate on a held-out set of exchangeable examples. Compute a nonconformity score for each. Set the threshold at the appropriate quantile of the calibration scores. By rank-based uniformity arguments alone, requiring no parametric assumptions, no knowledge of the data-generating distribution, and no asymptotic approximation, the probability that a new exchangeable example’s nonconformity score exceeds the threshold is bounded above by the nominal tolerance. The guarantee holds in finite samples. Mohri and Hashimoto (2024) apply this machinery directly to language model factuality, obtaining coverage guarantees on the set of queries a model is permitted to answer. 20
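The calibration step can be sketched in a few lines. This uses the standard finite-sample quantile rank, ceil((n + 1)(1 - alpha)), on a hypothetical list of calibration scores; the function name and the scores are assumptions, and how the nonconformity score is computed for a language-model claim is outside the scope of the sketch.

```python
import math
from typing import List


def conformal_threshold(calibration_scores: List[float], alpha: float) -> float:
    """Split-conformal threshold for tolerance alpha.

    Under exchangeability, a new example's score exceeds the returned
    threshold with probability at most alpha.
    """
    n = len(calibration_scores)
    # Finite-sample corrected rank; round() guards against float noise.
    k = math.ceil(round((n + 1) * (1 - alpha), 9))
    if k > n:
        # Too few calibration examples to support this tolerance:
        # only an infinite threshold is guaranteed to be valid.
        return math.inf
    return sorted(calibration_scores)[k - 1]


# Hypothetical calibration scores: 0.05, 0.10, ..., 0.95 (n = 19).
scores = [i / 20 for i in range(1, 20)]
print(conformal_threshold(scores, alpha=0.1))      # 18th smallest score
print(conformal_threshold(scores[:5], alpha=0.1))  # inf: n too small for alpha
```

The infinite threshold in the second call is the calibration-time analogue of default-deny: with too little evidence, the procedure declines to claim the tolerance rather than guessing.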

RAIM implements the same threshold test on a different nonconformity statistic. The statistic is the sum of squared pseudorange residuals—the inconsistency among redundant satellite measurements. Under a null hypothesis of no satellite failure, this statistic follows a known distribution. If the statistic exceeds a threshold calibrated to the desired miss-rate, the system alerts and abstains from providing a position fix. The threshold selection is a Neyman–Pearson problem: minimize the miss-rate (probability of a satellite failure going undetected) subject to a constraint on the false alert rate (probability of alerting when all satellites are healthy). The structural mathematics is identical to conformal prediction; the nonconformity statistic measures a different kind of self-consistency, but the decision rule is the same threshold test, calibrated to the same tolerance targets. 9 10
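The shape of the test can be illustrated with a toy scalar analogue. This is not a RAIM implementation: it replaces the multi-satellite geometry with redundant noisy measurements of a single quantity, for which the least-squares estimate is the mean. The function names, the noise level `sigma`, and the sample values are assumptions; the threshold 13.28 is approximately the 1% upper critical value of a chi-squared distribution with 4 degrees of freedom, matching five measurements.

```python
from typing import List, Optional


def consistency_statistic(measurements: List[float], sigma: float) -> float:
    """Sum of squared residuals about the least-squares estimate, in units of sigma."""
    estimate = sum(measurements) / len(measurements)
    return sum(((m - estimate) / sigma) ** 2 for m in measurements)


def estimate_or_alert(
    measurements: List[float], sigma: float, threshold: float
) -> Optional[float]:
    """Return the estimate if the measurements are self-consistent, else None (alert)."""
    if len(measurements) < 2:
        return None  # no redundancy, so integrity cannot be checked
    if consistency_statistic(measurements, sigma) > threshold:
        return None  # residuals too large: abstain rather than report
    return sum(measurements) / len(measurements)


THRESHOLD = 13.28  # approx. 1% critical value, chi-squared with 4 degrees of freedom

healthy = [100.1, 99.9, 100.0, 100.2, 99.8]
faulty = [100.1, 99.9, 100.0, 100.2, 105.0]  # one channel is badly off

print(estimate_or_alert(healthy, sigma=0.1, threshold=THRESHOLD))
print(estimate_or_alert(faulty, sigma=0.1, threshold=THRESHOLD))  # None
```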

Basel backtesting is a coverage test in the conformal-prediction sense, stated as a likelihood ratio. 12 13 Kupiec’s proportion-of-failures statistic is a likelihood ratio test of whether the empirical VaR exception rate matches the stated confidence level. Under the null hypothesis of a correctly calibrated model, it asymptotically follows a chi-squared distribution with one degree of freedom, whose critical value at the 5% significance level is 3.84. This is the counterpart of conformal prediction’s empirical coverage check, applied to financial loss quantiles. Christoffersen’s conditional coverage test adds an independence requirement (exceptions should not cluster), which corresponds to the exchangeability assumption that conformal prediction requires for its coverage guarantee to hold. When clustering is present, the effective sample size is smaller than the nominal one, and the guarantee degrades; the Christoffersen test is a diagnostic for that degradation. 13
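The clustering diagnostic can be sketched as the independence component of Christoffersen's test: a likelihood ratio comparing a first-order Markov model of the exception sequence against an independent one. The function names and the two synthetic 50-day sequences are assumptions for illustration; the comparison value is again the 5% chi-squared critical value with one degree of freedom.

```python
import math
from typing import List, Set


def _xlogy(x: float, y: float) -> float:
    """Return x * ln(y), using the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def independence_lr(hits: List[int]) -> float:
    """Likelihood-ratio statistic for independence of a 0/1 exception sequence."""
    if len(hits) < 2:
        raise ValueError("need at least two observations")
    n = [[0, 0], [0, 0]]  # n[i][j] counts transitions from state i to state j
    for prev, cur in zip(hits, hits[1:]):
        n[prev][cur] += 1
    n00, n01, n10, n11 = n[0][0], n[0][1], n[1][0], n[1][1]
    pi = (n01 + n11) / (n00 + n01 + n10 + n11)       # unconditional exception rate
    pi01 = n01 / (n00 + n01) if (n00 + n01) else 0.0  # rate after a quiet day
    pi11 = n11 / (n10 + n11) if (n10 + n11) else 0.0  # rate after an exception
    log_null = _xlogy(n00 + n10, 1 - pi) + _xlogy(n01 + n11, pi)
    log_alt = (_xlogy(n00, 1 - pi01) + _xlogy(n01, pi01)
               + _xlogy(n10, 1 - pi11) + _xlogy(n11, pi11))
    return -2.0 * (log_null - log_alt)


def series(length: int, exception_days: Set[int]) -> List[int]:
    """Build a 0/1 sequence with exceptions on the given day indices."""
    return [1 if day in exception_days else 0 for day in range(length)]


spread = series(50, {10, 20, 30, 40})     # four isolated exceptions
clustered = series(50, {10, 11, 12, 13})  # four consecutive exceptions

print(round(independence_lr(spread), 2))     # below 3.84
print(round(independence_lr(clustered), 2))  # above 3.84: clustering detected
```

Both sequences have the same exception count, so an unconditional test cannot tell them apart; only the transition structure distinguishes them.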

The recalibration logic is also shared. Under distribution shift, the calibration scores are no longer exchangeable with the deployment scores, and the coverage guarantee degrades. Aviation’s response is the operational design domain restriction: declare the conditions under which the integrity guarantee holds, and revert or alert outside them. Banking’s response is the traffic-light capital multiplier: as the empirical exception count rises—as distribution shift reveals itself through exceedances—regulatory costs increase, forcing recalibration. 14 Conformal prediction’s response is weighted or online recalibration: down-weight older calibration examples that are no longer representative, or update the threshold continuously as new evidence arrives. Three approaches, one problem structure.

The correspondence extends to risk budget allocation. Aviation allocates the integrity risk budget across independent failure modes via fault trees: if a catastrophic condition requires a probability below 10⁻⁹, and any one of three independent contributors can cause it, the contributors’ probabilities add, so each must be held to roughly a third of the budget (about 3 × 10⁻¹⁰), not 10⁻⁹. Conformal prediction’s equivalent is Bonferroni-style alpha allocation across claim types: if the system-level hallucination tolerance is 1% and there are N independent claim categories, each category is allocated a fraction of the budget. Both approaches rest on the same fact: across a compound system, the small probabilities of the individual contributors add up to a bound on the probability of the whole.
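A sketch of the alpha-allocation side, assuming hypothetical claim categories and optional consequence weights (neither is specified in the source). By the union bound, if each category's false-assertion probability stays within its share, the system-level probability stays within the total.

```python
from typing import Dict, List, Optional


def allocate_budget(
    total_alpha: float,
    categories: List[str],
    weights: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Split a system-level tolerance across claim categories.

    With no weights the split is equal (classic Bonferroni). With weights,
    each category receives a share proportional to its weight; the shares
    always sum to total_alpha.
    """
    if weights is None:
        weights = {name: 1.0 for name in categories}
    weight_sum = sum(weights[name] for name in categories)
    return {name: total_alpha * weights[name] / weight_sum for name in categories}


# Hypothetical claim categories sharing a 1% system-level tolerance.
names = ["dates", "citations", "quantities", "attributions"]
equal = allocate_budget(0.01, names)
print(equal)

# Hypothetical weighting that gives "quantities" a smaller (tighter) share.
tight = allocate_budget(0.01, names, {"dates": 3, "citations": 3,
                                      "quantities": 1, "attributions": 3})
print(tight)
```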

Section 07

The Information-Theoretic Floor: When Abstention Is the Only Sound Move

The decomposition of claims into the D and P buckets is not arbitrary. It has a precise information-theoretic definition. A claim belongs to D if the observable evidence at delivery time is sufficient to determine its truth with certainty—if the statistical distance between the evidence distributions under “claim true” and “claim false” is one. It belongs to P when that distance is less than one. The boundary is set by the evidence, not by convention.

Le Cam’s two-point method and the Fano inequality—standard information-theoretic results—establish a lower bound on any monitor’s miss-rate as a function of the statistical distance between the evidence distributions. If μ₀ is the distribution of observable evidence when a claim is false, and μ₁ is the distribution when it is true, then no monitor—regardless of its architecture, parameters, or computational budget—can achieve an undetected-error rate below a floor determined by the total variation distance TV(μ₀, μ₁). When TV is close to zero—when the evidence looks nearly the same whether the claim is true or false—no monitor can do much better than chance. When TV is close to one—when the evidence clearly distinguishes the two cases—a near-perfect monitor is possible.
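For reference, the two-point form of the bound can be written out explicitly. The notation below is a sketch: ψ denotes any monitor that outputs 1 ("claim true") or 0 ("claim false"), and the second line assumes, for illustration, that the two cases are equally likely a priori.

```latex
% Best achievable sum of the two error probabilities for any binary monitor \psi:
\[
  \inf_{\psi}\,\Bigl[\mu_0(\psi = 1) + \mu_1(\psi = 0)\Bigr]
  \;=\; 1 - \mathrm{TV}(\mu_0, \mu_1),
  \qquad
  \mathrm{TV}(\mu_0, \mu_1) \;=\; \sup_{A}\,\bigl|\mu_0(A) - \mu_1(A)\bigr| .
\]
% With equal prior weight on "claim false" and "claim true" (an assumption of this sketch):
\[
  P_{\mathrm{err}}(\psi)
  \;=\; \tfrac12\,\mu_0(\psi = 1) + \tfrac12\,\mu_1(\psi = 0)
  \;\ge\; \tfrac12\,\bigl(1 - \mathrm{TV}(\mu_0, \mu_1)\bigr).
\]
```

At TV = 1 the floor is zero, and as TV tends to zero the floor tends to ½, which is the chance-level behavior described above.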

This is precisely why the D/P split exists. Claims in D have TV near one: a schema validator, for example, is a channel whose output is perfectly correlated with schema conformance. Claims in P have TV less than one, sometimes dramatically less. A claim about a specific historical event, asserted by a model whose only evidence is the statistical regularity of its training distribution, has a TV that depends on how distinctively that event left traces in the training data. For rare events, events that were recorded inaccurately, or events with no corroborating documentation, TV may approach zero. For those claims, the floor on the undetected-error probability approaches ½ — the monitor’s flag carries essentially no information about the claim’s truth, so the only sound move is abstention.

Kalai et al.’s singleton-fact precondition can be recovered from this information-theoretic argument. 1 A fact observed exactly once in training, with no corroborating channel, contributes a model-output distribution that is nearly identical whether the fact is true or false: the model has seen only one signal, and that signal is the fact itself. The TV between “fact true” and “fact false” evidence distributions is near zero. No monitor working from model outputs alone can reliably detect when such a fact is hallucinated. The claim is structurally in P and, given the near-zero TV, structurally near the floor where abstention becomes mandatory.

The data-processing inequality closes the argument. A gate is post-processing: it applies a function to the model’s output. No function applied to a channel’s output can increase the mutual information between that output and the ground truth—post-processing can only preserve or reduce information, never create it. A gate that asserts confidently on claims where the evidence is ambiguous is not certifying the claims; it is asserting a confidence that the evidence does not support. The information is not there to be extracted by a cleverer gate design. The gate can only route—it can decide whether to present the uncertain output or to withhold it—but it cannot manufacture the evidentiary basis for certainty that the generation process did not provide.
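Stated symbolically, with notation assumed for this sketch (Y is the ground truth of the claim, X is the model output available to the gate, and g is the gate, which may be any deterministic or randomized function of X):

```latex
% The gate sees Y only through X, so Y -> X -> g(X) is a Markov chain.
\[
  Y \;\longrightarrow\; X \;\longrightarrow\; g(X)
  \quad\Longrightarrow\quad
  I\bigl(Y;\, g(X)\bigr) \;\le\; I(Y;\, X).
\]
```

The only way to raise the right-hand side is to give the gate evidence beyond X, that is, an additional channel.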

Aviation and banking independently engineered their way toward higher TV, without naming it that. Aviation’s requirement for redundant, independent measurements—multiple independent satellite pseudoranges in RAIM, independent terrain awareness systems, independent crew monitoring—is engineering to increase the TV between “failure mode” and “no failure mode” distributions in the aggregate observable channel. The more independent observations, the larger the KL divergence between distributions, the lower the Fano floor, the better any monitor can perform. Banking’s requirement for independent P&L attribution—a hypothetical P&L computed from the model alone, compared against the actual P&L including market moves—creates a second, independent channel for detecting model miscalibration. Both industries invested heavily in creating new evidentiary channels precisely because the Fano floor on their existing channels was not low enough for the consequences they were managing.

Section 08

The Ethics Is Over-Determined: A Kantian Floor, a Utilitarian Objective

The architecture described in the preceding sections (deterministic floor, bounded probabilistic residual, independent monitoring, mandatory abstention, consequence-scaled tolerance) is not a pragmatic engineering compromise. It is what ethics requires, under two rival traditions that ordinarily disagree. The convergence of the consequentialist and deontological arguments on the same design is the strongest available robustness result: the architecture is invariant under your choice of normative theory.

The consequentialist argument is the more familiar one. Set the gate’s tolerance by expected disutility: how bad is a false assertion of this type, how often will it occur, and what is the aggregate cost? Scale the abstention threshold until the expected cost of false assertions among answered queries is acceptable. This framing is instrumentally coherent and practically tractable. It is also, on its own, unstable. Expected-disutility minimization will trade a catastrophic outcome affecting a small number of people against a large aggregate benefit accruing to many—if the numbers work out, it will do so even when the catastrophic outcome involves treating a specific person as a means to the aggregate end. The integrity floor—the constraint that undetected catastrophic false assertions must be held below a hard ceiling, not merely minimized in expectation—has to be bolted onto the utilitarian framework as a separate constraint. It does not emerge from the framework’s own logic.

Kant supplies what the consequentialist structure cannot. In the Kantian frame, rights and strict duties are side-constraints, not goods to be maximized: they bound the feasible region rather than entering the objective. 21 24 The integrity floor is precisely this: a constraint the optimization runs inside, never a quantity it trades off. The utilitarian objective governs the choice within the feasible region; the Kantian constraints define the region’s boundary. As Nozick observed, rights function as side-constraints on the pursuit of aggregate welfare: they are not just very large negative utilities to be weighed against large positive benefits; they are structural prohibitions. 24 Aviation’s catastrophic failure ceiling and banking’s knockout rules are both examples of this structure: they are not the output of an optimization that happened to produce a very low probability; they are hard limits that the optimization cannot breach.
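
The difference between a side-constraint and a large penalty can be shown in a few lines. The sketch below is a minimal illustration under stated assumptions: `Policy`, `choose_policy`, and `choose_by_penalty` are hypothetical names, and the utilities and rates are invented numbers, not measurements. The first function filters by the ceiling before optimizing, so the ceiling never enters the objective. The second folds the catastrophic rate into the objective as a weighted cost, which is the structure the text argues is unstable.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Policy:
    name: str
    expected_utility: float   # aggregate benefit of running this gate setting
    catastrophic_rate: float  # rate of undetected catastrophic false assertions


def choose_policy(policies: Sequence[Policy], ceiling: float) -> Optional[Policy]:
    """Side-constraint form: the ceiling defines the feasible region and
    utility is maximized only inside it. Returns None when nothing is
    feasible, which corresponds to abstaining."""
    feasible = [p for p in policies if p.catastrophic_rate <= ceiling]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p.expected_utility)


def choose_by_penalty(policies: Sequence[Policy], penalty: float) -> Policy:
    """Penalty form: catastrophic error is one more cost in the objective,
    so a large enough benefit can always buy a breach."""
    return max(policies, key=lambda p: p.expected_utility - penalty * p.catastrophic_rate)


if __name__ == "__main__":
    options = [
        Policy("permissive", expected_utility=10.0, catastrophic_rate=1e-3),
        Policy("strict", expected_utility=7.0, catastrophic_rate=1e-7),
    ]
    print(choose_policy(options, ceiling=1e-6))
    print(choose_by_penalty(options, penalty=1000.0))
```

With these hypothetical numbers the constrained chooser selects the strict policy, while the penalty chooser selects the permissive one because its extra utility outweighs the weighted cost. Raising the penalty changes which policy wins, but never removes the trade itself.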

Kant’s distinction between Würde (dignity) and Preis (price) adds a second, orthogonal dimension to the D/P decomposition. 21 Claims that, if false, would treat a person merely as a means—fabricated evidence of wrongdoing, medical falsehoods, missed sanctions hits that enable harm—carry a dignity constraint that is not fungible at any accuracy rate. These claims demand treatment as D-claims even when they are epistemically in P: when no oracle is available, the only sound response is abstention. The information-theoretic argument and the deontological argument converge on the same action for independent reasons. A guarantee that two independent frameworks demand is far more robust than either alone.

The Categorical Imperative’s formulations map to architectural requirements. The Universal Law formulation asks whether the emission-maxim could be willed universally: a maxim of “assert confidently regardless of evidentiary basis” cannot be universalized without destroying the institution of assertion itself—an institution that depends on asserters believing what they say. 21 Harry Frankfurt’s analysis of bullshit—indifference to the distinction between truth and falsehood, as distinct from lying—is, under this formulation, a failure of the Universal Law. Frankfurt’s parasitism point and Kant’s contradiction-in-conception converge here: bluffing’s success presupposes the very trust its universalization would destroy. 28 Abstention universalizes; bluffing does not. This is not a contingent empirical claim; it is a structural feature of what assertion means.

The Humanity formulation demands that users be treated as ends in themselves, not merely as means. 21 Projecting false confidence treats the user as a means to a helpful-seeming interaction: the system performs helpfulness at the user’s epistemic expense. Surfacing calibrated uncertainty—saying “my confidence in this answer is 70%, and here is why”—respects the user as a rational agent capable of incorporating that uncertainty into their decisions. The reattribution of residual error to the specification is, under this formulation, an act of respect for user autonomy: it places accountability where it belongs rather than hiding it behind false confidence.

The Publicity formulation, from Kant’s “Toward Perpetual Peace,” holds that a maxim is impermissible if it cannot survive public disclosure. 22 Per-emission traceability—a certificate that traces every output to the governing rule that authorized it—is the publicity principle computationalized. Any output that could not survive the public disclosure of its governing rule should not be emitted. This is a stronger constraint than consistency: it demands that the authorization itself be defensible under scrutiny.

The Kingdom of Ends formulation demands that governance rules be co-legislated by all those subject to them. 21 The rules that set the gate’s tolerances and classify the severity of different error types must be ones that the affected parties could endorse as co-legislators—not an externally imposed utility sum. This is the contractualist refinement: Rawls’s Kantian interpretation of justice as fairness and Scanlon’s principles no one could reasonably reject both express the same demand. 25 26 The practical form of this requirement is participatory governance of the value specification: the people who bear the consequences of false assertions should have voice in setting the tolerances. Anthropic’s effort to widen the conversation about how AI systems should behave is, in the vocabulary of the Kingdom of Ends, an effort toward co-legislation. 27

Three genuine tensions deserve acknowledgment. First, a strict Kantian absolutist, reading Kant’s 1797 essay on lying, might hold that any assertion made with nonzero probability of falsehood is impermissible. 23 The defensible resolution is that a calibrated, hedged assertion—one that surfaces its uncertainty rather than claiming certainty it lacks—does not assert what the speaker does not believe. Honest uncertainty is not a lie; bluffed confidence is. The Kantian indictment falls on false confidence, not on calibrated probability estimates. Second, even within the feasible region, the optimizer trades coverage against error, one claim’s risk against another’s; the dignity axis removes dignity-touching claims from this trade, but that requires the router to make a moral classification, not merely an epistemic one. Routing soundness is now both technically and ethically load-bearing. Third, and most honestly: the Kantian apparatus presupposes an autonomous rational will as its subject. The humans in the governance loop—designers, specification authors, operators—are the moral agents; the AI system is the instrument through which they discharge their duties. Moral responsibility for the architecture’s outputs rests with the humans who built and operate it, not with the system.

Section 09

Governed AI You Can Certify: The Posture and Its Honest Limits

The architecture that emerges from the preceding analysis has three components, each corresponding to a part of the guarantee. On the deterministic bucket: conformance gates implemented as policy-as-code (OPA/Rego or equivalent) that check every D-claim before it is emitted, with default-deny on bucket-membership ambiguity. On the probabilistic bucket: a conformally calibrated abstention policy with a tolerance set by consequence—tighter for high-blast-radius claims, looser for low-stakes ones. On the traceability dimension: a per-emission certificate that traces every output back to the governing rule that authorized it, so that the authorization can be reviewed, audited, and, when necessary, corrected.

The three-part guarantee is precise. First, zero false assertions on the governed/checkable class: if the gate is sound and default-deny, D-claims are never falsely asserted. The residual from specification errors is acknowledged and traced to the specification. Second, bounded and measured error on the probabilistic class: the rate of undetected false assertions among answered P-claims is held below the consequence-scaled integrity tolerance, with the bound validated on calibration data and monitored continuously in deployment. Third, every output is traceable to the authorizing rule: there is no “black box” decision that cannot be inspected and defended. 27
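
The shape of the three-part guarantee can be sketched as a single gate function. This is a minimal illustration, not the implementation described in this paper: plain Python stands in for the policy-as-code layer (OPA/Rego or equivalent), and `Claim`, `Certificate`, `gate`, the rule identifiers, and the oracle interface are all hypothetical names introduced for the example. The oracle is assumed to return `True`, `False`, or `None` when it cannot decide, and the probabilistic threshold is assumed to have been calibrated elsewhere.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Claim:
    text: str
    bucket: Optional[str]    # "D", "P", or None when bucket membership is ambiguous
    confidence: float = 0.0  # verifier score, consulted only for P-claims


@dataclass(frozen=True)
class Certificate:
    claim_text: str
    rule_id: str             # the governing rule that authorized the emission


def gate(
    claim: Claim,
    oracle: Callable[[str], Optional[bool]],
    p_threshold: float,
) -> Optional[Certificate]:
    """Return a certificate if the claim may be emitted, else None (abstain)."""
    if claim.bucket == "D":
        # Deterministic bucket: only a positive oracle verdict authorizes emission.
        if oracle(claim.text) is True:
            return Certificate(claim.text, "D/oracle-verified")
        return None
    if claim.bucket == "P":
        # Probabilistic bucket: emit only above the calibrated threshold.
        if claim.confidence > p_threshold:
            return Certificate(claim.text, "P/calibrated-threshold")
        return None
    # Default-deny: unknown or ambiguous bucket membership never emits.
    return None


if __name__ == "__main__":
    facts = {"2 + 2 = 4": True, "2 + 2 = 5": False}
    print(gate(Claim("2 + 2 = 4", "D"), facts.get, p_threshold=0.9))
    print(gate(Claim("it will rain", None, 0.99), facts.get, p_threshold=0.9))
```

Every path that emits returns a certificate naming a rule, and every other path returns `None`, so an output without a certificate cannot exist by construction. The sketch inherits the limits discussed in this section: it is only as sound as the oracle and the bucket assignment it is given.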

This posture operationalizes in five concrete primitives. Conformance gates are the deterministic enforcement layer. A stop-hook DAG is an independent checkpoint invoked before consequential actions—structurally analogous to RAIM and to banking’s four-eyes review: it fires independently of the generator, after the generator has produced its output but before that output is committed to a consequential action. Blast-radius tiering scales the gate’s abstention threshold to the severity of a potential false assertion, implementing the DAL/materiality logic at the application layer. A Trust Dial exposes the coverage/integrity trade-off as a governance parameter, allowing operators to set how conservative the gate is by consequence class. Repo-Audit and field-level liveness contracts provide the traceability certificate—every output traces to the rule that authorized it.
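
One way blast-radius tiering and a dial could interact is sketched below. This is an assumption about how such a parameter might be wired, not a description of the Trust Dial itself: the tier names, the ceiling values, and the `tolerance` function are hypothetical. The design choice illustrated is that the dial can only tighten a tier's tolerance, never loosen it past the tier's ceiling, and that an unrecognized tier tolerates no error at all.

```python
# Hypothetical per-tier ceilings on the tolerated false-assertion rate.
TIER_CEILING = {
    "low": 0.10,     # low-stakes claims
    "medium": 0.01,
    "high": 0.001,   # high-blast-radius claims
}


def tolerance(tier: str, dial: float) -> float:
    """Tolerated false-assertion rate for a consequence tier.

    dial ranges over [0, 1]: 0 is the most conservative setting (tolerate
    nothing, abstain whenever in doubt) and 1 uses the full tier ceiling.
    """
    if not 0.0 <= dial <= 1.0:
        raise ValueError("dial must lie in [0, 1]")
    if tier not in TIER_CEILING:
        # Default-deny: an unclassified consequence class gets zero tolerance.
        return 0.0
    return TIER_CEILING[tier] * dial


if __name__ == "__main__":
    for tier in ("low", "medium", "high", "unclassified"):
        print(tier, tolerance(tier, dial=0.5))
```

Because the ceiling is a lookup and the dial is a multiplier bounded by one, an operator can trade coverage for integrity within a tier but cannot configure a high-consequence claim to be treated like a low-consequence one.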

The honest limits deserve the same prominence as the guarantees.

Routing soundness is the single most critical load-bearing assumption. If a P-claim is misclassified as a D-claim and routed to the zero-error path, the system will emit it without the probabilistic tolerance that should apply, and the guarantee on D-claims is violated. The router itself is a classifier, and classifiers can err. The architectural response is default-deny on routing ambiguity: when the routing decision is uncertain, treat the claim as P and apply the probabilistic tolerance, rather than treating it as D and asserting it unconditionally. This means some D-claims will be handled conservatively when the router is uncertain, accepting a coverage cost to protect the integrity guarantee. Where dignity-touching claims are at stake, the router faces an additional burden: classifying whether a claim implicates a person’s dignity is a moral classification, not merely an epistemic one—strictly harder and less verifiable than routing on evidentiary grounds alone. For such claims, default-deny on moral ambiguity becomes a duty rather than a conservative option.
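
The routing rule described above can be written as a small decision function. The sketch is illustrative and rests on assumptions not in the text: the router is assumed to output a probability that the claim is checkable (`p_checkable`) and a probability that it touches a person's dignity (`p_dignity`), and the two thresholds are hypothetical values. `Route` and `route` are names introduced for the example.

```python
from enum import Enum


class Route(Enum):
    D = "deterministic"
    P = "probabilistic"
    ABSTAIN = "abstain"


def route(
    p_checkable: float,
    p_dignity: float,
    has_oracle: bool,
    checkable_floor: float = 0.99,   # hypothetical: how sure the router must be to use D
    dignity_ceiling: float = 0.01,   # hypothetical: how sure it must be that dignity is not at stake
) -> Route:
    """Default-deny routing on both epistemic and moral ambiguity."""
    confidently_checkable = has_oracle and p_checkable >= checkable_floor
    if p_dignity > dignity_ceiling:
        # Dignity cannot be ruled out: the claim must be oracle-verified
        # or withheld. It is never sent to the probabilistic path.
        return Route.D if confidently_checkable else Route.ABSTAIN
    if confidently_checkable:
        return Route.D
    # Epistemic ambiguity: fall back to the probabilistic tolerance.
    return Route.P


if __name__ == "__main__":
    print(route(p_checkable=0.999, p_dignity=0.0, has_oracle=True))
    print(route(p_checkable=0.80, p_dignity=0.0, has_oracle=True))
    print(route(p_checkable=0.80, p_dignity=0.30, has_oracle=False))
```

The asymmetry is deliberate: ambiguity about checkability costs coverage by sending the claim to the probabilistic path, while ambiguity about dignity removes the probabilistic path altogether. The function does not make the router's classifications any more reliable; it only fixes what happens when they are uncertain.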

The conformal guarantee rests on exchangeability. Conformal prediction’s finite-sample coverage guarantee holds when the calibration examples and deployment examples are drawn exchangeably from the same distribution. Under distribution shift—when the deployment distribution differs from the calibration distribution—the guarantee degrades. This is not a failure of the mathematical framework; it is an honest description of its scope. The mitigation is to treat distribution shift as a monitored, recalibrated risk: measure the empirical exception rate in deployment continuously, compare it against the calibrated tolerance, and trigger recalibration when the two diverge. This is exactly what aviation does with onboard performance monitoring and what banking does with the traffic-light test. Distribution shift is not a solved problem; it is a managed one.
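
Calibration and monitoring can be sketched together. The first function below is a standard split-conformal quantile, applied here under an assumption: the calibration scores are verifier confidences measured on claims known to be false, so an exchangeable false claim exceeds the returned cut-off with probability at most `alpha`. That bounds the rate at which false claims pass the gate, which is related to but not identical with the error rate among answered claims. The second function follows the Basel traffic-light logic, which zones an exception count by its cumulative binomial probability at 95% and 99.99%; applying it to an audited sample of emitted claims is an assumption of this sketch, and all function names are hypothetical.

```python
from math import ceil, comb
from typing import Sequence


def conformal_threshold(false_claim_scores: Sequence[float], alpha: float) -> float:
    """Cut-off that an exchangeable false claim exceeds with probability <= alpha.

    Uses the finite-sample rank ceil((n + 1) * (1 - alpha)). If the calibration
    set is too small for that rank to exist, returns infinity, so nothing
    clears the gate.
    """
    n = len(false_claim_scores)
    rank = ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(false_claim_scores)[rank - 1]


def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def traffic_light(exceptions: int, n: int, tolerance: float) -> str:
    """Zone an observed exception count against the calibrated tolerance."""
    cumulative = binomial_cdf(exceptions, n, tolerance)
    if cumulative < 0.95:
        return "green"
    if cumulative < 0.9999:
        return "yellow"
    return "red"   # the trigger for recalibration


if __name__ == "__main__":
    scores = [i / 100 for i in range(1, 100)]   # hypothetical calibration scores
    print(conformal_threshold(scores, alpha=0.1))
    print(traffic_light(exceptions=6, n=250, tolerance=0.01))
```

The threshold is only valid while deployment data remain exchangeable with the calibration set, which is why the monitor exists. A red zone does not prove that shift has occurred; it says the observed count is implausible under the calibrated tolerance and that recalibration is due.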

Monitor independence sets the Fano floor. The integrity guarantee is only as good as the independence of the monitoring channel from the generation channel. A monitor built from the same model, fine-tuned on the same data, inherits the same blind spots. When the model hallucinates for a systematic reason—a systematic gap in training data, a consistent bias in the pre-training corpus—a correlated monitor will fail in the same cases. The Fano floor is set by the KL divergence between the true and false evidence distributions in the monitoring channel, and a correlated channel has a lower KL divergence than an independent one. Genuine independence—dissimilar architectures, independent data sources, independent human review for the highest-consequence decisions—is not organizational overhead; it is the mechanism that makes the floor lower.

Finally, the zero-false-assertion guarantee on D is conditional on the completeness and correctness of the oracle. If a verifiable claim is incorrectly classified as unverifiable, it falls outside D and receives only the probabilistic bound; if the oracle itself contains a specification error, a D-claim can pass the gate and still be false. In either case the zero-false-assertion guarantee does not hold for that claim. The guarantee is correctness relative to specification, not correctness relative to ground truth. This is not a weakness unique to this architecture; it is the honest statement of what any formal verification system provides. The value is that errors are localized to named, auditable, fixable specifications rather than diffused into opaque model behavior.

The closing argument is the one that aviation and banking each earned through decades of practice: govern AI behavior like a safety-critical system, and audit it like a regulated one. Not because the analogy is useful rhetorically, but because the problems are structurally identical and the solutions have been verified at scale. The discipline is available. The mathematics is shared. The precedent is certified.
