Accuracy Is the Wrong Variable
Begin with the arithmetic that makes the headline number deceptive. Suppose a screening model for a serious but uncommon condition — say one case in two hundred patients — and suppose the model simply predicts no disease for everyone. It will be correct on one hundred ninety-nine of every two hundred patients and post an accuracy of ninety-nine and a half percent. It will also miss every single case of the disease. Accuracy rewarded the model for the easy negatives and charged it nothing for the catastrophic misses, because accuracy is a count of correct calls that is blind to which calls they were. In a domain where the rare positive is the entire reason the system exists, a metric dominated by the common negative measures the wrong thing with great precision.
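The arithmetic above can be checked in a few lines. What follows is a minimal sketch on a hypothetical cohort of two hundred patients with exactly one true case; the function name `confusion_counts` and the cohort itself are illustrative assumptions, not data from any study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 means disease."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

# Hypothetical cohort: one case in two hundred patients.
y_true = [1] + [0] * 199
# The "always negative" model predicts no disease for everyone.
y_pred = [0] * 200

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)   # counts correct calls, blind to which ones
sensitivity = tp / (tp + fn)         # fraction of true cases caught
print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} missed={fn}")
```

The accuracy comes out at 0.995 while the sensitivity is zero and the single case is missed, which is the whole objection in miniature.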
Clinical medicine retired the single-number summary long ago and replaced it with a pair that pulls the errors apart. Sensitivity is the fraction of true cases the test catches — true positives over true positives plus false negatives — and it is the term the no-disease model drove to zero. Specificity is the fraction of true non-cases the test clears. 56 The two are in tension, and the tension is where the harm lives: a test tuned for high specificity raises few false alarms but lets true cases slip through as false negatives, while a test tuned for high sensitivity catches the cases at the cost of a flood of false positives. A single accuracy figure hides which side of that trade the model sits on, and therefore hides whether its errors are the survivable kind or the lethal kind.
The predictive values complete the picture and explain why even a strong sensitivity and specificity can mislead. Positive predictive value is the probability that a flagged patient truly has the condition; negative predictive value is the probability that a cleared patient truly does not. 6 Unlike sensitivity and specificity, the predictive values depend on prevalence — apply a fixed test to a population where the disease is rare and the positive predictive value collapses, because the false positives swamp the few true ones. 516 This is why a model can be validated as accurate in one setting and become a generator of false alarms in another without a single weight changing: the harm profile is a property of the deployment, not only of the model, and a number computed on the development set does not travel.
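The prevalence dependence described above follows directly from Bayes' rule, and a short sketch makes the collapse visible. The sensitivity and specificity of 0.90 and the three prevalence values are hypothetical figures chosen for illustration only.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (ppv, npv) for a fixed test applied at a given prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    ppv = tp / (tp + fp)  # P(disease | flagged)
    npv = tn / (tn + fn)  # P(no disease | cleared)
    return ppv, npv

# Same hypothetical test, three hypothetical deployment settings.
results = {}
for prevalence in (0.20, 0.05, 0.005):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    results[prevalence] = (ppv, npv)
    print(f"prevalence={prevalence:.3f} ppv={ppv:.3f} npv={npv:.3f}")
```

With these assumed inputs the positive predictive value falls from roughly 69 percent at a prevalence of one in five to about 4 percent at one in two hundred, with no change to the test itself.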
Ninety-five percent accuracy tells you how often the model is right. It tells you nothing about what it costs when the model is wrong — and in medicine, that cost is the only thing that matters.
The asymmetry has a quantitative name that the engineering side of AI rarely uses and the clinical side never forgets. The Number Needed to Harm is the count of patients exposed to an intervention to produce one additional adverse outcome — the reciprocal of the absolute increase in risk the intervention causes. 7 Read it as the unit in which a false negative is priced. If a missed diagnosis foregoes a treatment that would have prevented a death, the harm attached to that single false negative can be measured in lives, while the harm of a false positive — an unnecessary follow-up test, a course of monitoring — is measured in inconvenience and cost. The magnitudes are not rhetorical. In sepsis, a condition where a screening model's whole purpose is early detection, each additional hour to antibiotics is associated with measurably higher in-hospital mortality across tens of thousands of patients. 24 A false negative that delays recognition does not cost a percentage point of accuracy; it costs hours, and the hours cost lives. A governing metric that treats those two errors as interchangeable units of inaccuracy is committing a category error. The errors differ not in frequency but in consequence, and consequence is exactly what accuracy refuses to see.
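The reciprocal relationship in the definition above is simple enough to state as code. This is a sketch only: the two risk figures are hypothetical and the function name is an illustrative assumption.

```python
def number_needed_to_harm(risk_exposed, risk_unexposed):
    """NNH = 1 / absolute risk increase caused by the exposure."""
    absolute_risk_increase = risk_exposed - risk_unexposed
    if absolute_risk_increase <= 0:
        raise ValueError("no absolute risk increase, so NNH is undefined")
    return 1 / absolute_risk_increase

# Hypothetical: adverse-outcome risk of 6% with the exposure, 4% without.
nnh = number_needed_to_harm(0.06, 0.04)
print(f"NNH is about {nnh:.0f} patients per additional adverse outcome")
```

A two-point absolute increase in risk corresponds to one additional adverse outcome for roughly every fifty patients exposed.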
This is not a fringe statistical objection; it is the settled position of the clinical-AI ethics literature. When machine learning entered medicine in earnest, the first careful papers warned that a model must be evaluated by its consequences for patients rather than by its technical performance on a held-out set, because the two can diverge sharply, and the divergence always favors the metric over the patient. 21 A model optimized to maximize accuracy or AUROC is optimized for a quantity that is silent about who is harmed when it errs, and an optimizer that cannot see harm will trade a rare catastrophic miss for a marginal gain on the common case every time, because that trade improves the number it was told to improve. The harm-weighted view is the correction the field reached for almost immediately, and it is the same correction the regulated-device world had already codified decades earlier.
ISO 14971: Risk Is Severity Times Probability
The medical-device industry did not arrive at the harm-weighted view by argument; it was forced there by the standard that governs every device sold into a regulated market. ISO 14971, now in its 2019 third edition, defines risk in one sentence that ends the accuracy debate before it begins: risk is the combination of the probability of occurrence of harm and the severity of that harm. 1 Two terms, multiplied, not one. A failure that is common but trivial and a failure that is rare but fatal can carry the same nominal error rate and occupy entirely different positions on the risk grid, because the grid has a severity axis that an accuracy score does not.
The standard operationalizes this as a closed-loop process rather than a one-time score. A risk-management plan names the device and its intended use, sets objective acceptability criteria, and fixes the methods by which risk will be estimated. Risk analysis then estimates, for each identified hazard, both severity and probability — quantitatively where the data permit, qualitatively where they do not, on graded scales that typically run from negligible to catastrophic in severity and from improbable to frequent in probability. 1 Risk evaluation compares each estimate against the predefined acceptability criteria and decides whether control measures are required. Every step, every estimate, every residual risk accepted, lands in a risk management file — a durable, traceable record that an auditor can replay, which is the device industry's version of the decision trail that governs a model under bank supervision. 113
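The plan, analysis, and evaluation loop described above can be sketched as a small risk grid. Everything specific here is an assumption made for illustration: the five-level scales (only their endpoints come from the text), the use of a simple product as the combination rule, the acceptability threshold, and the two hazards and their ratings. The standard itself leaves scales and criteria to the manufacturer's plan.

```python
from enum import IntEnum

class Severity(IntEnum):       # intermediate labels are assumed
    NEGLIGIBLE = 1
    MINOR = 2
    SERIOUS = 3
    CRITICAL = 4
    CATASTROPHIC = 5

class Probability(IntEnum):    # intermediate labels are assumed
    IMPROBABLE = 1
    REMOTE = 2
    OCCASIONAL = 3
    PROBABLE = 4
    FREQUENT = 5

# Acceptability criterion fixed in the plan BEFORE any analysis (hypothetical).
MAX_ACCEPTABLE_SCORE = 8

def evaluate(hazard, severity, probability):
    """Estimate one hazard and compare it with the predefined criterion."""
    score = int(severity) * int(probability)
    return {
        "hazard": hazard,
        "severity": severity.name,
        "probability": probability.name,
        "score": score,
        "needs_control": score > MAX_ACCEPTABLE_SCORE,
    }

# The risk management file: one traceable record per hazard.
risk_file = [
    evaluate("false negative: missed sepsis",
             Severity.CATASTROPHIC, Probability.OCCASIONAL),
    evaluate("false positive: unnecessary work-up",
             Severity.MINOR, Probability.PROBABLE),
]
for record in risk_file:
    print(record)
```

In this toy file the rarer but catastrophic hazard demands a control measure and the commoner but minor one does not, a distinction a single error rate could not express.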
The reframe this forces onto clinical AI is precise. A diagnostic model is not a classifier with an accuracy figure; it is a medical device whose every misclassification is a hazard with a severity. A false negative on a sepsis predictor and a false positive on the same predictor are two different hazards with two different severities, and ISO 14971 demands they be estimated and controlled separately. 1 The failure mode the standard is most concerned with is the one this paper opened on: information presented to a clinician that is wrong, consequential, and unflagged. It is the medical analogue of the aviation display that shows Hazardously Misleading Information, an output past its tolerance with no alert to say so. 12 Under the harm-weighted view, that is the worst cell on the grid: high severity and, because there is no warning, a probability of reaching the decision-maker of one.
This is the same object that aviation software calls integrity. DO-178C does not certify airborne software by measuring how often it is correct; it sets verification rigor by the severity of what happens when the software is wrong — Design Assurance Level A for catastrophic effects down to E for none — and treats the bound on undetected error reaching an output as the quantity to be controlled. 11 Severity drives rigor; undetected error is the failure of integrity. ISO 14971 and DO-178C are two regulated industries reaching the identical conclusion from opposite ends of the same problem: you do not govern a safety-critical system by its average correctness. You govern it by the harm of its failures and the rate at which those failures escape detection.
One feature of the standard deserves emphasis because it is the discipline accuracy lacks entirely: the acceptability criteria are set before the analysis, in the risk-management plan, not inferred afterward from whatever the model happened to score. 1 A team must declare, in advance and in writing, which residual risks it will accept and which it will not, and then hold the device to that bar. An accuracy figure inverts this: the number is computed first and a threshold of acceptability is reverse-engineered to clear it, which is how a model with a reported area under the curve in the high seventies sails through a procurement review with no one having stated, in advance, what false-negative rate would be tolerable for the population it would screen. The pre-committed acceptability criterion is the same instrument bank supervision uses when it fixes a Value-at-Risk exception budget before the backtest runs rather than after — the tolerance is a commitment, not a result, and a commitment cannot be quietly relaxed to fit a disappointing number. 113
Why AUROC Hides the Harm-Bearing Tail
The favored summary statistic of clinical machine learning is not accuracy but the area under the receiver-operating-characteristic curve — AUROC, or the c-statistic. It is a better number than accuracy because it is prevalence-independent and integrates performance across every decision threshold at once. It is also, for exactly that reason, a summary that averages away the harm. AUROC measures the probability that the model ranks a random true case above a random non-case. It is a property of discrimination across the whole curve, and a model can earn a respectable AUROC while performing badly at the one operating threshold a clinician will actually use.
Consider the harm-bearing tail directly. A deployed model runs at a single chosen threshold, and at that threshold it produces a specific count of false negatives — the missed cases — and a specific count of false positives — the false alarms. Those two counts are where every clinical consequence of the model lives. AUROC collapses the entire family of thresholds into one scalar and reports none of them. Two models with the same AUROC can have radically different false-negative rates at the threshold of use; one may miss a quarter of cases and the other two-thirds, and the c-statistic will not distinguish them. 416 The number that looks rigorous is silent on precisely the quantity — the rate of confidently-wrong negatives — that determines whether the model is safe.
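A toy example shows how two models can share a c-statistic and still differ sharply at the threshold of use. The scores below are invented for illustration, and the threshold of 0.65 is an arbitrary assumption; AUROC is computed by direct pair counting, which matches the ranking definition given earlier.

```python
def auroc(pos_scores, neg_scores):
    """Probability a random true case outranks a random non-case (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def false_negative_rate(pos_scores, threshold):
    """Fraction of true cases scored below the alerting threshold."""
    missed = sum(1 for p in pos_scores if p < threshold)
    return missed / len(pos_scores)

# Hypothetical scores: same non-cases, two different models on the true cases.
negatives = [0.60, 0.50, 0.40, 0.30]
model_a_pos = [0.90, 0.80, 0.70, 0.10]
model_b_pos = [0.90, 0.55, 0.54, 0.45]
THRESHOLD = 0.65  # assumed deployed alerting threshold

auc_a = auroc(model_a_pos, negatives)
auc_b = auroc(model_b_pos, negatives)
fnr_a = false_negative_rate(model_a_pos, THRESHOLD)
fnr_b = false_negative_rate(model_b_pos, THRESHOLD)
print(f"A: auroc={auc_a:.2f} fnr={fnr_a:.2f}")
print(f"B: auroc={auc_b:.2f} fnr={fnr_b:.2f}")
```

Both toy models score an AUROC of 0.75, yet at the assumed threshold one misses a quarter of the true cases and the other misses three-quarters.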
This is the medical instance of a pattern a companion paper in this stack develops for autonomy budgets: an average is not a tail, and governing by the average is how the rare catastrophic case escapes the accounting entirely. 15 Banking made the same correction when it moved its market-risk measure from Value-at-Risk, a threshold, to Expected Shortfall, the average of the losses beyond the threshold, because a model can be well-calibrated on average and catastrophically wrong in the tail, and a threshold statistic will not see it. 15 AUROC is the clinical version of that blindness. It prices the model's discrimination on the comfortable middle of the curve and says nothing about the patient at the edge.
A high c-statistic certifies the model on the average case. The false negative is never the average case. It is the patient in the tail the summary statistic was built to smooth over.
The honest replacement is not a single number but a small set of them, reported at the deployed threshold and in the deployed population: sensitivity, specificity, positive and negative predictive value, and the false-negative count, each estimated on a representative validation set rather than inherited from the development paper. 516 The reason this matters is not statistical fastidiousness. It is that these are the quantities a risk-management file must carry under ISO 14971, and the quantities a clinician needs to weigh the harm of acting on the model against the harm of ignoring it. 1
There is a second quantity AUROC ignores entirely, and the clinical-prediction-model community has spent a decade trying to force it onto the page: calibration, the agreement between the risk a model predicts and the risk patients actually face. A model can rank patients correctly — high AUROC — while systematically overstating or understating their absolute risk, and a clinician acting on a miscalibrated probability acts on a number that means something other than what it says. The reporting standards built for exactly this problem, the TRIPOD+AI statement and its risk-of-bias companion PROBAST+AI, make calibration a required element of any external validation; yet surveys of the published literature find calibration assessed in only about a third of models, which means most deployed clinical models report discrimination and stay silent on whether their probabilities are true. 20 A harm-weighted governance regime cannot tolerate that silence, because the severity-weighted escape rate depends on the model's probabilities being what they claim. The next section is what happens when a model is deployed at scale on the strength of a reported number and the harm profile is never measured against reality.
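Calibration, as described above, can be inspected with nothing more than a binned comparison of predicted against observed risk. The sketch below is one simple way to do that, not the procedure any reporting guideline prescribes; the ten patients, their predicted probabilities, and their outcomes are all hypothetical.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Return (count, mean predicted risk, observed event rate) per occupied bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # equal-width bins on [0, 1]
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        obs_rate = sum(y for _, y in members) / len(members)
        rows.append((len(members), mean_pred, obs_rate))
    return rows

# Hypothetical model output: six low-risk and four high-risk predictions.
probs = [0.15] * 6 + [0.85] * 4
outcomes = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

rows = calibration_table(probs, outcomes)
for count, mean_pred, obs_rate in rows:
    print(f"n={count} predicted={mean_pred:.2f} observed={obs_rate:.2f}")
```

In this toy data the high-risk group is told its risk is 85 percent while only half of it has the event, so the stated probability overstates the risk the patients actually face.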
The Epic Sepsis Model: Observed Versus Assumed
The Epic Sepsis Model is the cautionary case for the entire argument, because it failed in exactly the shape the harm-weighted view predicts and the accuracy view conceals. The model is a proprietary sepsis-prediction tool embedded in the Epic electronic health record and, by the time it was independently examined, deployed across hundreds of U.S. hospitals on the strength of a vendor-reported area under the curve in the range of 0.76 to 0.83. 2 That number is the assumed performance: respectable discrimination, the kind of figure that clears a procurement review and goes live.
In 2021, Wong and colleagues at the University of Michigan published an external validation in JAMA Internal Medicine, run over 27,697 patients across 38,455 hospitalizations. 2 The observed performance was a different model. The area under the curve was 0.63, far below the vendor's claim. 2 Sensitivity was 33 percent at the alerting threshold the hospital used — the model failed to identify roughly two-thirds of the patients who actually developed sepsis. 23 Each of those misses is a false negative on a condition where delayed treatment kills — measurably, by the hour — and each was delivered with no flag: the model simply did not alert, and a clinician relying on it to catch deterioration received silence where a warning was due. 24 This is Hazardously Misleading Information in clinical form — an output below its required integrity, presented as if the absence of an alert meant the absence of risk. 12
The other half of the harm profile was alert burden. The model generated alerts on 18 percent of all hospitalized patients, and the published analysis implied a clinician would work through on the order of one hundred and nine alerts to find a single patient the model correctly identified who would not otherwise have been recognized. 23 That is the false-positive tail, and its harm is not benign: a high-volume, low-yield alert stream produces alert fatigue, the documented failure mode in which clinicians, conditioned by a torrent of false alarms, begin to dismiss alerts reflexively, including the true ones. 3 The phenomenon is not anecdotal: a systematic review and meta-analysis of drug-drug-interaction alerts found a pooled physician override rate of ninety percent, with a confidence interval of eighty-five to ninety-five percent. Clinicians dismiss nine in ten such alerts, including the appropriate ones, and acceptance falls measurably with each additional alert in an encounter. 19 A model that floods that channel is not merely producing false positives; it is eroding the channel's value for every other alert a clinician will ever see, which is a harm the false-positive count alone does not capture. The Epic model was simultaneously too quiet on the patients who were sick and too loud on the patients who were not, and both failures are invisible to the single AUROC figure that justified the deployment.
A reported c-statistic of 0.76, a real one of 0.63, two-thirds of sepsis cases missed, and alerts on nearly a fifth of all hospitalized patients: the gap between assumed and observed is the harm profile no one measured before the model went live at scale.
The lesson is not that the Epic model was uniquely bad. It is the structural shape of the failure: a high reported accuracy figure, an unmeasured harm profile in the actual deployment population, and no abstention — the model emitted a prediction on every patient and carried no mechanism to decline when its evidence was thin or out of distribution. Confidently wrong, at scale, with no warning. The external validation that surfaced all of this was not a research luxury; it was the missing risk-evaluation step that ISO 14971 requires and the missing outcomes analysis that bank supervision calls backtesting. 113 The harm was always there. It was simply never on the ledger until an independent team put it there.
A Second Failure: Synthetic Cases, Real Recommendations
The sepsis model failed on the screening side — wrong about who was sick. IBM's Watson for Oncology failed on the recommendation side, and the contrast sharpens the same point. According to internal company documents reported in 2018, Watson for Oncology produced multiple treatment recommendations that physicians and the company's own reviewers judged unsafe and incorrect — including, in one documented instance, a recommendation for a drug that risked severe or fatal hemorrhage in a patient already presenting with serious bleeding. 8 The recommendations were inconsistent with national treatment guidelines, and the system was in use or under evaluation at hundreds of hospitals while the issues were known internally. 8
The root cause is the cleanest illustration in the literature of a measured-on-the-wrong-distribution failure. Watson for Oncology was trained substantially on a small number of synthetic cases — hypothetical patients constructed by a handful of clinicians at a single institution — rather than on the outcomes of real, varied patients. 8 Whatever internal accuracy figure the synthetic cases supported was an answer to a question no patient asks. The harm profile that mattered was the rate of unsafe recommendations on real patients in real clinical contexts, and that rate was neither what the development data measured nor what was disclosed at the point of use.
Both cases collapse to the same three-part anatomy, and naming it is the point of pairing them. A high or untested reported performance, computed on a distribution that was not the deployment distribution; an unmeasured harm profile — no one had estimated the severity-weighted rate of the consequential errors in the real population; and no abstention — the system answered confidently on every input, including the ones where it should have declined. The generalization is unforgiving: high reported accuracy plus an unmeasured harm profile plus no mechanism to abstain produces confidently-wrong calls at scale, and the scale is what turns a model error into a clinical incident. 18 The medical lesson is the same one a companion paper draws for autonomy generally — measure failure on the distribution you will deploy on, price the tail, and build the system to refuse rather than guess when the evidence is thin. 15
Govern Integrity, Not Accuracy
The corrective is to change the governing variable. Stop asking how often is the model right and start asking how often is it wrong without a warning, and what does each such error cost. The first quantity is accuracy. The second is integrity, harm-weighted — and it is the quantity ISO 14971, DO-178C, and bank supervision all already govern, each under its own name. 11113 The KellerAI LLM-Agent Assurance Standard supplies the operational vocabulary for porting that variable into an AI system, and the mapping to clinical decision support is direct. 14
The first move is to set the escape tolerance, the acceptable rate of unflagged wrong outputs, by harm severity, per tier, rather than to one global number. Clinical claims are not interchangeable; a contraindication warning carries a different severity than a formatting suggestion, and the tolerable rate of a silent error must scale to that severity the way DO-178C scales rigor to its Design Assurance Level and ISO 14971 scales control to the risk estimate. 111 An irreversible, high-consequence recommendation (a dosing instruction, a stop-treatment call) sits at the top of the lattice, where the tolerable rate of confident, unflagged error approaches zero and the default in the face of thin evidence is to abstain and escalate to a clinician.
The second move is to sort each claim by whether it has an exact oracle, because that determines how its integrity can be controlled. The assurance standard draws this as the Bucket A / Bucket B distinction, and clinical decision support divides along it cleanly. 14
Bucket A is the deterministically checkable class: claims with an exact verifier. A great deal of clinical safety is Bucket A and is being left to a language model that should never have been asked: dose arithmetic is checkable against the drug's mg-per-kg bounds and the patient's weight; a contraindication or drug-interaction claim is checkable against a curated interaction database; an allergy conflict is a lookup. 14 Where an exact oracle exists, the escape rate is driven toward zero by the verifier, not estimated probabilistically, and a probabilistic tolerance is not permitted where a deterministic check is available. 14 That these checks work is not a hope; it is a measured result. Computerized order-entry systems with deterministic decision support (dose-range checking, drug-allergy and drug-drug-interaction screening, drug-disease contraindication checks) have reduced prescribing errors and adverse drug events by margins reported on the order of fifty to eighty percent in systematic reviews. 23 The lesson for clinical AI is pointed: a great deal of the safety envelope is already deterministically checkable, the checks are proven, and handing those claims to a probabilistic model that can only estimate is a regression from an oracle to a guess.
Bucket B is the open-world class where no exact oracle exists: a differential diagnosis, a prognosis, a judgment call. There, the discipline is to bound a maximum acceptable escape rate, measure it by backtesting on a held-out, representative, adversarially-stressed set with a stated confidence interval, and route above-tolerance or out-of-envelope cases to an independent verifier or to abstention. 14
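The Bucket A idea of checking dose arithmetic against mg-per-kg bounds and the patient's weight can be sketched as a plain deterministic function. The drug name, the bounds, and the `FORMULARY` table are hypothetical placeholders; a real check would read from a curated, clinically maintained source, and this sketch is not clinical guidance.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DoseBounds:
    min_mg_per_kg: float
    max_mg_per_kg: float

# HYPOTHETICAL bounds for a made-up drug, for illustration only.
FORMULARY = {"exampledrug": DoseBounds(min_mg_per_kg=5.0, max_mg_per_kg=15.0)}

def check_dose(drug, dose_mg, weight_kg):
    """Return (ok, reason). Anything the oracle cannot decide fails closed."""
    bounds = FORMULARY.get(drug)
    if bounds is None:
        return False, "no oracle entry: abstain and escalate to a clinician"
    if weight_kg <= 0:
        return False, "invalid weight: abstain and escalate to a clinician"
    per_kg = dose_mg / weight_kg
    if per_kg < bounds.min_mg_per_kg:
        return False, f"below range: {per_kg:.1f} mg/kg"
    if per_kg > bounds.max_mg_per_kg:
        return False, f"above range: {per_kg:.1f} mg/kg"
    return True, f"within range: {per_kg:.1f} mg/kg"

print(check_dose("exampledrug", 700, 70))    # 10 mg/kg, inside the assumed bounds
print(check_dose("exampledrug", 1400, 70))   # 20 mg/kg, above the assumed bounds
print(check_dose("unknowndrug", 100, 70))    # no entry, so the check declines
```

The design choice worth noting is that the checker never guesses: a missing entry or an implausible weight produces a refusal rather than a pass, which is what lets the verifier, rather than an estimate, carry the escape rate.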
Govern the rate at which the model is wrong without a warning, weighted by what being wrong costs. The checkable, check deterministically; the unprovable, bound and backtest; the thin, abstain. That is integrity, and accuracy was never it.
The third move is independent, qualified verification — no self-grading. A model's own confidence is not a check on the model's own claim; the two share the same blind spots and fail together. 14 In a clinical setting the independent verifier is, depending on the tier, a deterministic checker — the dose-arithmetic and interaction oracles above, which are independent of any model at any tier — or a clinician for the irreversible, high-consequence call. The Epic and Watson failures were both, at root, systems grading their own homework at scale: a sepsis score with no independent outcomes check until an external team built one, and an oncology recommender whose only check was the synthetic data that produced it. 28 Integrity governance makes the verifier a required, independent party selected by the gate — not an afterthought, and never the model itself.
The Obligations, Mapped to the Clinic
The assurance standard expresses these moves as cross-cutting obligations, and four of them carry the weight for clinical decision support. Each is stated here as its identifier and its named obligation together, because the identifier without the name is opaque and the name without the identifier is unauditable. 14
OBL-RES-001 — reversibility and rollback. The escape tolerance is scaled to harm, and the rollback horizon is sized to severity. 14 A clinical action that can be cheaply undone — surfacing a suggestion a clinician will review before acting — sits in a lower tier than one whose effect is irreversible once committed. The medicine ordered and administered, the treatment stopped, the discharge signed cannot be rolled back, and the obligation pushes those actions to the top of the lattice, where the tolerable rate of unflagged error is the tightest and abstention is the default. The question the obligation forces is the right one: if this is wrong, can it be taken back, and if not, has the verification scaled to that?
OBL-IND-001 — independent qualified verification. No claim grades itself; the gate, not the actor, selects the verifier. 14 This is the obligation the two cautionary cases violated. It requires that a consequential clinical claim pass a verifier chosen by the governance layer and independent of the model that produced it, before the claim is acted on — the dose oracle, the interaction database, or the clinician, depending on tier. The model proposes; the independent verifier disposes; the model never certifies its own output as safe.
OBL-VQ-001 — verifier qualification. A deterministic checker is independent at any tier; a model verifier qualifies only if it is of distinct lineage and its measured error-correlation with the actor sits at or below threshold. 14 In the clinic this is the rule that keeps a second language model from being mistaken for a real check. Two models that share training data share blind spots and fail on the same hard cases, so a second model that is wrong in the same way as the first is not a verifier at all. The deterministic oracles — dose arithmetic, contraindication lookup — are the strongest verifiers precisely because they are a different kind of mechanism that cannot inherit the model's errors.
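The qualification rule stated above reduces to a short predicate. This is a sketch of the rule as the paragraph describes it, not the standard's own implementation; the `Verifier` record, the lineage labels, and the correlation threshold of 0.2 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MAX_ERROR_CORRELATION = 0.2  # hypothetical threshold

@dataclass(frozen=True)
class Verifier:
    name: str
    kind: str                                # "deterministic" or "model"
    lineage: Optional[str] = None            # model family, if a model
    error_correlation: Optional[float] = None  # measured against the actor

def qualifies(verifier, actor_lineage):
    """Apply the qualification rule to one candidate verifier."""
    if verifier.kind == "deterministic":
        return True  # independent at any tier
    if verifier.kind == "model":
        # Unmeasured lineage or correlation cannot qualify.
        if verifier.lineage is None or verifier.error_correlation is None:
            return False
        return (verifier.lineage != actor_lineage
                and verifier.error_correlation <= MAX_ERROR_CORRELATION)
    raise ValueError(f"verifier kind not covered by this sketch: {verifier.kind}")

oracle = Verifier("dose-range check", "deterministic")
sibling = Verifier("second model, same family", "model", "family-x", 0.05)
distinct = Verifier("second model, other family", "model", "family-y", 0.10)
correlated = Verifier("other family, shared errors", "model", "family-y", 0.60)

for v in (oracle, sibling, distinct, correlated):
    print(v.name, qualifies(v, actor_lineage="family-x"))
```

A second model from the same family is rejected even when its measured correlation is low, and a model from a different family is rejected when its errors track the actor's, which is the point of requiring both conditions.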
OBL-TIER-001 — tier derivation. The consequence tier is the maximum over reversibility, scope, and consequence, and when any axis is undetermined the tier defaults to the highest. 14 The default-to-highest rule is the safety property that matters most in medicine: an unknown reversibility is treated as irreversible, an unknown consequence as high. A clinical AI that cannot determine the blast radius of its own recommendation does not get to assume the recommendation is harmless; it inherits the most conservative tier until the surface is known. This is the structural inverse of the accuracy view, which assumes a model is fine until proven wrong. The integrity view assumes a consequential action is dangerous until the evidence and the verification say otherwise.
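The derivation rule above is a maximum with a conservative default, and it fits in a few lines. The four-level integer lattice and the example ratings are assumptions for illustration; `None` stands in for an axis that has not been determined.

```python
from typing import Optional

HIGHEST_TIER = 3  # hypothetical lattice: 0 (lowest) through 3 (highest)

def derive_tier(reversibility: Optional[int],
                scope: Optional[int],
                consequence: Optional[int]) -> int:
    """Tier is the maximum over the axes; any unknown axis forces the highest tier."""
    axes = (reversibility, scope, consequence)
    if any(axis is None for axis in axes):
        return HIGHEST_TIER
    return max(axes)

# A reviewable suggestion: cheaply undone, narrow scope, moderate consequence.
print(derive_tier(reversibility=0, scope=1, consequence=2))
# Same action, but reversibility has not been established.
print(derive_tier(reversibility=None, scope=1, consequence=2))
```

The second call lands in the highest tier even though the known axes are modest, which is the default-to-highest property expressed directly.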
Read together, the four obligations are a clinical risk-management file in executable form. 1 OBL-TIER-001 performs the severity estimate ISO 14971 demands; OBL-RES-001 scales the control to it; OBL-IND-001 supplies the effective challenge that bank supervision and the device standard both require; and OBL-VQ-001 ensures the challenge is real rather than a second copy of the thing it is meant to check. 1314 The accuracy figure does not appear in any of them, because accuracy was never the variable being governed.
The Honest Limits
An argument that only flattered the harm-weighted view would not survive a clinician's scrutiny, and three limits bound it. The first is that severity itself is a judgment, not a measurement. ISO 14971's severity scale is calibrated by expert assessment, and reasonable clinicians will disagree about where a given misclassification sits. 1 The harm-weighted view does not dissolve that judgment; it relocates it from a hidden assumption inside an accuracy number to an explicit, recorded estimate in the risk file, where it can be argued with and revised. That is progress, not a solution — the call still has to be made by someone qualified to make it.
The second limit is that the harm-bearing tail is only estimable to the extent it is represented in the validation data. A backtest measures the escape rate on the cases it contains; a failure mode absent from the held-out set — a rare presentation, a subpopulation the development data underrepresented — is a tail the measurement never sampled, and no confidence interval over the represented cases will surface it. 1418 This is why dataset shift is a clinical safety issue and not a modeling footnote: a model validated in one hospital's population can degrade silently in another's as the case mix moves, which is exactly why the device and supervisory regimes require ongoing monitoring rather than a one-time validation. 18 The regulators have made this explicit and binding: the EU AI Act classifies clinical decision support and AI-enabled medical devices as high-risk, attaching conformity assessment, human-oversight, and post-market monitoring obligations that take full effect for medical-device AI within the next two years — ongoing surveillance is becoming a legal precondition of deployment, not a best practice. 22 The integrity number is a measured quantity relative to a sampled distribution, not a guarantee against the unsampled one.
The third limit is that deterministic verification only covers the checkable subset. Dose arithmetic and interaction lookups have exact oracles; a differential diagnosis does not, and the Bucket-B machinery that governs it bounds a probabilistic escape rate rather than driving it to zero. 14 The honest claim is therefore bounded: integrity governance drives the checkable errors toward zero and bounds the uncheckable ones to a consequence-scaled tolerance, monitored over time. It does not make a clinical model correct. It makes the model's failure rate a measured, harm-weighted, independently-verified quantity — which is the most any safety-critical discipline has ever promised, and far more than an accuracy score delivers. 111
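Bounding a Bucket-B escape rate means reporting an upper confidence limit, not just the observed rate. The sketch below uses the Wilson score interval as one reasonable choice; the source does not name an interval method, and the backtest sizes and the tolerance of one in a thousand are hypothetical.

```python
import math

def wilson_upper_bound(escapes, n, z=1.96):
    """Upper limit of the two-sided Wilson score interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = escapes / n
    denom = 1 + z * z / n
    centre = p_hat + z * z / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (centre + margin) / denom

TOLERANCE = 0.001  # hypothetical maximum acceptable escape rate for one tier

# Zero unflagged errors observed in two hypothetical backtests of different size.
for n in (1_000, 10_000):
    upper = wilson_upper_bound(escapes=0, n=n)
    verdict = "within tolerance" if upper <= TOLERANCE else "not demonstrated"
    print(f"n={n} upper_bound={upper:.5f} {verdict}")
```

Even with no escapes observed, a thousand cases cannot demonstrate a one-in-a-thousand tolerance at this confidence level; the bound only drops below it with the larger set. The interval still speaks only for the cases that were sampled, which is the limit the preceding paragraphs describe.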
These limits do not weaken the argument; they locate it. The harm-weighted view does not claim to eliminate clinical risk. It claims to put that risk on a ledger (severity estimated, escape rate bounded and backtested, verification independent, residual risk named and monitored) rather than to hide it inside a single number that looks reassuring and answers the wrong question. The residual is relocated to the specification, where it is visible and fixable, not buried in a confident summary. 1
The Engineering Posture Before Acting
Before deploying a clinical AI, the operator's posture should be set by harm, not by accuracy, and it collapses to four commitments.
Budget failures by harm severity. Estimate the severity of each class of error the way ISO 14971 demands, set the tolerable rate of an unflagged wrong output to that severity per tier, and let the irreversible, high-consequence claims carry the tightest budget and the abstention default. 114
Verify the checkable deterministically. Where an exact oracle exists (dose arithmetic, contraindication and interaction lookups, allergy conflicts), use it, drive that escape rate toward zero, and never hand a deterministically checkable safety claim to a model that can only estimate. 14
Abstain when the evidence is thin. Build the system to decline and escalate to a clinician rather than to answer confidently out of distribution, the missing move in both the sepsis model and the oncology recommender. 28
The fourth commitment is the one that ties the others together and is the title of this paper. Measure integrity, not accuracy. The number that governs a clinical AI is not the fraction of cases it gets right. It is the bounded, harm-weighted rate at which it is wrong without a warning, measured on the distribution it will deploy on, verified by a party independent of the model, and monitored as that distribution moves. 11418 An accuracy figure can be high while that rate is lethal, which is precisely what the Epic Sepsis Model's gap between a reported 0.76 and an observed 0.63 — two-thirds of sepsis cases missed in silence — demonstrated at the scale of hundreds of hospitals. 23
This paper is the second in a three-part stack on clinical AI governance, and the three articles compose. The first, Intended Use Is the Envelope, establishes the FDA Software-as-a-Medical-Device principle that a model's indication is its operating boundary: the scope inside which any integrity claim is even meaningful. 9 The third, The Clinician Is the Diversion Airport, takes up human oversight, post-market surveillance, and the FDA's Predetermined Change Control Plan, which together are the reachable fallback and the ongoing monitoring this paper's limits section insists on. 1017 The harm-tail accounting here rhymes with the ETOPS-and-banking sibling Priced in Failure-Rate Data, which prices the tail rather than the average in domains where being confidently wrong is expensive rather than lethal, and with the framework paper, The LLM-Agent Assurance Standard, which supplies the Bucket A / Bucket B sort, the tier lattice, and the obligations this paper maps onto the clinic. 1415
Medicine is the sharpest test of the whole discipline because it is the domain where being confidently wrong is lethal and irreversible — where the missed cancer cannot be un-missed and the fatal recommendation cannot be recalled. That is exactly why it is the right place to insist on the harder variable. The wrong question is how accurate is the model. The right question is how often is it wrong without a warning, what does each such error cost, who checked it, and could the patient have been harmed before anyone noticed. Govern that, and the autonomy you grant a clinical AI is autonomy you can defend at the bedside. Govern accuracy, and you are measuring the one thing that was never the point.