kellerai.blog

Ninety-Five Percent Accuracy, Two-Thirds of Sepsis Missed

A model reported at 95% accuracy says nothing about which 5% it gets wrong — and one missed cancer is not a thousand false alarms.

KellerAI White Paper · Engineering Discipline & Verification · Jun 2026

Context

A clinical model reported at 95% accuracy sounds safe and is almost meaningless — the figure says nothing about which 5% it gets wrong. One missed cancer is not the actuarial equal of a thousand benign false alarms; the harms are not on the same scale and do not net out. ISO 14971 has measured risk as severity times probability for decades.

The Finding

The governing variable is not accuracy but harm-weighted integrity — the bounded, severity-weighted rate at which a model is confidently wrong with no warning. Govern that, verify the checkable deterministically, and abstain when evidence is thin.

Tags: Harm-Weighted Risk · ISO 14971 · Clinical AI Safety · Integrity vs Accuracy · Deterministic Verification · Sensitivity & Specificity
Paper Details
Category: Engineering Discipline & Verification
Audience: Clinical-AI builders, medical-device risk leads, and health-system governance owners deploying decision-support models.
Method: Doctrine read-across, mapping ISO 14971 harm-severity risk management and clinical validation metrics to autonomous-agent escape-rate governance.
Length: ~2,000 · 8 min
Sections: 5
Date: Jun 2026
Authors: KellerAI
Section 01

Accuracy Is the Wrong Variable

Start with the arithmetic that makes the headline number deceptive. Take a screening model for a serious but uncommon condition — one case in two hundred patients — and let it predict no disease for everyone. It is correct on one hundred ninety-nine of every two hundred patients, posting an accuracy of ninety-nine and a half percent, and it misses every single case. Accuracy rewarded the easy negatives and charged nothing for the catastrophic misses, because it is a count of correct calls blind to which calls they were. In a domain where the rare positive is the entire reason the system exists, a metric dominated by the common negative measures the wrong thing with great precision.
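The arithmetic above can be checked in a few lines. This is only an illustrative sketch of the always-negative model on a population of two hundred patients with one true case; the variable names are ours, not part of any library.

```python
# Illustrative arithmetic only: 200 screened patients, one true case.
n_patients = 200
n_cases = 1

# The "always negative" model predicts no disease for every patient.
true_positives = 0                      # it never flags anyone
false_negatives = n_cases               # so the one real case is missed
true_negatives = n_patients - n_cases   # and every healthy patient is "correct"

# Accuracy counts correct calls without asking which calls they were.
accuracy = (true_positives + true_negatives) / n_patients

# Sensitivity asks only about the true cases.
sensitivity = true_positives / (true_positives + false_negatives)

print(f"accuracy    = {accuracy:.3f}")
print(f"sensitivity = {sensitivity:.3f}")
```

The two printed figures describe the same model: near-perfect accuracy alongside zero sensitivity.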

Clinical medicine retired the single number long ago, replacing it with a pair that pulls the errors apart. Sensitivity is the fraction of true cases the test catches, the term the no-disease model drove to zero. Specificity is the fraction of true non-cases it clears. The two are in tension, and the tension is where the harm lives: tune for specificity and true cases slip through; tune for sensitivity and false alarms flood in. A single accuracy figure hides which side of that trade the model sits on, and therefore hides whether its errors are survivable or lethal. The predictive values depend on prevalence as well: apply a fixed test where the disease is rare and the chance that a flagged patient is truly sick collapses. The harm profile is a property of the deployment, not only of the model, and a number computed on the development set does not travel.
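The prevalence effect follows directly from Bayes' rule. The sketch below holds a test fixed and varies only the prevalence. The 90 percent sensitivity and specificity are illustrative assumptions, not figures from any model discussed here, and the function name is ours.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Chance that a flagged patient is truly sick (Bayes' rule)."""
    true_pos = sensitivity * prevalence                    # sick and flagged
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # healthy but flagged
    return true_pos / (true_pos + false_pos)


# Same hypothetical test, three deployment populations.
SENSITIVITY = 0.90   # assumed for illustration
SPECIFICITY = 0.90   # assumed for illustration

for prevalence in (0.20, 0.05, 0.005):   # 0.005 is one case in two hundred
    ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:6.3f} -> PPV {ppv:.3f}")
```

Under these assumed figures the positive predictive value falls from roughly 69 percent at 20 percent prevalence to about 4 percent at one case in two hundred, with no change to the test itself.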

Ninety-five percent accuracy tells you how often the model is right. It tells you nothing about what it costs when the model is wrong — and in medicine, that cost is the only thing that matters.

What the headline number conceals

The asymmetry has a name the clinical side never forgets and the engineering side rarely uses. The Number Needed to Harm is the number of patients who must be exposed to an intervention to produce one additional adverse outcome, the unit in which a false negative is priced. A missed diagnosis that forgoes a life-saving treatment carries a harm measured in lives; the harm of a false positive, an unnecessary follow-up test, is measured in cost. In sepsis, each additional hour to antibiotics is associated with measurably higher in-hospital mortality across tens of thousands of patients: a false negative that delays recognition does not cost a percentage point of accuracy, it costs hours, and the hours cost lives. A metric that treats those two errors as interchangeable units of inaccuracy is committing a category error. They differ not in frequency but in consequence, and consequence is exactly what accuracy refuses to see.

Section 02

ISO 14971: Risk Is Severity Times Probability

The medical-device industry did not reason its way to the harm-weighted view; it was forced there by the standard that governs every device sold into a regulated market. ISO 14971 defines risk in one sentence that ends the accuracy debate before it begins: risk is the combination of the probability of occurrence of harm and the severity of that harm. Two terms, multiplied, not one. A failure that is common but trivial and a failure that is rare but fatal can carry the same nominal error rate and occupy entirely different positions on the risk grid, because the grid has a severity axis that an accuracy score does not.
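A small sketch can make the severity axis concrete. The standard speaks of a combination of probability and severity and leaves the scales to the manufacturer, so the 1-to-5 ordinal scales and the multiplication below are assumptions chosen for illustration (a common convention, not a requirement). The hazard names and scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hazard:
    name: str
    probability: int  # assumed ordinal scale, 1 (rare) to 5 (frequent)
    severity: int     # assumed ordinal scale, 1 (negligible) to 5 (fatal)

    @property
    def risk_score(self) -> int:
        # One common way to combine the two terms; other combinations
        # (for example a lookup matrix) are equally possible.
        return self.probability * self.severity


# Two errors with the same probability rank and very different severities.
false_alarm = Hazard("false positive: unnecessary follow-up test", 3, 1)
missed_case = Hazard("false negative: missed sepsis", 3, 5)

for hazard in (false_alarm, missed_case):
    print(f"{hazard.name:45s} risk score = {hazard.risk_score}")
```

An accuracy score would count both hazards as one error each; on the grid they sit five rows apart.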

The reframe this forces onto clinical AI is precise. A diagnostic model is not a classifier with an accuracy figure; it is a medical device whose every misclassification is a hazard with a severity. A false negative on a sepsis predictor and a false positive on the same predictor are two different hazards with two different severities, and the standard demands they be estimated and controlled separately. The failure mode it is most concerned with is the one this brief opened on: information presented to a clinician that is wrong, consequential, and unflagged. Aviation has a precise name for exactly this object — Hazardously Misleading Information, an output that exceeds its tolerance without a timely alert. Under the harm-weighted view it is the worst cell on the grid: high severity, and — because there is no warning — a probability of reaching the decision-maker of one.

One feature of the standard deserves emphasis because it is the discipline accuracy lacks entirely: the acceptability criteria are set before the analysis, in writing, not inferred afterward from whatever the model happened to score. An accuracy figure inverts this: the number is computed first and a threshold is reverse-engineered to clear it, which is how a model with a reported area under the curve in the high 0.70s sails through procurement with no one having stated, in advance, what false-negative rate the screened population can tolerate. The pre-committed tolerance is a commitment, not a result, and cannot be quietly relaxed to fit a disappointing number. Aviation reached the same conclusion from the opposite end: DO-178C sets verification rigor by the severity of what happens when software is wrong, not by how often it is correct. Two regulated industries govern the same object, the harm of a failure and the rate at which it escapes detection, under two names.

Section 03

The Epic Sepsis Model: Observed Versus Assumed

The Epic Sepsis Model is the cautionary case for the whole argument, because it failed in exactly the shape the harm-weighted view predicts and the accuracy view conceals. The model is a proprietary sepsis-prediction tool embedded in the Epic electronic health record. By the time it was independently examined, it was deployed across hundreds of U.S. hospitals on the strength of a vendor-reported area under the curve in the range of 0.76 to 0.83. That number is the assumed performance: respectable discrimination, the kind of figure that clears a procurement review and goes live.

Then an external validation, run over tens of thousands of hospitalized patients, reported the observed performance, and it was a different model. The area under the curve was 0.63, far below the vendor's claim. Sensitivity was 33 percent at the alerting threshold the hospital used — the model failed to identify roughly two-thirds of the patients who actually developed sepsis. Each of those misses is a false negative on a condition where delayed treatment kills by the hour, and each was delivered with no flag: the model simply did not alert, and a clinician relying on it received silence where a warning was due. This is Hazardously Misleading Information in clinical form — an output below its required integrity, presented as if the absence of an alert meant the absence of risk.

A reported c-statistic of 0.76, a real one of 0.63, two-thirds of sepsis cases missed in silence — the gap between assumed and observed is the harm profile no one measured before the model went live at scale.

The lesson of the external validation

The other half of the harm profile was alert burden. The model generated alerts on a fifth of all hospitalized patients — a high-volume, low-yield stream that produces alert fatigue, the documented failure mode in which clinicians, conditioned by a torrent of false alarms, begin dismissing alerts reflexively, including the true ones. The phenomenon is not anecdotal; meta-analyses of drug-interaction alerts find physicians override roughly nine in ten. The Epic model was simultaneously too quiet on the patients who were sick and too loud on the patients who were not, and both failures are invisible to the single figure that justified the deployment. The structural shape is the lesson: a high reported accuracy figure, an unmeasured harm profile in the deployment population, and no abstention — the model answered on every patient and carried no mechanism to decline when its evidence was thin. Confidently wrong, at scale, with no warning.
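The two halves of that harm profile can be tied together with one line of arithmetic. The sketch below takes the 33 percent sensitivity and the one-in-five alert rate quoted above and asks what share of alerts land on a true case. The prevalence values are hypothetical inputs for illustration, not figures from this brief, and the function name is ours.

```python
def alert_yield(sensitivity: float, alert_fraction: float,
                prevalence: float) -> float:
    """Share of alerts that fall on patients who truly develop the condition."""
    # Fraction of ALL patients who are true cases and also received an alert.
    true_alerts = sensitivity * prevalence
    if not 0.0 < alert_fraction <= 1.0 or true_alerts > alert_fraction:
        raise ValueError("inputs are mutually inconsistent")
    return true_alerts / alert_fraction


SENSITIVITY = 0.33      # from the external validation described above
ALERT_FRACTION = 0.20   # "a fifth of all hospitalized patients"

# Prevalence is NOT given in this brief; these values are assumptions.
for assumed_prevalence in (0.03, 0.05, 0.10):
    share = alert_yield(SENSITIVITY, ALERT_FRACTION, assumed_prevalence)
    print(f"assumed prevalence {assumed_prevalence:.2f} -> "
          f"{share:.1%} of alerts are true cases")
```

At an assumed prevalence of 5 percent, for example, roughly 8 percent of alerts would be true cases, which is the kind of low-yield stream that conditions clinicians to dismiss alerts.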

Section 04

Govern Integrity, Not Accuracy

The corrective is to change the governing variable. Stop asking how often is the model right and start asking how often is it wrong without a warning, and what does each such error cost. The first quantity is accuracy. The second is integrity, harm-weighted — the quantity ISO 14971, DO-178C, and bank supervision all already govern, each under its own name. The KellerAI LLM-Agent Assurance Standard supplies the operational vocabulary for porting that variable into an AI system, and the mapping to clinical decision support is direct.

The first move is to set the escape tolerance — the acceptable rate of unflagged wrong outputs — by harm severity, per tier, rather than to one global number. A contraindication warning carries a different severity than a formatting suggestion, and the tolerable rate of a silent error must scale to that severity the way DO-178C scales rigor to its assurance level. An irreversible, high-consequence recommendation — a dosing instruction, a stop-treatment call — sits at the top of the lattice, where the tolerable rate of confident, unflagged error approaches zero and the default in the face of thin evidence is to abstain and escalate to a clinician.
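A per-tier tolerance can be written down as plain configuration before any model is scored. The sketch below is one minimal way to express it. The tier names, the tolerance values, and the `route` function are hypothetical placeholders and are not taken from the KellerAI standard or from DO-178C.

```python
from enum import IntEnum


class Tier(IntEnum):
    FORMATTING = 1     # e.g. a formatting suggestion
    ADVISORY = 2       # e.g. a contraindication warning
    IRREVERSIBLE = 3   # e.g. a dosing instruction or stop-treatment call


# Tolerable rate of unflagged wrong outputs per tier.
# Placeholder numbers: a real deployment would set these in writing,
# in advance, from its own risk analysis.
ESCAPE_TOLERANCE = {
    Tier.FORMATTING: 1e-2,
    Tier.ADVISORY: 1e-3,
    Tier.IRREVERSIBLE: 1e-5,
}


def route(tier: Tier, escape_rate_upper_bound: float,
          evidence_is_thin: bool) -> str:
    """Decide whether an output may be released or must be escalated."""
    # At the top tier, thin evidence defaults to abstention.
    if tier == Tier.IRREVERSIBLE and evidence_is_thin:
        return "abstain_and_escalate"
    # Otherwise release only if the measured bound clears the tier's tolerance.
    if escape_rate_upper_bound > ESCAPE_TOLERANCE[tier]:
        return "abstain_and_escalate"
    return "release"


print(route(Tier.FORMATTING, 0.005, evidence_is_thin=False))
print(route(Tier.IRREVERSIBLE, 0.005, evidence_is_thin=False))
```

The same measured error bound passes at the lowest tier and fails at the highest, which is the point of setting tolerance by severity rather than globally.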

The second move is to sort each claim by whether it has an exact oracle. Bucket A is the deterministically checkable class, and a great deal of clinical safety lives here while being left to a language model that should never have been asked: dose arithmetic is checkable against the drug's mg-per-kg bounds and the patient's weight; a contraindication or drug-interaction claim is checkable against a curated database; an allergy conflict is a lookup. Where an exact oracle exists the escape rate is driven toward zero by the verifier, not estimated probabilistically — and that these checks work is a measured result: computerized order-entry systems with dose-range, drug-allergy, and interaction screening have cut prescribing errors by margins reported on the order of fifty to eighty percent. Bucket B is the open-world class where no exact oracle exists — a differential diagnosis, a prognosis, a judgment call. There the discipline is to bound a maximum acceptable escape rate, measure it by backtesting on a held-out, adversarially-stressed set with a stated confidence interval, and route above-tolerance or out-of-envelope cases to an independent verifier or to abstention.
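A Bucket A check is ordinary code, not a model call. The sketch below shows a dose-range check against mg-per-kg bounds. The `DoseBounds` type, the function name, and the numeric bounds are hypothetical and are not clinical guidance; a real system would read bounds from a curated formulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DoseBounds:
    min_mg_per_kg: float
    max_mg_per_kg: float


def dose_within_bounds(dose_mg: float, weight_kg: float,
                       bounds: DoseBounds) -> bool:
    """Deterministic check: is the proposed dose inside the per-kg range?"""
    if weight_kg <= 0:
        raise ValueError("patient weight must be positive")
    dose_per_kg = dose_mg / weight_kg
    return bounds.min_mg_per_kg <= dose_per_kg <= bounds.max_mg_per_kg


# Placeholder bounds for an unnamed example drug (NOT real dosing data).
example_bounds = DoseBounds(min_mg_per_kg=10.0, max_mg_per_kg=15.0)

# A model-proposed dose is verified before it reaches the clinician.
print(dose_within_bounds(700.0, 70.0, example_bounds))    # 10 mg/kg
print(dose_within_bounds(1400.0, 70.0, example_bounds))   # 20 mg/kg
```

Because the check is exact, a dose outside the range cannot pass silently, whatever confidence the proposing model reported.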

Govern the rate at which the model is wrong without a warning, weighted by what being wrong costs. The checkable, check deterministically; the unprovable, bound and backtest; the thin, abstain. That is integrity, and accuracy was never it.

The variable that changes everything

The third move is independent, qualified verification, with no self-grading. A model's own confidence is not a check on its own claim; the two share the same blind spots and fail together. The assurance standard expresses these moves as cross-cutting obligations, and four carry the weight for the clinic. OBL-RES-001 (reversibility/rollback) scales the rollback horizon to severity, pushing irreversible actions, such as the medicine administered or the treatment stopped, to the top of the lattice where abstention is the default. OBL-IND-001 (independent qualified verification) requires that a consequential claim pass a verifier the governance layer selects, independent of the model that produced it, before it is acted on: the dose oracle, the interaction database, or the clinician. This is the obligation violated by both cautionary cases, the Epic Sepsis Model and the Watson for Oncology recommender examined in the full paper. OBL-VQ-001 (verifier qualification) keeps a second language model from being mistaken for a real check: two models that share training data share blind spots, so a deterministic checker is the strongest verifier precisely because it cannot inherit the model's errors. OBL-TIER-001 (tier derivation) sets the consequence tier to the maximum over reversibility, scope, and consequence, defaulting to the highest when any axis is unknown. That is the structural inverse of the accuracy view, which assumes a model is fine until proven wrong.
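The tier-derivation rule is simple enough to state as code. The sketch below is one reading of it: the integer scale, the `HIGHEST_TIER` constant, and the function name are assumptions for illustration, not the standard's own definitions.

```python
from typing import Optional

HIGHEST_TIER = 4  # assumed top of the scale; the real scale may differ


def derive_tier(reversibility: Optional[int],
                scope: Optional[int],
                consequence: Optional[int]) -> int:
    """Consequence tier = max over the three axes; unknown means highest."""
    axes = (reversibility, scope, consequence)
    # An unassessed axis is treated as worst case, never as "probably fine".
    if any(axis is None for axis in axes):
        return HIGHEST_TIER
    return max(axes)


print(derive_tier(1, 2, 3))      # the worst axis governs
print(derive_tier(1, None, 1))   # an unknown axis defaults to the top tier
```

The default direction is the design choice that matters: missing information raises the tier instead of lowering it.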

Section 05

The Posture Before Acting

Before deploying a clinical AI, the operator's posture should be set by harm, not by accuracy, and it collapses to four commitments. Budget failures by harm severity. Estimate the severity of each class of error the way ISO 14971 demands, set the tolerable rate of an unflagged wrong output to that severity per tier, and let the irreversible, high-consequence claims carry the tightest budget and the abstention default. Verify the checkable deterministically. Where an exact oracle exists — dose arithmetic, contraindication and interaction lookups, allergy conflicts — use it, drive that escape rate toward zero, and never hand a deterministically checkable safety claim to a model that can only estimate. Abstain when the evidence is thin. Build the system to decline and escalate to a clinician rather than to answer confidently out of distribution — the missing move in both the sepsis model and the oncology recommender that preceded it.

The fourth commitment ties the others together and is the thesis of this brief. Measure integrity, not accuracy. The number that governs a clinical AI is not the fraction of cases it gets right. It is the bounded, harm-weighted rate at which it is wrong without a warning, measured on the distribution it will deploy on, verified by a party independent of the model, and monitored as that distribution moves. An accuracy figure can be high while that rate is lethal, which is precisely what the Epic Sepsis Model's gap between a reported 0.76 and an observed 0.63, with two-thirds of sepsis cases missed in silence, demonstrated at the scale of hundreds of hospitals.
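Bounding the rate means reporting an upper confidence limit from a backtest, not just the observed fraction. The sketch below uses the Wilson score interval, which is our choice of method and not one the brief prescribes. The function names and the example counts are hypothetical.

```python
import math
from statistics import NormalDist


def wilson_upper_bound(escapes: int, trials: int,
                       confidence: float = 0.95) -> float:
    """Upper limit of the two-sided Wilson score interval for a rate."""
    if trials <= 0 or not 0 <= escapes <= trials:
        raise ValueError("need 0 <= escapes <= trials and trials > 0")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = escapes / trials
    denom = 1.0 + z * z / trials
    centre = p_hat + z * z / (2.0 * trials)
    margin = z * math.sqrt(p_hat * (1.0 - p_hat) / trials
                           + z * z / (4.0 * trials * trials))
    return (centre + margin) / denom


def within_tolerance(escapes: int, trials: int, tolerance: float) -> bool:
    """Pass only if the upper bound, not the point estimate, clears the bar."""
    return wilson_upper_bound(escapes, trials) <= tolerance


# Hypothetical backtest: zero unflagged errors observed in 1,000 cases.
print(f"{wilson_upper_bound(0, 1000):.4f}")
print(within_tolerance(0, 1000, tolerance=0.01))
print(within_tolerance(0, 1000, tolerance=0.001))
```

Zero observed escapes in a thousand cases still leaves an upper bound of roughly 0.4 percent, so a clean backtest of that size cannot by itself support a one-in-a-thousand tolerance.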

Medicine is the sharpest test of the whole discipline because it is the domain where being confidently wrong is lethal and irreversible — where the missed cancer cannot be un-missed and the fatal recommendation cannot be recalled. That is exactly why it is the right place to insist on the harder variable. The wrong question is how accurate is the model. The right question is how often is it wrong without a warning, what does each such error cost, who checked it, and could the patient have been harmed before anyone noticed.

This brief is the short version. The in-depth paper, Risk Is Measured in Harm, Not Accuracy, carries the full argument: the sensitivity/specificity/PPV/NPV decomposition, why AUROC hides the harm-bearing tail, the Epic Sepsis and Watson for Oncology failures in detail, the four LAAS obligations mapped to the clinic, the honest limits, and the citations. It is the second article in a three-part stack on clinical AI governance. The first, Intended Use Is the Envelope, establishes the FDA principle that a model's indication is its operating boundary, the scope inside which any integrity claim is even meaningful. The third, The Clinician Is the Diversion Airport, takes up human oversight and post-market surveillance: the reachable fallback and the ongoing monitoring this brief's posture insists on.

