Intended Use Is the Envelope: FDA SaMD Scope and the Bounded Autonomy Indication

KellerAI

Section 01

The Model Is Never the Thing That Is Cleared

The unit of FDA authorization is not the software. It is the indication for use. Every device that reaches market through the agency carries an indications statement, recorded on a standard form — Form FDA 3881 — that names what the device is for, in whom, and under what conditions. 6 The statement is short, deliberately, because it is load-bearing: it is the legal boundary of the authorization, and the law that gives it force defines a device's intended uses as the objective intent of the parties responsible for its labeling, inferred from the claims they make and the circumstances of the device's distribution. 5 That regulatory definition lives at 21 CFR 801.4, and its consequence is exact: what a device is authorized to do is what its labeling claims it does, and nothing else. 5 A use the labeling does not claim is not an authorized use. It is off-label, and off-label is a term of art for outside the envelope the regulator examined.

This matters more for software than for a scalpel, because software invites the opposite intuition. A scalpel does one thing; no one imagines that clearing it to incise tissue also clears it to perform anesthesia. But a trained model is general by construction — the same weights that read a retinal image can be prompted to read a chest film, draft a discharge note, or triage a symptom list — and the generality seduces the operator into reading a clearance as a competence certificate. The FDA's framework for software as a medical device refuses that reading at the root. The International Medical Device Regulators Forum, whose SaMD working group the FDA co-chaired and whose definitions the agency adopted, defines software as a medical device by the medical purpose it serves, not by the code it contains. 1 4 A change of purpose is a change of device. The same binary, pointed at a different decision, is a different regulated article requiring its own evaluation.

The framework goes further and makes the purpose granular. The IMDRF risk-categorization scheme sorts a SaMD into one of four categories by crossing two axes: the significance of the information the software provides to the healthcare decision — does it inform, drive, or treat or diagnose — against the state of the healthcare situation, from non-serious to critical. 2 A system that diagnoses a critical condition sits in the highest category and draws the heaviest scrutiny; the same algorithmic core, repurposed to merely inform a non-serious decision, sits in the lowest. The category is a property of the use, not of the model. Move the model to a different decision and you have moved it to a different category, and the evaluation that licensed it in one does not transfer to the other. The FDA's own clinical-evaluation guidance for SaMD adopts this categorical scheme directly, tying the depth of clinical evidence a system must produce to the IMDRF category its intended use falls in. 22

A clearance does not certify that a model is good. It authorizes one use, on one population, for one decision — and renders every other use unvalidated by construction. The envelope is the authorization; the model is merely what runs inside it.

The inversion

Hold onto the construction, because the rest of the paper turns on it. The regulator did not test the model and approve a generous region around its competence. It tested a use and drew a tight boundary around exactly what it tested. Inside the boundary, there is evidence — a trial, a population, a measured performance. Outside it, there is no evidence, and the absence is not an oversight to be filled in by the operator's judgment. It is the definition of off-label. The honest description of an off-label autonomous action is therefore not "probably fine, lightly tested" but "unvalidated, by the literal construction of the authorization that the action claims to rely on."

Section 02

The Line Between Assistive and Autonomous Is Statutory

Most regulatory regimes that touch AI draw the assistive / autonomous distinction softly, as a matter of risk posture or best practice. Medicine drew it in statute. Section 3060 of the 21st Century Cures Act, enacted in December 2016, amended the Federal Food, Drug, and Cosmetic Act to add section 520(o), which carves certain clinical decision support software out of the definition of a medical device entirely. 13 The carve-out is not arbitrary. Its operative criterion — at FD&C Act section 520(o)(1)(E), elaborated in the FDA's 2022 Clinical Decision Support Software guidance — is whether the software allows the clinician to independently review the basis for its recommendation, so that the clinician, not the software, makes the call. 12 13 Software that surfaces evidence for a human to weigh is not a device. Software that delivers a conclusion the human is expected to act on without being able to reconstruct and overturn its reasoning is a device, and falls under the full weight of FDA oversight.

Read that boundary slowly, because it is the cleanest statement in any regulatory corpus of the difference between an AI that assists and an AI that decides. The dividing line is not the model's accuracy, nor its architecture, nor whether it uses machine learning. It is whether a competent human remains the decision-maker, able to inspect and reject the output. The moment the software's output is the decision — because no human can or does independently re-derive it — the regulatory regime changes character entirely. Assistance is governed lightly or not at all; autonomy is governed as a medical device. The statute encodes the same intuition that runs through every mature assurance regime: the consequence of an error scales with how directly the system's output reaches an irreversible action, and the governance must scale with it. 15

This is the spine that connects medicine to the rest of the autonomy problem. A companion KellerAI paper argues that aviation and banking independently converged on bounding integrity — error delivered without warning — rather than raw accuracy, because a system that is usually right but silently wrong when it matters is not a safe system. 16 Medicine's statutory line is the same principle rendered as law: an assistive system can be silently wrong and a human catches it, so it is governed for accuracy and transparency; an autonomous system that is silently wrong reaches the patient, so it is governed for integrity, with the full device apparatus behind it. The Cures Act did not invent this distinction. It legislated the one that aviation built into its command-monitor architectures and banking built into effective challenge — independently, because the problem forces it.

The practical upshot for anyone deploying an AI agent is direct and uncomfortable. The question that determines your governance burden is not "how good is the model?" but "does a competent human remain the decision-maker?" If the human can and does independently review and override every consequential output, you are building an assistive system, and the lighter regime applies. The instant you remove that human from the loop, whether because throughput demands it or because the model is good enough that review has decayed into rubber-stamping, you have crossed a statutory line that medicine drew explicitly, and you now owe the obligations of an autonomous system whether or not you have acknowledged the crossing.

Section 03

IDx-DR: Autonomy Licensed Precisely Because the Envelope Was Narrow

On April 11, 2018, the FDA permitted marketing of IDx-DR, the first medical device authorized to use artificial intelligence to autonomously detect a condition — to deliver a diagnostic result that does not require a specialist to interpret the underlying image. 7 The authorization came through the De Novo pathway, request number DEN180001, which the agency uses to classify a genuinely novel low-to-moderate-risk device for which no predicate exists. 8 It is the landmark case for autonomous diagnostic AI, and the temptation is to read it as a triumph of the model. The correct reading is the opposite: the autonomy was licensed not because the model was powerful but because the envelope was narrow, and the narrowness is visible in every clause of the indication.

The cleared indication is a single sentence doing an enormous amount of work. IDx-DR is indicated for use by health care providers to automatically detect more than mild diabetic retinopathy (a specific severity threshold, abbreviated mtmDR, not retinopathy in general) in adults diagnosed with diabetes who have not previously been diagnosed with diabetic retinopathy (a precisely bounded population), and it is indicated for use with the Topcon NW400, one named fundus camera. 8 Each of those clauses is a wall of the envelope. One disease, at one severity threshold, in one population, on one camera. The system was not cleared to read retinas. It was cleared to answer a single binary question (is more-than-mild diabetic retinopathy present?) for a specific kind of patient, using a specific instrument, in the primary-care setting where that question is screened. 7
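The walls of the indication can be written down as an explicit membership check. The sketch below is illustrative only: the `Encounter` record, the field names, and the function are hypothetical, and the age cutoff is an assumption (the text above says only "adults"), so it should be confirmed against the cleared indication before being relied on.

```python
from dataclasses import dataclass

CLEARED_CAMERA = "Topcon NW400"
ADULT_MIN_AGE = 22  # assumed cutoff; confirm against the cleared indication


@dataclass(frozen=True)
class Encounter:
    age_years: int
    diagnosed_with_diabetes: bool
    prior_dr_diagnosis: bool
    camera_model: str


def walls_breached(e: Encounter) -> list[str]:
    """List every wall of the indication this encounter falls outside.

    An empty list means the encounter is inside the envelope.
    """
    breached = []
    if e.age_years < ADULT_MIN_AGE:
        breached.append("population: not an adult")
    if not e.diagnosed_with_diabetes:
        breached.append("population: no diabetes diagnosis")
    if e.prior_dr_diagnosis:
        breached.append("population: prior diabetic retinopathy diagnosis")
    if e.camera_model != CLEARED_CAMERA:
        breached.append("instrument: camera not named in the indication")
    return breached
```

Returning the list of breached walls, rather than a bare boolean, keeps the reason for an off-label determination available for the audit trail.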

The evidence that licensed even this narrow autonomy was produced inside exactly those walls. The pivotal trial, published in npj Digital Medicine in 2018, enrolled patients in primary-care clinics — not eye clinics — and ran the system against the bounded indication, reporting a sensitivity of 87.2% and a specificity of 90.7% for the mtmDR decision. 9 Those numbers are a property of the use, not the model: they were measured on that population, that camera, that decision. They say nothing whatever about how the same software would perform reading a retina from a different camera, or screening a child, or detecting a different pathology — and the authorization makes no claim that it would, because it was never tested and never authorized to. The trial result is the evidence inside the envelope; outside the envelope there is no trial, and therefore no clearance.

IDx-DR was permitted to act without a human reading the image not despite its narrow indication but because of it. The walls of the envelope — one disease, one threshold, one population, one camera — are what made the unattended decision defensible. Widen any wall and the evidence that licensed the autonomy no longer reaches.

Why the autonomy was grantable

The most instructive feature of the design is what the system does at the edge of its envelope, because it is the part that makes the autonomy safe rather than reckless. IDx-DR does not always answer. When it detects more-than-mild retinopathy it does not treat or prescribe; it refers the patient to an eye care professional — it hands the consequential decision up to a human specialist. 7 When the image quality is insufficient for it to be confident, it does not guess; it declines to return a screening result and routes the patient to imaging it can stand behind. 7 The off-envelope behavior is defined. The system knows two things it must do when it is at or beyond the boundary of what it was validated for — refer to ophthalmology, or refuse — and both are forms of the same move: abstain from the autonomous action and escalate to a human. The autonomy and the abstention are not in tension. The defined abstention is the precondition of the autonomy.
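The edge behavior described above reduces to a small decision function with a fixed order of checks. This is a sketch of the logic as the paragraph describes it, not the product's actual implementation; the names and the boolean inputs are hypothetical stand-ins for whatever the real system computes.

```python
from enum import Enum


class Outcome(Enum):
    REFUSE_INSUFFICIENT_QUALITY = "no screening result; re-image the patient"
    REFER_TO_EYE_CARE = "mtmDR detected; refer to an eye care professional"
    NEGATIVE = "mtmDR not detected"


def screening_outcome(quality_sufficient: bool, mtmdr_detected: bool) -> Outcome:
    """Map the two upstream signals to one of three defined outcomes."""
    # The quality check comes first: a detection computed on an image the
    # system cannot stand behind is never consulted.
    if not quality_sufficient:
        return Outcome.REFUSE_INSUFFICIENT_QUALITY
    if mtmdr_detected:
        # A positive result does not treat or prescribe; it escalates.
        return Outcome.REFER_TO_EYE_CARE
    return Outcome.NEGATIVE
```

Two of the three outcomes hand the patient to a human or to re-imaging; only the negative branch ends without a handoff, and it is reachable only after the quality check has passed.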

The regulatory machinery cemented the narrowness into permanent form. The De Novo created a new device classification — 21 CFR 886.1100, the "retinal diagnostic software device," Class II, product code PIB — with special controls binding any future device of the same generic type to the same kind of bounded indication and clinical evidence. 10 20 The company, then IDx LLC and later Digital Diagnostics, eventually renamed the product LumineticsCore, but the envelope traveled with it: a rename is not a re-indication, and the cleared use remained the cleared use. 11 The lesson is not that IDx-DR is a good model — it may well be — but that the question of whether it is good was never the question the FDA answered. The FDA answered a narrower and more useful question: is this specific decision, on this specific population, with this specific instrument, safe to make without a human in the loop? And it answered yes only because the envelope was drawn tightly enough that the answer could be evidenced.

Section 04

Off-Label Autonomy Is Unvalidated by Definition

The phrase doing the heaviest lifting in this paper is by definition, and it deserves to be defended rather than asserted. The claim is not that off-label autonomous action is risky, or under-tested, or inadvisable — all of which are true but weaker. The claim is that it is unvalidated by the construction of the authorization itself. An FDA clearance is a statement of the form: this use, on this population, for this decision, supported by this evidence, is authorized. The statement's scope is its indication. An action outside the indication is not an action the statement permits-but-warns-about; it is an action the statement does not address, because the evidence that the statement rests on was never gathered there. 5 8 Validation is a relation between a use and the evidence for that use. Where there is no evidence for the use, there is no validation — not weak validation, no validation.

The distinction is easy to lose because off-label use is a familiar and often legitimate clinical practice: a licensed physician may prescribe an approved drug for an indication the label does not carry, exercising professional judgment and accepting professional liability for the departure. But notice what makes that legitimate — a credentialed human is in the loop, deciding deliberately to act beyond the evidence and owning the consequence. The off-label prescription is an assistive case: the label informs, and the clinician decides. Transpose it to an autonomous agent and the human who would own the departure is gone. An agent acting off-label is not a physician exercising judgment beyond the label; it is the label acting on its own, outside the only evidence that ever authorized it, with no one having decided to make the departure. That is the precise sense in which autonomous off-label action is categorically worse than its clinical namesake: it strips out the deliberate, accountable human whose presence is the entire reason off-label use is tolerated at all. 12

This is why the autonomy of IDx-DR was confinable and the autonomy of a general agent is not, absent a deliberately constructed envelope. IDx-DR could be granted unattended operation because its envelope was so narrow that the evidence could fill it: a single decision is small enough to trial. A general-purpose model has no such envelope by default. It will answer any question, read any image, draft any document, take any action its tools permit — and for the overwhelming majority of those uses there is no trial, no measured performance, no evidence at all. To grant such a model autonomy on a use is therefore to grant autonomy off-label by default, in the exact sense the FDA framework makes precise: outside any indication for which evidence exists. The generality that makes the model useful is the same property that makes its autonomy unvalidated everywhere the operator has not done the work of drawing and evidencing an envelope.

The FDA's 2021 AI/ML SaMD Action Plan can be read as the agency confronting exactly this property in machine-learning devices. 3 The plan's recurring concern is that an adaptive model can drift away from the use it was validated for — its behavior changing as it learns, its inputs shifting as deployment populations differ from trial populations — and that the static, point-in-time clearance does not by itself catch the drift. 3 The deeper reading is that the action plan treats the indication as a living boundary the system must be kept inside, not a one-time gate it passes through. A model that has wandered, through drift or distribution shift, to a place where its validated indication no longer describes what it is actually doing has effectively gone off-label without anyone re-indicating it — and is, again by construction, operating unvalidated. The plan's answer to controlled change is the Predetermined Change Control Plan, the mechanism for pre-agreeing how an envelope may widen with evidence rather than by drift, and it is the subject of the third paper in this stack. 14 18

The practical test for an operator is therefore a single question asked of every autonomous action: is this action inside an indication for which I hold evidence? If yes, the autonomy is defensible to the precise extent of that evidence. If no — if the action is on a population, a decision, or an input distribution the system was not validated against — then the action is off-label, and off-label autonomy is not a calculated risk taken with eyes open. It is an unvalidated action wearing the borrowed authority of a clearance that does not reach it. The discipline is to detect the second case and refuse it, which is exactly what IDx-DR's refer-or-refuse behavior does at its own boundary. 7

Section 05

The AI Mapping: Consequence Tier, Bounded by the Observed Effect Surface

The clinical envelope translates into AI agent governance through a concrete unit, and the translation should be mechanical rather than metaphorical. In the LLM-Agent Assurance Standard, every action an agent proposes is assigned a Consequence Tier on a lattice from CT0 to CT4, computed by an out-of-process gate from the action's observed effect surface — its actual reversibility, scope, and consequence — never from the agent's self-report. 15 The effect surface is the software analogue of the indication: it is the bounded region of real-world consequence the action actually touches, and it is what the governance must be scaled to. A read-only query is CT0; a reversible internal write is low; a hard-to-reverse external effect is high; an irreversible, high-consequence action is CT4, the tier that additionally demands human approval and defaults to abstention. 15 The clinical reading is exact: IDx-DR's autonomous mtmDR screening sits at a tier its bounded, evidenced indication can support, while its refer-to-ophthalmology pathway is the system recognizing a higher-consequence decision and routing it to a human — a tier-appropriate escalation at the envelope edge.

The load-bearing rule is what the gate does when the effect surface is undetermined, and it is the precise mechanism that encodes "off-label is unvalidated." The tier derivation is ct = max(reversibility_rank, scope_rank, consequence_rank) for any action with an external effect, with ct = 4 whenever any axis is undetermined. 15 This is OBL-TIER-001 (tier derivation), and its default-to-highest clause is the whole game: an unknown reversibility is treated as irreversible, an unknown scope as public, an unknown consequence as high. The expected tier defaults to CT4 the moment the system cannot establish where it is. 15 Map that onto the clinical frame and it is the digital form of refer-or-refuse: when the agent cannot confirm that an action falls inside an evidenced envelope, it does not assume the action is benign and proceed. It assumes the strictest tier and escalates. The default is not optimism; the default is the highest consequence, exactly as an off-label use is presumed unvalidated rather than presumed fine.
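The derivation rule stated above is small enough to write out directly. The sketch below assumes each axis is ranked as an integer from 0 to 4 and uses `None` to mean "undetermined"; that encoding and the function name are illustrative choices, not taken from the standard, and the handling of actions with no external effect is left out because the text does not specify it.

```python
from typing import Optional

MAX_TIER = 4


def consequence_tier(
    reversibility_rank: Optional[int],
    scope_rank: Optional[int],
    consequence_rank: Optional[int],
) -> int:
    """Derive the tier for an action with an external effect.

    Any undetermined axis (None) forces the highest tier.
    """
    ranks = (reversibility_rank, scope_rank, consequence_rank)
    if any(r is None for r in ranks):
        return MAX_TIER  # default-to-highest: unknown is treated as worst case
    for r in ranks:
        if not 0 <= r <= MAX_TIER:
            raise ValueError(f"rank out of range 0..{MAX_TIER}: {r}")
    return max(ranks)
```

Because the result is a maximum rather than an average, a single high axis cannot be diluted by two low ones, and a single unknown axis cannot be outvoted by two known ones.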

When the agent cannot place an action inside an evidenced envelope, the gate does not give it the benefit of the doubt. It assigns the strictest tier and demands a human. Default-to-highest is the software form of treating the off-label action as unvalidated — because it is.

The default is the discipline

The second load-bearing obligation governs the input side, and it closes the route by which an envelope is most easily breached without anyone noticing. OBL-INP-001 (untrusted-input tier-raising) requires that any action driven by untrusted or out-of-distribution input (web content, inbound email, third-party data, a population the system was not calibrated on) is gated at the higher tier or blocked, and that the trace records the input's trust status. 15 The clinical analogue is precise. An out-of-distribution input is the software version of a patient or an image outside the validated population: the chest film handed to the retinal model, the pediatric case handed to the adult-validated screener. IDx-DR's insufficient-image-quality refusal is an out-of-distribution detector wired to abstention: the system recognizes an input it cannot stand behind and declines to act autonomously on it. 7 OBL-INP-001 generalizes that reflex. Untrusted or out-of-distribution input raises the tier or blocks the action, and the trace records the trust status so the boundary crossing is auditable rather than silent.
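One way the input-side obligation might look in a gate is sketched below. The specific policy (raise by one tier, and block when the action is already at the top tier) is an assumption made for illustration; the text requires only that such input be gated at a higher tier or blocked and that the trust status be traced. All names are hypothetical.

```python
from dataclasses import dataclass

MAX_TIER = 4


@dataclass(frozen=True)
class GateDecision:
    tier: int
    blocked: bool
    trace: dict


def gate_on_input(base_tier: int, input_trusted: bool, in_distribution: bool) -> GateDecision:
    """Raise the tier, or block, when the driving input is untrusted or OOD."""
    # The trust status is written to the trace on every path, so a boundary
    # crossing is recorded even when the action is allowed to proceed.
    trace = {
        "base_tier": base_tier,
        "input_trusted": input_trusted,
        "in_distribution": in_distribution,
    }
    if input_trusted and in_distribution:
        return GateDecision(tier=base_tier, blocked=False, trace=trace)
    if base_tier >= MAX_TIER:
        # Assumed policy: nothing higher to raise to, so block outright.
        return GateDecision(tier=MAX_TIER, blocked=True, trace=trace)
    return GateDecision(tier=base_tier + 1, blocked=False, trace=trace)
```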

Two more units complete the mapping, and both are concrete. The first is the governance unit itself: not the model, and not even the agent, but the (agent, task-class) pair. An authorization attaches to a pair the way a clearance attaches to a device-plus-indication — this agent, on this class of task, is evidenced to a certain tier; the same agent on a different task-class is a different governance object with its own evidence requirement, exactly as the same model on a different indication is a different regulated device. 1 15 The second is the revocability of the envelope. A clearance is not eternal: drift, a distribution shift, or a post-market signal can pull a use back outside the evidence that supported it, and the SaMD Action Plan's whole concern is keeping the system inside its indication over time rather than at a single moment. 3 The autonomy envelope inherits the same property — it is granted on evidence, held only while the evidence holds, and revoked when the measured behavior drifts outside it. The envelope is provisional by design, which is the through-line connecting this paper to the reliability-accounting argument that autonomy is a budget earned over accumulated experience and contracted on drift. 17
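The two units described above, the (agent, task-class) pair and the revocable grant, can be sketched as a small registry. This is an illustrative data structure under stated assumptions: the class names, the string identifiers, and the rule that an absent or revoked grant permits nothing are choices made here, not details drawn from the standard.

```python
from dataclasses import dataclass


@dataclass
class Grant:
    max_tier: int          # highest tier the evidence supports for this pair
    revoked: bool = False


class AuthorizationRegistry:
    """Authorizations keyed by (agent, task_class), never by agent alone."""

    def __init__(self) -> None:
        self._grants: dict[tuple[str, str], Grant] = {}

    def grant(self, agent: str, task_class: str, max_tier: int) -> None:
        self._grants[(agent, task_class)] = Grant(max_tier=max_tier)

    def revoke(self, agent: str, task_class: str) -> None:
        # Called when drift or a post-deployment signal undercuts the evidence.
        if (agent, task_class) in self._grants:
            self._grants[(agent, task_class)].revoked = True

    def permits(self, agent: str, task_class: str, tier: int) -> bool:
        g = self._grants.get((agent, task_class))
        # No grant for this exact pair means no authorization at all.
        return g is not None and not g.revoked and tier <= g.max_tier
```

Keying on the pair means an agent evidenced on one task class inherits nothing on another, which mirrors the way the same model on a different indication is a different regulated device.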

Section 06

The CT Lattice as a Clinical Risk Category

The correspondence between the LAAS Consequence-Tier lattice and the IMDRF risk-categorization scheme is close enough to be worth drawing field by field, because each was built by people solving the same problem in ignorance of the other. The IMDRF scheme grades a SaMD by crossing the significance of its output to the decision — inform, drive, or diagnose-and-treat — against the seriousness of the healthcare situation, yielding four categories of ascending scrutiny. 2 The CT lattice grades an action by crossing reversibility, scope, and consequence, yielding five tiers of ascending rigor. 15 Both refuse to let the grade be a property of the artifact. The IMDRF category is a property of the intended use; the CT is a property of the observed effect surface. Move the software to a more consequential decision and both schemes re-grade it upward — the IMDRF category by the significance-times-seriousness cross, the CT by the max over its three axes.

The clinical reading sharpens what the CT tiers mean in practice. A system whose output merely informs a clinician who independently re-derives the decision is, in Cures Act terms, potentially not a device at all — the assistive case, the lighter regime. 13 A system whose output drives clinical management, that a clinician acts on without fully re-deriving, climbs the IMDRF categories and, in the CT frame, climbs the tiers, because its errors now reach the patient more directly. A system that diagnoses or treats a serious or critical condition autonomously sits at the top of both schemes — the highest IMDRF category, the CT4 tier that demands human approval and defaults to abstention. 2 15 IDx-DR's genius, in this light, was to take a potentially high-category task — diagnosis of a condition that can blind — and confine it to a bounded screening decision with a defined escalation, so that the autonomous portion sat at a tier its evidence could support and the high-consequence portion was routed to a human. 7 The CT lattice gives an operator the same lever: confine the autonomous action to the tier the evidence supports, and escalate the rest.

The deeper point is that the lattice is not a metaphor borrowed from medicine but the same object medicine already uses, applied to the action rather than the device. Both schemes encode the one invariant that every serious assurance regime shares: rigor scales with consequence, the grade is a property of the use, and the boundary of the evidence is the boundary of the authorization. 15 A bank examiner, an FDA reviewer, and an aviation certification authority would each recognize the others' instrument, because the instrument is forced by the problem — bounding the rate at which an undetected error reaches a consequential action — and the problem does not care which vocabulary you reach for. 16

Section 07

One Stack, Three Walls of the Envelope

This paper is the first of three that together build the clinical governance stack, and the three are best read as three walls of the same envelope. This paper establishes the envelope's scope wall: the indication is the boundary, and off-label autonomy is unvalidated by construction. It answers the question "what is the system authorized to do?", and its discipline is to define the indication, detect departure from it, and refuse outside it.

The second paper, "Risk Is Measured in Harm, Not Accuracy," builds the envelope's tolerance wall. 18 A bounded indication still permits some error; the question is how much, and the medical answer is that the tolerance is set by harm, not by accuracy. ISO 14971, the international standard for the application of risk management to medical devices, scales acceptable residual risk to the severity of the harm a failure causes, not to the frequency of failure in the aggregate. 21 A 99%-accurate system whose 1% error rate consists of catastrophic failures is not safer than a 95%-accurate system whose errors are all recoverable. Where this paper says the envelope must be narrow, the second says the tolerance inside it must be harm-scaled. It is the same correction banking made when it moved from Value-at-Risk to Expected Shortfall to price the tail rather than the mean. 17
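The 99%-versus-95% comparison can be made concrete with a small piece of arithmetic. The severity weights below are invented purely for illustration, and multiplying error rate by a harm weight is a simplification chosen here, not a method prescribed by ISO 14971.

```python
def expected_harm_per_decision(error_rate: float, harm_per_error: float) -> float:
    """Error rate times an assumed harm weight, in arbitrary harm units."""
    return error_rate * harm_per_error


# Hypothetical weights: a catastrophic error is scored 1000, a recoverable one 1.
system_a = expected_harm_per_decision(error_rate=0.01, harm_per_error=1000.0)  # 99% accurate
system_b = expected_harm_per_decision(error_rate=0.05, harm_per_error=1.0)     # 95% accurate

# The more accurate system carries far more expected harm under these weights.
more_accurate_is_more_harmful = system_a > system_b
```

Under these assumed weights the more accurate system carries roughly two hundred times the expected harm per decision, which is the sense in which accuracy alone cannot set the tolerance.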

The third paper, "The Clinician Is the Diversion Airport," builds the envelope's oversight wall. 19 An envelope is only as good as what happens at its edge and after its grant: the human a system escalates to is the reachable safe-harbour, the post-market surveillance regime is how the envelope is kept honest over time, and the Predetermined Change Control Plan (the FDA's December 2024 mechanism for pre-agreeing how an AI-enabled device's envelope may widen with evidence rather than by drift) is how the envelope is allowed to grow without the growth ever escaping governance. 14 IDx-DR's refer-to-ophthalmology pathway is that third wall in miniature: the clinician is the diversion airport the autonomous system flies to when it reaches the edge of what it was validated to do alone.

The three walls also connect outward to the ETOPS stack that runs through KellerAI's reliability writing, because the autonomy-envelope logic is the same logic an extended-range operation lives by. An aircraft earns the right to fly far from a runway by demonstrating a bounded failure rate, holds that right only while the rate holds, and must always have a reachable diversion airport at the edge of its range. 17 An autonomous clinical AI earns the right to act without a human by demonstrating bounded performance inside a narrow indication, holds that right only while the performance and the population hold, and must always have a reachable human at the edge of its envelope. 7 Earned range, a bounded envelope, and an always-reachable runway are one idea wearing three uniforms — and medicine, where the consequence of getting it wrong is a harmed patient, is the domain that states the idea most starkly.

Section 08

The Operator Posture: Define, Detect, Refuse

The discipline this paper asks of an operator is a posture adopted before the agent acts, not a review conducted after it has, and it has three moves. The first is to define the indication. Before granting an agent any autonomous authority on a task class, write down the envelope: the population of inputs it is validated for, the decisions it is permitted to make unattended, the tier each of those decisions sits at, and the evidence that supports the grant. This is the analogue of the indications statement on Form FDA 3881 — short, explicit, and load-bearing. 6 An agent with no written indication is not a broadly-capable agent; it is an agent operating off-label everywhere, because there is no envelope for any of its actions to be inside.

The second move is to detect off-label and out-of-distribution. The envelope is worthless if the system cannot tell when it is leaving it. The gate must compute, per action, whether the action falls inside an evidenced indication — and when it cannot establish that, OBL-TIER-001 requires it to default to the strictest tier rather than assume the action benign. 15 Untrusted or out-of-distribution input must raise the tier or block the action under OBL-INP-001, with the input's trust status written to the trace so the boundary crossing is auditable. 15 This is IDx-DR's insufficient-image-quality detector generalized: a system that cannot recognize an input it was not validated for will act confidently on exactly the inputs where it is most likely to be silently wrong. 7

The third move is to refuse outside the envelope. Detection without refusal is theater. When an action falls outside the evidenced indication, the system must do what IDx-DR does at its boundary — refer to a human or decline to act — rather than proceed on the borrowed authority of a clearance that does not reach the action. 7 The refusal is not a degradation of the autonomy; it is the condition that makes the autonomy grantable in the first place. IDx-DR was permitted to act without a human because its off-envelope behavior was defined: the regulator could license the narrow autonomous decision precisely because the system was built to refer-or-refuse everywhere else. 7 8

Define the indication; detect off-label and out-of-distribution; refuse outside it. The envelope is the deliverable, not the model. A clearance was never a certificate that the model is good — it was an authorization for one bounded use, and the work is in drawing the boundary and policing it.

The engineering posture before acting

The reframe that opened this paper is also its close. A model is never cleared; a use is. The instinct to read an FDA authorization as a competence certificate for the software is the same instinct that reads an impressive demo as a license for autonomous deployment, and it fails for the same reason: it mistakes a property of one bounded use for a property of the general system. Medicine drew the line cleanly because the cost of blurring it is a harmed patient — confidently, irreversibly wrong. The first autonomous diagnostic AI was licensed not because its model was trusted but because its envelope was narrow and its exits were defined, and that is the entire transferable lesson. 7 Build the envelope. Evidence it. Detect departure from it. Refuse outside it. The clearance is an envelope, not a capability grant — and the envelope is the thing you actually have to build.

Companion brief: "Intended Use Is the Envelope." This is Article 1 of a three-article clinical governance stack, read with "Risk Is Measured in Harm, Not Accuracy" and "The Clinician Is the Diversion Airport"; the consequence-tier machinery is specified in "The LLM-Agent Assurance Standard."

