Observable by Design, or Liable by Default: What the EU AI Act Actually Requires of Your Audit Trail

KellerAI

Section 01

Abstract

The European Union's Artificial Intelligence Act — Regulation (EU) 2024/1689 — reaches a hard date on 2 August 2026. That date is widely misreported as “full enforcement.” It is not. Under the Digital Omnibus simplification package, on which the co-legislators reached a provisional agreement on 7 May 2026, the heaviest obligations for the high-risk systems listed in Annex III are deferred to 2 December 2027, and the Article 50(2) machine-readable marking duty moves to 2 December 2026. ¹ What activates on 2 August 2026 is narrower but real: the Article 50(1) transparency duties stay live, the penalty regime attaches, and the national market-surveillance authorities that enforce the Act are due to be operational. ²

The central claim of this paper is that the Act's mandatory audit artifacts (the Annex IV technical file, Article 72 post-market monitoring, the Article 50 transparency records) are the same observable-by-design practices that the KellerAI corpus has argued since 2025, in papers like Observability Theater and The Audit You Can Audit, are simply correct engineering. The Act does not introduce a new discipline. It puts a deadline, and a penalty, on the discipline that engineering teams were already supposed to have.

The numbers say most teams do not have it. One enterprise readiness survey put the share of organizations unprepared for the Act at roughly 78 percent; analyst estimates of organizations with mature AI governance sit near 12 percent, and only about a quarter report having begun concrete compliance activities. ³⁴ The compliance gap is, almost entirely, a documentation gap: 78 percent with no system inventory, 74 percent with no accountable owner, 61 percent with no process for producing the Annex IV file. ³ Those are not legal failings. They are the absence of engineering evidence — the exact absence the corpus has been naming.

This paper maps the Act's obligations to the artifacts an observable system emits, section by section, and is explicit about where the evidence base is thin: the corpus is single-platform, the Digital Omnibus is not yet formally adopted and published in the Official Journal, Colorado's successor statute is unsigned, and several harmonized standards are still in draft.

A permanently-empty audit field passes schema validation and fails regulatory review. These are the same problem.

The thesis

Section 02

What Actually Applies on 2 August 2026

Precision about the date matters, because the loose phrase “the AI Act goes live in August” invites panic and complacency in equal measure, and both are wrong. The Act entered into force on 1 August 2024 with a staged application schedule. The prohibited-practices and AI-literacy provisions applied from 2 February 2025. The general-purpose AI model obligations applied from 2 August 2025. And 2 August 2026 is the next major milestone: the point at which the bulk of the Act becomes applicable, the governance and penalty architecture attaches, and Member States' market-surveillance authorities are required to be designated and operational. ²

The Digital Omnibus rearranges the back half of that schedule. The co-legislators reached a provisional political agreement on 7 May 2026 to simplify and stagger the heaviest obligations. Under that agreement, the obligations attaching to the high-risk systems enumerated in Annex III are pushed from 2 August 2026 to 2 December 2027, and the Article 50(2) requirement that providers of generative systems mark synthetic output in a machine-readable form moves to 2 December 2026. ¹ Crucially, two transparency duties were not deferred: the Article 50(1) duty that a person interacting with an AI system be informed they are doing so, and the Article 50(4) duty that deepfakes and synthetic media be disclosed. Both remain due on 2 August 2026. ⁵

The Commission has been building the operational scaffolding for these transparency duties in parallel. A draft of the Article 50 guidelines was published on 8 May 2026, with the public consultation closing on 3 June 2026. ⁶ The guidelines are the interpretive layer that tells a provider what, concretely, a compliant disclosure looks like — and the fact that they were still in consultation weeks before the application date is itself a signal that organizations cannot wait for perfect clarity before starting the underlying engineering work.

2.1 The compliance gap is a documentation gap

The readiness data is uncomfortable and worth reading literally. Across surveys, roughly 78 percent of organizations describe themselves as unprepared for the Act's requirements. ³ Only about 26 percent report having started concrete compliance activities, and only about 12 percent are assessed as having mature AI governance in place. ⁴ Treat these as survey figures with the usual caveats — different samples, different definitions of “mature” — but the direction is consistent across independent sources.

What is striking is the shape of the unpreparedness. The same readiness report breaks the gap down into its components: 78 percent of organizations have no inventory of the AI systems they operate, 74 percent have assigned no accountable owner for AI compliance, and 61 percent have no defined process for producing the Annex IV technical documentation a high-risk system requires. ³ None of these is a legal sophistication problem. You do not need a regulatory specialist to know what systems you run, who owns them, or how you would document one. The gap is operational: the artifacts do not exist because the engineering practices that would produce them were never adopted.
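Once an inventory exists, the three gaps above can be checked mechanically. The following is a minimal sketch, not a prescribed format: the `SystemRecord` type and its `system_id`, `owner`, and `annex_iv_process` fields are hypothetical names chosen for illustration and come from neither the Act nor the readiness report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SystemRecord:
    """One row of a hypothetical AI-system inventory (field names are illustrative)."""
    system_id: str
    owner: Optional[str] = None             # accountable owner, if one is assigned
    annex_iv_process: Optional[str] = None  # pointer to the documentation process, if defined


def readiness_gaps(inventory: list[SystemRecord]) -> dict[str, list[str]]:
    """Return the ids of inventoried systems missing an owner or a documentation process."""
    return {
        "no_owner": [r.system_id for r in inventory if not r.owner],
        "no_annex_iv_process": [r.system_id for r in inventory if not r.annex_iv_process],
    }


# Illustrative data only; these system names are made up.
inventory = [
    SystemRecord("credit-scoring-v3", owner="risk-eng", annex_iv_process="docs/annex-iv.md"),
    SystemRecord("support-chatbot"),
]
gaps = readiness_gaps(inventory)
print(gaps)
```

The check is only as good as the inventory it runs over: a system that was never listed cannot appear in either gap, which is why the inventory is the first artifact to build.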


Section 03

The Artifact Stack

The Act's documentation requirements are not a single form. They are a stack of artifacts, each governed by its own article, each describing a different facet of the same underlying demand: show, with durable evidence, that you understand and control the system you deployed. Four pieces of the stack carry most of the weight for an engineering team.

The Annex IV technical file (Article 11). Article 11 requires that, before a high-risk system is placed on the market, its provider draw up technical documentation conforming to Annex IV and keep it current. Annex IV is a nine-part structure. ⁷⁸ Read by an engineer rather than a lawyer, several of its parts are familiar. Annex IV §2 demands a description of the development process: the methods, the design choices, the data, the validation procedures. That is a re-runnable evidence chain, the eval trail you can reproduce and the discipline argued for in The Audit You Can Audit. ²⁰ Annex IV §3 demands documentation of monitoring, functioning, and control, including accuracy metrics and the human-oversight measures built into the system. That is a behavioral fingerprint plus a record of what a human actually reviewed. Annex IV §6 demands a description of changes made through the system's lifecycle, including pre-determined changes. A model swap is a change. And Annex IV §9 demands the post-market monitoring plan. ⁸

Post-market monitoring (Article 72). Article 72 requires providers of high-risk systems to establish and document a post-market monitoring system, proportionate to the nature and risks of the system, that actively and systematically collects, documents, and analyzes relevant data on the system's performance throughout its lifetime. The monitoring is to be governed by a plan that forms part of the technical documentation, and it feeds the serious-incident reporting duty under Article 73. ⁹ The word that matters is “throughout.” This is not a point-in-time conformity check filed once and forgotten; it is a continuous obligation that ages with the system.

The fundamental-rights impact assessment (Article 27). For certain deployers (public bodies, private entities providing public services, and deployers of systems used for credit scoring or for life and health insurance pricing) Article 27 requires a fundamental-rights impact assessment before putting a high-risk system into use, and an update when any of the assessed elements changes or is no longer up to date. ¹⁰ The FRIA is a pre-deployment rights document, and it too must be kept current as the system evolves.

Substantial modification (Article 25). Article 25 governs the most consequential change an operator can make. A party who puts its name on a high-risk system, makes a substantial modification to one already on the market while keeping it high-risk, or modifies the intended purpose of a non-high-risk system such that it becomes high-risk, assumes the obligations of a provider. ¹¹ The legal weight here is heavy: a substantial modification can transfer the full provider burden to the party who made it. Whether a given model upgrade rises to a substantial modification is a fact-specific question the Act does not answer with a bright line — which is precisely why the evidence trail that characterizes the change is load-bearing.
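Of the four artifacts above, the continuous character of Article 72 is the easiest to illustrate in code. The sketch below assumes a hypothetical stream of `(day, correct)` outcomes and a documented `baseline` and `tolerance`; the Act prescribes none of these names or thresholds, and a real monitoring plan would track more than one metric.

```python
from collections import defaultdict
from datetime import date


def periods_below_baseline(outcomes, baseline, tolerance):
    """Group (day, correct) outcomes by day and return the days whose accuracy
    falls more than `tolerance` below the documented `baseline`."""
    totals = defaultdict(lambda: [0, 0])  # day -> [correct count, total count]
    for day, correct in outcomes:
        totals[day][0] += int(correct)
        totals[day][1] += 1
    breaches = {}
    for day, (n_correct, n_total) in sorted(totals.items()):
        accuracy = n_correct / n_total
        if accuracy < baseline - tolerance:
            breaches[day] = accuracy
    return breaches


# Illustrative outcomes: 3 of 4 correct on the first day, 1 of 4 on the second.
outcomes = [
    (date(2026, 8, 3), True), (date(2026, 8, 3), True),
    (date(2026, 8, 3), True), (date(2026, 8, 3), False),
    (date(2026, 8, 4), True), (date(2026, 8, 4), False),
    (date(2026, 8, 4), False), (date(2026, 8, 4), False),
]
breaches = periods_below_baseline(outcomes, baseline=0.80, tolerance=0.10)
print(breaches)
```

The point of running this per period rather than once is the one the article makes with “throughout”: a system that passed on the first day can still breach its documented expectation on the second.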

Show, with durable evidence, that you understand and control the system you deployed. Every article in the stack is a different facet of that one demand.

The frame

Section 04

The Evidence Gap

Here is where the corpus and the Act meet on a single concrete object. The Observability Theater paper documents a production AI system whose decision-log schema includes an array named obligations_referenced — the field meant to record which governing rules, policies, or duties were consulted when the system acted. In every observed log line, that array is empty. The field is present, the value is [], and the structure passes every schema check because an empty array is a structurally valid array. ¹²

Read that field against Article 72 and the equivalence is exact. Article 72 requires a monitoring system that actively and systematically collects and analyzes performance data across the system's lifetime. ⁹ The obligations_referenced array is exactly the kind of record such a system is supposed to populate. Structurally it is present; substantively it is empty. The monitoring framework runs, the dashboards are green, the schema validates — and there is nothing in the field for an auditor to read. A post-market monitoring system that emits structurally-valid, substantively-empty telemetry does not satisfy Article 72; it performs the appearance of satisfying it.
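The difference between structurally present and substantively empty can be made concrete. The sketch below is an illustration under stated assumptions, not the platform's actual validator: only the `obligations_referenced` field name comes from the corpus, while the `decision_id` key, the `policy-7.2` value, and all function names are hypothetical.

```python
def is_structurally_valid(record: dict) -> bool:
    """Schema-style check: the field exists and is a list of strings."""
    value = record.get("obligations_referenced")
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def is_substantively_populated(record: dict) -> bool:
    """Content check: the field is valid and names at least one obligation."""
    return is_structurally_valid(record) and any(
        v.strip() for v in record["obligations_referenced"]
    )


def empty_rate(records: list[dict]) -> float:
    """Share of structurally valid records whose field carries no content."""
    valid = [r for r in records if is_structurally_valid(r)]
    if not valid:
        return 0.0
    empty = [r for r in valid if not is_substantively_populated(r)]
    return len(empty) / len(valid)


# Illustrative log lines; the identifiers are made up.
logs = [
    {"decision_id": "d1", "obligations_referenced": []},
    {"decision_id": "d2", "obligations_referenced": []},
    {"decision_id": "d3", "obligations_referenced": ["policy-7.2"]},
]
print(all(is_structurally_valid(r) for r in logs))  # every line passes the schema check
print(empty_rate(logs))                             # yet most lines carry no content
```

A schema check alone would report all three lines as healthy. Tracking the empty rate alongside it is one way to surface a field that validates but says nothing.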

The logging-quality bar is not novel either. NIST's guide to security log management, SP 800-92, sets the standard at logs carrying “sufficient detail for after-the-fact investigation.” ¹³ An always-empty array does not meet that bar: there is no detail to investigate. And the same gap reappears in Annex IV §3, which demands documentation of the human-oversight measures and of how the system is monitored. ⁸ A field intended to capture which obligations a human or the system weighed, left permanently empty, is the absence of exactly the oversight evidence Annex IV §3 requires.

The deeper point is that none of these is a logging-infrastructure failure. The infrastructure works. The schema is well-formed. The framework emits. What fails is the meaning: the emitted record carries no information an auditor — or, for that matter, an incident responder — can use. This is the precise sense in which the Act's requirement and the engineering requirement are the same requirement.

The Act does not require a logging framework. It requires what the logging framework emits to mean something.

The convergence

Section 05

The Model-Upgrade Test: Where Article 25 and Article 72 Converge

The cleanest demonstration that the Act describes ordinary engineering evidence is the model upgrade. A vendor ships a new model version; an operator points its production system at the new model id. To the operator it can feel like a configuration change. Under the Act it is potentially two things at once: a candidate substantial modification under Article 25, and a performance event the post-market monitoring system under Article 72 must capture.

The corpus has argued at length that a vendor model upgrade is a controlled change, not a drop-in — that even a release described by its maker as “a modest but tangible improvement” can change behavior across every system that runs on it, and that the discipline which survives an auditor is one that produces the operator's own before-and-after evidence rather than adopting the vendor's benchmark numbers. ¹⁴ Annex IV §6, which requires a description of the changes made through the system's lifecycle including pre-determined changes, is the documentary home for exactly that evidence. ⁸ A pre-determined-change description that anticipates model updates, plus an eval trail showing what was tested when the update landed, is what lets an operator characterize a swap well enough to make the Article 25 determination at all.
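A change record of the kind described above can be small. The following is a sketch under assumptions, not a format taken from Annex IV: the `ModelChangeRecord` type, its field names, and the example identifiers are all hypothetical. The digest is included only to show one way a record can be made tamper-evident.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelChangeRecord:
    """Hypothetical change-log entry for a model swap (field names are illustrative)."""
    system_id: str
    old_model_id: str
    new_model_id: str
    changed_on: str                     # ISO 8601 date
    predetermined: bool                 # was this kind of change anticipated in the file?
    eval_run_ids: tuple[str, ...] = ()  # links to the before/after eval runs

    def has_evidence(self) -> bool:
        """A swap with no linked eval runs cannot be characterized later."""
        return len(self.eval_run_ids) > 0

    def digest(self) -> str:
        """Stable SHA-256 over the record so later edits are detectable."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative values only.
record = ModelChangeRecord(
    system_id="support-chatbot",
    old_model_id="vendor-model-a",
    new_model_id="vendor-model-b",
    changed_on="2026-08-10",
    predetermined=True,
    eval_run_ids=("eval-0412", "eval-0413"),
)
print(record.has_evidence(), record.digest()[:12])
```

The `has_evidence` check is the part that matters for the argument: a record that names the swap but links to no evaluation runs documents that a change happened without documenting what changed.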

That determination has no bright line, and this paper does not pretend to draw one. Whether a particular model swap is a “substantial modification” that transfers provider obligations is a fact-specific legal question. ¹¹ But the question is unanswerable without evidence. An operator who swapped the model with no before-and-after eval trail cannot say whether the change was substantial, because it has no record of what changed. An operator who ran a blinded comparison on its own task distribution can at least characterize the delta — and that characterization is the input to the legal call.
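What “characterize the delta” means can be shown with a paired comparison over the operator's own tasks. This is a minimal sketch: the function name, the task ids, and the pass/fail representation are assumptions, and it takes as given that grading was blinded upstream. It describes the change; it does not decide whether the change is substantial.

```python
def characterize_delta(before: dict[str, bool], after: dict[str, bool]) -> dict:
    """Compare pass/fail results for the same task ids under two model versions."""
    shared = sorted(before.keys() & after.keys())
    if not shared:
        raise ValueError("no shared task ids; the two runs are not comparable")
    regressions = [t for t in shared if before[t] and not after[t]]
    improvements = [t for t in shared if not before[t] and after[t]]
    return {
        "n_tasks": len(shared),
        "pass_rate_before": sum(before[t] for t in shared) / len(shared),
        "pass_rate_after": sum(after[t] for t in shared) / len(shared),
        "regressions": regressions,
        "improvements": improvements,
    }


# Illustrative results on four hypothetical tasks.
before = {"t1": True, "t2": True, "t3": False, "t4": True}
after = {"t1": True, "t2": False, "t3": True, "t4": True}
delta = characterize_delta(before, after)
print(delta)
```

In this example the aggregate pass rate is the same before and after, yet one task regressed and another improved. A headline benchmark number would hide that; the per-task record is what lets an adviser see what actually moved.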

The convergence with older model-risk regimes is worth naming because it shows the requirement is not an EU novelty. SR 11-7, the U.S. Federal Reserve and OCC guidance on model risk management, requires not a one-time validation but ongoing monitoring — a continuous discipline of observing model behavior against documented expectations. ¹⁵ Article 72's “throughout their lifetime” and SR 11-7's “ongoing monitoring” are the same requirement expressed in two regulatory vocabularies. A team that already satisfies SR 11-7 for a banking model is most of the way to satisfying Article 72 for the same system — not because it studied the Act, but because the underlying engineering discipline is identical.

Section 06

Annex III Scope and the Direction of Travel

A reasonable first question is whether any of this applies to you, and the honest answer is that many operators do not yet know. Annex III enumerates eight categories of high-risk use: biometrics; critical infrastructure; education and vocational training; employment and worker management; access to essential private and public services, including creditworthiness assessment; law enforcement; migration, asylum, and border control; and the administration of justice and democratic processes. ¹⁶ The Commission published draft guidelines on the Article 6 classification rules on 19 May 2026 to help operators determine whether their system falls in scope. ¹⁷ The need for those guidelines is itself telling: one analyst estimate found that around 43 percent of mid-size EU firms did not know whether their AI systems would be classified as high-risk. ⁴

The registration mechanics matter for the audit-trail argument. Article 6(3) preserves a route for systems that fall in an Annex III category but do not pose a significant risk, and the self-assessment that supports such a determination is recorded in the EU database under Annex VIII. A self-assessment that a system is not high-risk is itself an artifact an authority can later inspect — another place where the obligation reduces to durable, reviewable documentation.

Beyond the EU, the same demand is appearing in other jurisdictions, though unevenly, and it should be read as direction of travel rather than a symmetrical mirror. In the United States, Colorado's Senate Bill 24-205 — the first comprehensive U.S. state AI statute — had its effective date stayed on 27 April 2026, and the legislature passed a narrower successor, Senate Bill 189, on 7–9 May 2026. As of this writing SB 189 awaits the governor's signature, carries a January 2027 effective date, and has been trimmed largely to notice and transparency obligations. ¹⁸ California's SB 1120, in force since January 2025, requires that a licensed physician make the final determination on health-insurance coverage decisions rather than an algorithm. ¹⁹

The U.S. examples are deliberately framed as illustrative. Their scope differs from the EU regime, several are unsettled, and one is unsigned. The point is not that the obligations are identical across borders; it is that the underlying evidence demand recurs. Show the system disclosed itself. Show what a human reviewed. Show what changed and what you tested when it did. Every one of those is an engineering artifact, and the same artifact answers the question regardless of which statute is asking.

The obligations differ across borders. The evidence demand does not. The same artifact answers the question regardless of which statute is asking.

Section 07

The Mapping Table

This is the paper's core contribution: a row-by-row mapping from each obligation, to the evidence artifact it demands, to the place the KellerAI corpus already named the gap that artifact fills. Read the right-hand column on its own and it reads as a description of good engineering. That is the argument in one table.

| Obligation | What the clause demands | The engineering artifact |
|---|---|---|
| Annex IV §2 (Art. 11) | Description of the development process: methods, design, validation. | A re-runnable evidence chain: the eval trail an operator can reproduce, not the vendor benchmark it adopted. (the-audit-you-can-audit) |
| Annex IV §3 | Monitoring, accuracy, and human-oversight measures. | A behavioral fingerprint plus oversight logs that record what a human actually reviewed. The empty obligations_referenced field is this gap. (observability-theater) |
| Annex IV §6 | Changes through the lifecycle, including pre-determined changes. | A model-swap change record: a vendor upgrade is a change to the system, and the change needs documented before/after evidence. (what-changes) |
| Article 72 | Post-market monitoring, continuous across the system's lifetime. | Continuous, semantically meaningful logs: the non-empty version of obligations_referenced. (observability-theater) |
| Article 25 | Substantial modification: when a change transfers provider duties. | Delta evidence: the blinded before/after eval trail that lets an operator characterize whether a swap is substantial. (what-changes) |
| Article 50(1) | Transparency: inform the person they are interacting with AI. | An interaction-flag record: proof the system disclosed itself, emitted and retained rather than assumed. (engineering gap) |
| Article 27 | Fundamental-rights impact assessment for in-scope deployers. | A pre-deployment rights document, updated on substantial modification; another durable, reviewable artifact. (engineering gap) |

The table is not a compliance checklist to be filled in after the system ships. It is a claim about what the system should have been emitting all along. An operator who built the eval trail, the meaningful telemetry, the change record, and the disclosure log because those are correct engineering practices arrives at the Annex IV file as a formatting exercise over evidence it already has. An operator who did not arrives at it as net-new work against a deadline.
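The Article 50(1) row is the one with no corpus precedent, so a sketch of what “emitted and retained rather than assumed” could look like may help. Everything here is hypothetical: the event types `ai_disclosure_shown` and `model_reply`, the `seq` ordering key, and the session ids. What counts as a compliant disclosure is a matter for Article 50 and its guidelines; the code only checks that a disclosure record exists before the first reply.

```python
def sessions_missing_disclosure(events: list[dict]) -> list[str]:
    """Return session ids where a model reply was logged before any disclosure."""
    disclosed: set[str] = set()
    missing: list[str] = []
    for event in sorted(events, key=lambda e: e["seq"]):
        sid = event["session_id"]
        if event["type"] == "ai_disclosure_shown":
            disclosed.add(sid)
        elif event["type"] == "model_reply" and sid not in disclosed and sid not in missing:
            missing.append(sid)
    return missing


# Illustrative event log: s1 discloses first, s2 never discloses, s3 discloses too late.
events = [
    {"seq": 1, "session_id": "s1", "type": "ai_disclosure_shown"},
    {"seq": 2, "session_id": "s1", "type": "model_reply"},
    {"seq": 3, "session_id": "s2", "type": "model_reply"},
    {"seq": 4, "session_id": "s3", "type": "model_reply"},
    {"seq": 5, "session_id": "s3", "type": "ai_disclosure_shown"},
]
missing = sessions_missing_disclosure(events)
print(missing)
```

A check like this can only run if the disclosure is logged as its own event. A banner that renders in the interface but leaves no record gives an auditor nothing to inspect.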

Annex IV is not a compliance template. It is a description of what good engineering evidence looks like — written by lawyers, with penalties attached.

The core claim

Section 08

Honest Limits

A paper arguing that compliance is engineering correctness owes the reader the same evidentiary honesty it demands of an audit trail. Here is what this argument cannot claim.

The regulatory dates are a moving target, and several rest on a provisional agreement rather than adopted law. The Digital Omnibus deferrals (Annex III high-risk obligations to 2 December 2027, Article 50(2) machine-readable marking to 2 December 2026) come from the provisional co-legislator agreement of 7 May 2026, which had not been formally adopted and published in the Official Journal as of 30 May 2026. ¹ Treat every deferred deadline in this paper as a planning baseline that could shift if the final adopted text differs from the provisional agreement. The 2 August 2026 application date and the Article 50(1) transparency duties are the firm anchors; the deferrals are the provisional layer.

The Annex IV correspondences are an engineering reading, not a legal opinion. This paper maps clauses to artifacts to make an engineering-correctness argument; it does not constitute legal advice, and the determination of whether a specific system is high-risk, or whether a specific model swap is a substantial modification under Article 25, is a fact-specific legal question with no bright line that a competent adviser must make on the facts. ¹¹

The U.S. parallels are illustrative and partly unsettled. Colorado's SB 189 was unsigned as of this writing, its scope is narrower than the stayed SB 24-205, and its January 2027 effective date is contingent on signature. ¹⁸ The U.S. examples show direction of travel; they are not a jurisdiction-by-jurisdiction compliance map, and nothing here should be read as asserting symmetry between the EU regime and any U.S. statute.

The conformity-evidence target is itself still forming. The harmonized standards that will operationalize many of the Act's requirements — the standards against which conformity is presumed — were still in drafting as of this writing. An organization building its Annex IV evidence today is building against a target that may sharpen, and the specific format conformity assessors expect could shift as the standards finalize. The argument of this paper is robust to that shift, because re-runnable evaluation, meaningful telemetry, and change records are demanded under whatever format the standards settle on — but the exact packaging is not yet fixed.

And the corpus evidence is single-platform. The obligations_referenced finding, the model-upgrade analysis, and the observability arguments are drawn from one platform's codebase and telemetry. ¹²¹⁴ The mechanism the corpus names (a structurally-present, substantively-empty audit field) is a general failure pattern, but the specific instances are from a single source, and the generalization to other platforms is an argument from mechanism, not from a broad empirical sample. We flag this so the reader weighs the claim accordingly.

For the leadership-level version of this argument (the stakes, the one story, and the three questions that tell you where you stand), read the companion brief, The Audit Field Was Always Empty. ²¹

Not that the Act is simple, or settled, or that one platform's evidence proves the general case — but that the engineering a team should already be doing is the engineering the law will ask it to show.

The honest promise