The Audit Trail Nobody Acts On: Provenance Records, Telemetry Baselines, and Policy-Enforced Release Gates for Agentic Systems

KellerAI

Section 01

Abstract

The enterprise AI governance conversation of the past two years has produced excellent dashboards. Confidence scores cascade across monitoring interfaces, trace IDs link every agent call to a session record, and compliance teams point at the instrumentation as evidence of seriousness. The problem is that none of it stops anything. An agentic system that logs every output without evaluating that output against a policy is a system whose governance layer is decorative. ¹

This paper documents the gap between logging and enforcement — what we call the dashboard-as-museum problem — and the four-move pattern that closes it. The four moves are: (1) a structured provenance record per run, written for a policy engine rather than for a human reader; (2) a telemetry layer with rolling baselines and anomaly thresholds that produce alerts rather than visuals; (3) an OPA / Rego release gate that reads the run record at emit time and blocks outputs that fail the evidence bar; and (4) a human-in-the-loop path that is a defined exception with a cost, not a bypass valve.

The worked example throughout is KellerAI's cross-discipline research (CDR) plugin, a multi-agent synthesis system whose existing quality controls — structured confidence ratings, multi-phase gates — provide the attach points for the governance layer. CDR is the concrete case; the pattern applies to any agentic system that produces outputs downstream consumers act on.

The regulatory context is explicit. The EU AI Act's Article 12 automatic logging obligations for high-risk systems become enforceable on 2 August 2026. ² Logging addresses Article 12; a release gate supplies the evidence of control that Article 26 deployer obligations call for. ³ The NIST AI Risk Management Framework's Measure and Manage functions ask for operational tracking and third-party risk controls that the telemetry and gate pattern directly addresses. ⁴ GDPR Article 22 places constraints on automated decisions without human intervention that the human-in-the-loop exception pattern is designed to satisfy. ⁵

For the executive summary of this argument (the gap, the pattern, the honest limits), read the companion brief, The Audit Trail Nobody Acts On. ²⁷

Logging is evidence that you watched. A release gate is evidence that you controlled. The governance frontier is not instrumentation — it is enforcement wired to the output path.

The thesis

Section 02

The Dashboard-as-Museum Problem

An audit trail you only read after an incident is a logfile, not a control. That distinction sounds obvious once stated, but most enterprise AI governance architectures violate it structurally: they are built on the premise that comprehensive logging constitutes governance, when logging is at best the precondition for governance. ¹

The pattern is familiar from financial services, where data lineage and model risk management disciplines have been building audit infrastructure for decades. The NIST AI RMF's Measure function, the ISO/IEC 42001 management-system standard, and the OECD AI Principles all require ongoing monitoring and evaluation — and in each case the standard is written around the gap between data collection and consequential action. ⁴ ⁶ ⁷ Collecting is necessary. Connecting the collection to a decision is the discipline.

In agentic systems the problem compounds. A single-model pipeline has one natural audit surface: the output. A multi-agent pipeline has as many surfaces as it has agent boundaries, and the handoffs between agents, where a low-confidence output from one agent becomes unquestioned input to the next, are precisely the points a post-hoc log cannot reach. This is the protocol-stack audit gap: each agent trusts the upstream output, the audit trail shows the full chain only after the fact, and no control evaluates the low-confidence handoff in the middle. ²⁶

The governance patterns converging across regulated enterprise platforms are explicit about this. Auditability that captures reasoning pathways, release-board sign-offs that are policy-enforced, and daily performance loops that detect drift against a baseline — these are governance mechanisms that act on outputs, not monitoring tools that observe them. The difference is enforcement, and the instrument of enforcement is a gate wired to the output path.

Most AI governance architectures have wired up logging. They have not wired logging to a decision. That gap is the dashboard-as-museum problem.

The structural failure

Section 03

What the CDR Plugin Already Has

KellerAI's cross-discipline research plugin is not ungoverned. Both of its orchestrator agents embed evidence-quality controls directly in their instruction rails — before any output is produced. A synthesis investigation assigns every claim a structured confidence rating drawn from a closed vocabulary: PROVEN, EMERGING, THEORETICAL, SPECULATIVE, or CONTRADICTED. An opposition research investigation adds VERIFIED, PLAUSIBLE, MISLEADING, FALSE, and UNVERIFIABLE. Both orchestrators run multi-phase gates that must pass before the next phase begins. The controls are real.

The gap is observability and enforcement: none of the above produces a machine-readable record, none is instrumented for metrics collection, none is enforced at emit time by executable policy, and no reasoning pathway persists beyond the ephemeral agent session. When the quality controls exist as prose in instruction rails rather than as durable records evaluated by a policy engine, they are governance by assertion rather than governance by evidence.

The existing primitives are precisely the attach points the four-move pattern needs. The confidence ratings are the payload the run record captures. The phase gates are the events the telemetry instruments. The final output emit is the moment the release gate evaluates. Nothing in the existing agent behavior is removed; the governance layer adds durability, observability, and enforcement to controls that already exist.

Section 04

Move 1: The Provenance Record

The first move is to replace the implicit reasoning trace — which exists only inside the LLM turn and disappears when the session ends — with an explicit, durable, machine-readable run record. This is not a new concept: W3C PROV established a provenance data model for exactly this purpose, and supply-chain attestation frameworks such as SLSA and in-toto have built on it for software artifacts. ⁸ ⁹ ¹⁰ The innovation in the agentic context is applying the same discipline to reasoning artifacts rather than software artifacts.

The run record opens at investigation entry and is populated progressively: at each connector dispatch, at each phase-gate evaluation, at each confidence rating assignment. By the time the report is assembled, the record is complete and becomes the input to the enforcement decision, not a side effect of it. The record is a JSON sidecar with a stable schema and a stable identifier, the trace_id. It captures every source with its access date and quality tier, every connector call with latency and result count, every confidence rating with its verbatim justification, and every phase-gate pass/fail with its timestamp.
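The record shape described above can be sketched with Python dataclasses. This is a minimal illustration, not CDR's actual schema: apart from trace_id, every field name, the tier label, and the example values are assumptions chosen to mirror the prose.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


def _now() -> str:
    """UTC timestamp in ISO 8601 form."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class Source:
    url: str
    accessed: str       # access date, ISO 8601
    quality_tier: str   # tier labels are hypothetical, e.g. "primary"


@dataclass
class ConnectorCall:
    connector: str
    query: str
    latency_ms: float
    result_count: int


@dataclass
class ConfidenceRating:
    claim: str
    rating: str         # one of the closed-vocabulary ratings, e.g. "EMERGING"
    justification: str  # verbatim justification text


@dataclass
class PhaseGate:
    phase: str
    passed: bool
    timestamp: str = field(default_factory=_now)


@dataclass
class RunRecord:
    investigation_type: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    opened_at: str = field(default_factory=_now)
    sources: list = field(default_factory=list)
    connector_calls: list = field(default_factory=list)
    confidence_ratings: list = field(default_factory=list)
    phase_gates: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize with sorted keys so the sidecar layout is stable across runs."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# The record is opened at investigation entry and populated progressively.
record = RunRecord(investigation_type="synthesis")
record.sources.append(Source("https://example.org/paper", "2025-01-15", "primary"))
record.connector_calls.append(ConnectorCall("web_search", "example query", 412.0, 7))
record.confidence_ratings.append(
    ConfidenceRating("Example claim", "EMERGING", "Two independent sources agree.")
)
record.phase_gates.append(PhaseGate("source_collection", passed=True))
sidecar = record.to_json()
```

Keeping the record as plain data with a single serialization path is what lets the same document serve as policy input later, with the human-readable report rendered from it.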

The critical design property is that the record is written for a policy engine, not for a human. A human-readable report is assembled later and is a rendering of the run record; the run record itself carries the structured evidence the gate evaluates. This is the same distinction that separates a financial transaction record from a bank statement: one is the authoritative data, the other is a presentation of it.

The Model Context Protocol provides the transport layer through which CDR dispatches its external connectors. ¹¹ ¹² Each MCP tool call is a natural emit point: connector name, query, result count, and latency all belong in the run record. The trace_id propagates across the MCP boundary, linking the connector call record to the investigation record and enabling audit reconstruction across session boundaries. OpenTelemetry provides the semantic conventions for spans and traces that make this propagation standard rather than bespoke. ¹³ ¹⁴
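As a rough illustration of a connector call as an emit point, the sketch below wraps a dispatch, measures latency and result count, and passes the trace_id along with the call. It uses only the standard library: the `call(query, metadata)` signature stands in for a real MCP tool invocation and is an assumption, as are the connector name and record keys. A production version would more likely emit OpenTelemetry spans.

```python
import time
import uuid
from typing import Callable


def record_connector_call(run_record: dict, connector: str, query: str,
                          call: Callable[[str, dict], list]) -> list:
    """Dispatch one connector call and append its measurements to the run record.

    `call` is a stand-in for the real tool invocation (hypothetical signature):
    it receives the query plus a metadata dict carrying the trace_id.
    """
    started = time.perf_counter()
    results = call(query, {"trace_id": run_record["trace_id"]})
    latency_ms = (time.perf_counter() - started) * 1000.0
    run_record["connector_calls"].append({
        "trace_id": run_record["trace_id"],
        "connector": connector,
        "query": query,
        "result_count": len(results),
        "latency_ms": round(latency_ms, 3),
    })
    return results


def fake_search(query: str, metadata: dict) -> list:
    # Stand-in connector: echoes the trace_id so propagation can be checked.
    return [{"title": "stub result", "trace_id": metadata["trace_id"]}]


run_record = {"trace_id": uuid.uuid4().hex, "connector_calls": []}
hits = record_connector_call(run_record, "web_search", "example query", fake_search)
```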

Sigstore-style signing of the completed run record — committing the artifact to a transparency log before it is used as gate input — provides a tamper-evident chain from the investigation evidence to the gate decision. ¹⁵ This is not required for the basic pattern, but it is the natural extension when the run record itself must be auditable by an external party, such as a regulator verifying compliance with EU AI Act Annex IV technical documentation requirements. ¹⁶
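The tamper-evidence idea can be illustrated without any signing infrastructure. The sketch below is not Sigstore and involves no transparency log; it shows only the underlying step of fixing a digest over a canonical encoding of the completed record, so that any later change is detectable. The function names and record contents are hypothetical.

```python
import hashlib
import json


def record_digest(run_record: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, compact separators)."""
    canonical = json.dumps(run_record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_digest(run_record: dict, expected: str) -> bool:
    """True only if the record is unchanged since the digest was taken."""
    return record_digest(run_record) == expected


completed = {"trace_id": "abc123", "sources": ["s1", "s2"], "confidence_ratings": []}
digest_at_completion = record_digest(completed)

# Any edit after completion, here an extra source, changes the digest.
tampered = dict(completed, sources=["s1", "s2", "s3"])
```

A digest alone proves nothing about who produced the record; that is the part a signing service and transparency log would add.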

A log is written for a human to read after something goes wrong. A run record is written for a policy engine to read at emit time. That distinction is everything.

The record's purpose

Section 05

Move 2: Telemetry with Baselines

The second move addresses the difference between a dashboard that displays current state and a monitoring system that detects drift against a baseline. Most observability implementations build the former and call it the latter. The distinction matters because a dashboard requires a human to notice the spike; a monitoring system with a defined threshold requires only a threshold.

OpenTelemetry's AI agent observability conventions provide the semantic layer: standardized span attributes for model calls, tool invocations, and agent handoffs that make cross-run comparison tractable without bespoke instrumentation per system. ¹³ ¹⁴ The OTLP export protocol routes collected signals to whatever backend the operator runs, making the telemetry layer platform-agnostic.

For CDR, the useful metric set maps directly to the existing connector dispatch and phase-gate infrastructure. Connector call latency and result count are emit-time measurements. Zero-result rate per connector is a derived metric: when a connector consistently returns empty result sets, something is wrong with the connector, the query construction, or the source availability. Confidence distribution across a run is the aggregate of the rating assignments in the run record. Phase-gate pass rate is the proportion of gate evaluations that succeed. Each of these has a natural baseline: the rolling mean over the preceding N investigations.
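A minimal sketch of the derived metrics and the rolling baseline follows, assuming run records shaped as plain dicts with `connector_calls` and `phase_gates` lists. The class name, key names, and window size are illustrative assumptions.

```python
from collections import deque
from statistics import mean


class RollingBaseline:
    """Rolling mean of a metric over the preceding N investigations."""

    def __init__(self, window: int):
        self.values = deque(maxlen=window)

    def baseline(self):
        """Mean of the stored window, or None before any run is recorded."""
        return mean(self.values) if self.values else None

    def observe(self, value: float):
        """Return the baseline that preceded this value, then store the value."""
        previous = self.baseline()
        self.values.append(value)
        return previous


def zero_result_rate(connector_calls: list) -> float:
    """Share of connector calls in one run that returned no results."""
    if not connector_calls:
        return 0.0
    empty = sum(1 for c in connector_calls if c["result_count"] == 0)
    return empty / len(connector_calls)


def phase_gate_pass_rate(phase_gates: list) -> float:
    """Proportion of gate evaluations in one run that succeeded."""
    if not phase_gates:
        return 0.0
    return sum(1 for g in phase_gates if g["passed"]) / len(phase_gates)


# With a window of 3, only the last three runs contribute to the baseline.
baseline = RollingBaseline(window=3)
for rate in [0.0, 0.25, 0.5, 0.25]:
    baseline.observe(rate)
```

Returning the prior baseline from `observe` keeps the comparison honest: a new run is judged against history that does not yet include it.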

The alert that matters most in practice is CONNECTOR_DEGRADED: when a connector's zero-result rate rises above a configurable threshold, something is degrading the source. For CDR, this alert surfaces a structural dependency that telemetry can measure but not fix: multiple research domains share a common external search dependency, meaning a single source availability problem silently degrades confidence distributions across several investigation types. The instrumentation makes that visible and measurable. Operational pressure to resolve the structural risk is a consequence of the visibility; the dashboarding alone is not the resolution.
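One way the CONNECTOR_DEGRADED check could look is sketched below. The threshold value, the `min_calls` floor, the alert payload fields, and the connector names are all assumptions for illustration; only the alert name comes from the text.

```python
from collections import defaultdict


def connector_alerts(runs: list, threshold: float, min_calls: int = 5) -> list:
    """Emit CONNECTOR_DEGRADED for connectors whose zero-result rate exceeds threshold.

    `runs` is a list of run records, each with a `connector_calls` list.
    `min_calls` (hypothetical) suppresses alerts based on too few observations.
    """
    totals = defaultdict(int)
    empties = defaultdict(int)
    for run in runs:
        for call in run["connector_calls"]:
            totals[call["connector"]] += 1
            if call["result_count"] == 0:
                empties[call["connector"]] += 1

    alerts = []
    for connector in sorted(totals):
        if totals[connector] < min_calls:
            continue
        rate = empties[connector] / totals[connector]
        if rate > threshold:
            alerts.append({
                "alert": "CONNECTOR_DEGRADED",
                "connector": connector,
                "zero_result_rate": rate,
                "threshold": threshold,
            })
    return alerts


recent_runs = [
    {"connector_calls": [
        {"connector": "web_search", "result_count": 0},
        {"connector": "web_search", "result_count": 0},
        {"connector": "patent_db", "result_count": 4},
    ]},
    {"connector_calls": [
        {"connector": "web_search", "result_count": 0},
        {"connector": "web_search", "result_count": 6},
        {"connector": "patent_db", "result_count": 2},
    ]},
]
alerts = connector_alerts(recent_runs, threshold=0.5, min_calls=2)
```

In this example `web_search` returned nothing on three of four calls, so it alone trips the alert. Aggregating per connector across runs is also what would expose a shared dependency, since every investigation type that routes through the same connector contributes to the same counter.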

The NIST AI RMF's Measure function and the EU AI Act's Article 72 post-market monitoring obligations both require ongoing performance tracking against defined criteria. ⁴ ¹⁷ The NIST GenAI Profile (AI 600-1) extends this specifically to generative AI systems, adding transparency and reliability dimensions to the measurement surface. ¹⁸ A telemetry layer with rolling baselines and defined alert thresholds is the operational implementation of those obligations — not a substitute for them, but the mechanism by which an abstract obligation becomes a concrete, actionable signal.

Section 06

Move 3: The OPA Release Gate

The third move is the one that converts observability into control. Everything before it is table stakes; this is the gate wired to the output path.

Open Policy Agent (OPA) and its Rego policy language provide the evaluation engine. ¹⁹ ²⁰ The policy-as-code discipline it embodies — first popularized by the CNCF community for Kubernetes admission control — is the right instrument for release gates because policies are plain text, version-controlled, reviewable alongside the code they govern, and updatable without a deployment. ²¹ Changing the evidence bar is a one-line diff, not a code change in a dozen places.

The gate reads the run record at the moment the report is about to be written. It does not evaluate the report; it evaluates the evidence that produced the report. The run record is the input document; the Rego policy is the evaluation; the output is a binary allow/block decision plus a list of failing rules. A run record that does not pass is blocked — not flagged, not annotated, blocked — before the output file is written.

The evidence bar for a CDR synthesis investigation requires a minimum number of independent sources and at least one claim rated PROVEN or EMERGING. For opposition research the bar adds domain coverage requirements. The Rego policy expressing these constraints is compact and readable:

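Rego

# Package name is illustrative; rego.v1 enables the if / contains / in keywords.
package cdr.release_gate

import rego.v1

# Block unless one of the evidence-bar rules below is satisfied.
default allow := false

# Synthesis investigations: enough sources and at least one PROVEN or EMERGING claim.
allow if {
  input.investigation_type == "synthesis"
  count(input.sources) >= data.policy.min_sources
  some r in input.confidence_ratings
  r.rating in {"PROVEN", "EMERGING"}
}

# Opposition investigations: the same source floor, full domain coverage,
# and at least one VERIFIED or PLAUSIBLE claim.
allow if {
  input.investigation_type == "opposition"
  count(input.sources) >= data.policy.min_sources
  covered := {r.domain | r := input.domain_coverage[_]}
  required := {"Technical", "Corporate", "Reputation", "Academic", "Competitive"}
  count(required - covered) == 0  # every required domain is covered
  some r in input.confidence_ratings
  r.rating in {"VERIFIED", "PLAUSIBLE"}
}

# Sensitive-data guard. object.get covers both an absent key and an explicit null;
# a bare `== null` comparison is undefined, and so never fires, when the key is absent.
deny contains "pii_masking_record_missing" if {
  input.input_sensitivity == "internal"
  object.get(input, "pii_masking_record", null) == null
}

# Route to human review when the evidence bar fails or any deny rule fires.
flag_for_human_review if {
  not allow
}

flag_for_human_review if {
  count(deny) > 0
}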

The gate's own decision — allow or block, and the failing rules if block — is itself appended to the run record. The audit trail therefore includes enforcement outcomes, not just evidence inputs. A regulator auditing the system can read the run record and verify not only what evidence was gathered but whether the gate passed or the output required a human override.
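To make the decision-appending step concrete, the sketch below mirrors the evidence bar in plain Python and writes the outcome back into the run record. It is an illustration only: a real deployment would query OPA rather than reimplement the policy. The rule names other than `pii_masking_record_missing`, the `gate_decisions` key, and the choice to treat the sensitive-data guard as blocking are assumptions.

```python
from datetime import datetime, timezone

REQUIRED_DOMAINS = {"Technical", "Corporate", "Reputation", "Academic", "Competitive"}
PASSING_RATINGS = {
    "synthesis": {"PROVEN", "EMERGING"},
    "opposition": {"VERIFIED", "PLAUSIBLE"},
}


def evaluate_gate(record: dict, min_sources: int) -> dict:
    """Return an allow/block decision plus the list of failing rules."""
    failing = []
    kind = record.get("investigation_type")
    if kind not in PASSING_RATINGS:
        failing.append("unknown_investigation_type")
    else:
        if len(record.get("sources", [])) < min_sources:
            failing.append("insufficient_sources")
        ratings = {r["rating"] for r in record.get("confidence_ratings", [])}
        if not ratings & PASSING_RATINGS[kind]:
            failing.append("no_claim_meets_rating_bar")
        if kind == "opposition":
            covered = {d["domain"] for d in record.get("domain_coverage", [])}
            if not REQUIRED_DOMAINS <= covered:
                failing.append("domain_coverage_incomplete")
    if (record.get("input_sensitivity") == "internal"
            and record.get("pii_masking_record") is None):
        failing.append("pii_masking_record_missing")
    return {
        "allow": not failing,
        "failing_rules": failing,
        "evaluated_at": datetime.now(timezone.utc).isoformat(),
    }


def gate_and_record(record: dict, min_sources: int) -> bool:
    """Evaluate the gate and append its decision to the run record itself."""
    decision = evaluate_gate(record, min_sources)
    record.setdefault("gate_decisions", []).append(decision)
    return decision["allow"]


weak_run = {
    "investigation_type": "synthesis",
    "sources": [{"url": "https://example.org/a"}],
    "confidence_ratings": [{"rating": "SPECULATIVE"}],
}
allowed = gate_and_record(weak_run, min_sources=5)
```

The caller writes the report only when `gate_and_record` returns True; either way the record now carries the enforcement outcome alongside the evidence.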

This is the AI Release Board pattern that regulated enterprise platforms converge on: a structured, policy-enforced sign-off gate before an agent output reaches production. The EU AI Act's Article 26 deployer obligations require that high-risk system operators implement human oversight measures and take appropriate action when a system does not perform as intended. ³ A gate that blocks a failing run record and routes it to a human reviewer is the operational implementation of that obligation. EU AI Act Article 50 transparency requirements similarly oblige deployers to ensure that outputs are identifiable and that users have access to meaningful information about system behavior. ²² The run record plus gate decision is that information.

The policy converts 12 existing prose quality gates into enforceable policy that cannot be bypassed by an agent deciding a gate was “close enough.”

The gate's role

Section 07

Move 4: Human-in-the-Loop as a Defined Exception

When the gate blocks, the orchestrator halts, presents the gap summary to a human reviewer, and requires explicit acknowledgment before the report is written. What is presented is the list of specific failing rules (“fewer than five independent sources,” “no claim rated above SPECULATIVE”), not the full run record and not a general alert. The human reviewer sees the precise reason the gate blocked and must decide whether to override it with that reason visible.

The design principle is that the human-in-the-loop path is an exception with a defined cost, not a bypass valve. Every override is logged with timestamp and reviewer identity. The override is appended to the run record alongside the gate decision, so the audit trail carries the full enforcement history: gate evaluated, gate blocked, human override with identity and timestamp. If the same gate fails repeatedly on the same investigation type, the pattern is visible and actionable.
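A minimal sketch of the override step follows, assuming the run record already holds a `gate_decisions` list whose last entry is the block being overridden. The key names, the requirement of a free-text reason, and the reviewer address are illustrative assumptions.

```python
from datetime import datetime, timezone


def record_override(record: dict, reviewer: str, reason: str) -> dict:
    """Append a human override to a blocked run record.

    Refuses to proceed unless the latest gate decision was a block and the
    reviewer supplies both an identity and a reason.
    """
    decisions = record.get("gate_decisions", [])
    if not decisions or decisions[-1]["allow"]:
        raise ValueError("override is only valid after a gate block")
    if not reviewer.strip() or not reason.strip():
        raise ValueError("override requires reviewer identity and a reason")
    override = {
        "reviewer": reviewer,
        "reason": reason,
        # Copy the failing rules so the record shows what the reviewer saw.
        "acknowledged_rules": list(decisions[-1]["failing_rules"]),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    record.setdefault("overrides", []).append(override)
    return override


blocked_run = {
    "trace_id": "abc123",
    "gate_decisions": [{"allow": False, "failing_rules": ["insufficient_sources"]}],
}
entry = record_override(
    blocked_run, "reviewer@example.com", "Niche topic; three sources is the full corpus."
)
```

Refusing anonymous or reasonless overrides is where the "defined cost" lives: the path stays open, but it cannot be taken silently.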

GDPR Article 22 requires that automated decision-making with significant effects on individuals provides for human intervention and the ability to contest decisions. ⁵ The Moody's analysis of human-in-the-loop governance for AI-driven risk and compliance explicitly identifies the distinction between human oversight as a formal control and human oversight as a rubber stamp. ²³ The override pattern here is designed to ensure the former: the human is presented with the specific failure, not a general notification, and their decision is recorded with evidence that it was made consciously.

The EU AI Act's Article 26 deployer obligations require that operators assign appropriate human oversight to high-risk systems and ensure that natural persons to whom oversight is assigned have the competency and authority to act. ³ An override that is logged, timestamped, attributed, and visible in the run record satisfies the “authority to act” requirement in a way that an implicit dismissal of a background alert does not. The goal is not to prevent human judgment — it is to ensure human judgment is applied consciously, not as a default because the system surfaced no friction.

The goal is not to prevent human judgment — it is to ensure human judgment is applied consciously, with the specific failure visible, and the decision recorded.

The exception design

Section 08

The Regulatory Frame

The regulatory framing throughout this section is deliberate: these obligations apply to systems that produce consequential outputs, not to specific technologies. Whether a system uses agentic AI, classical ML, or a rules engine, the obligations to log, evaluate, and control outputs apply if the system is in scope. We flag each mapping as one to verify against current text rather than a settled determination.

The EU AI Act's Article 12 automatic logging requirement for high-risk systems becomes enforceable on 2 August 2026. Annex III defines the high-risk categories; systems that fall within those categories must support automatic event logging sufficient to enable post-market monitoring. ² The run-record pattern directly addresses this: a structured, durable record per run is logging by design rather than by wrapper. Article 72's post-market monitoring obligations, which the Act places on providers, require collecting and reviewing performance data on systems in operation; the telemetry layer with rolling baselines is the operational implementation of that obligation. ¹⁷ Article 73 serious incident reporting requires that providers report serious incidents to the relevant authorities, with deployers expected to inform the provider when they identify one; a run record that captures the evidence state at the time of an anomalous output is the instrument that makes that report possible. ²⁴

Article 10 data governance obligations require that high-risk AI systems use training, validation, and testing data that meets quality criteria; the run record's capture of source quality tiers and connector provenance extends this discipline from training data to inference-time evidence. ²⁵ The NIST AI RMF's Measure and Manage functions ask for ongoing performance measurement and third-party component risk management; the telemetry layer and the release gate are the operational content of those functions. ⁴ ISO/IEC 42001's AI management system standard similarly requires controls over AI system outputs and ongoing performance review. ⁶

In each regulatory frame, the obligation is the same in substance: demonstrate that the system produces outputs you have evaluated and controlled, not merely outputs you have observed. The four-move pattern is the operational implementation of that demonstration.

Section 09

Honest Limits

A paper that argued for governance while overclaiming its evidence would fail its own discipline. Here is what this paper does not establish.

Evidence-bar calibration is an ongoing problem, not a solved one. A release gate is only as good as its evidence bar. A bar set too low passes weak outputs; a bar set too high creates friction that operators route around with overrides, defeating the control. Calibrating the threshold requires operational data — which is exactly what the telemetry layer produces over time. The initial evidence bar is a hypothesis, not a specification. Operators should expect to revise it based on the override rate and the quality of outputs that pass versus those that are blocked. Treating the initial bar as authoritative is the same mistake as treating the initial confidence thresholds in a credit model as permanent.
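The calibration signal can be computed directly from accumulated run records. The sketch below assumes each record carries an `investigation_type`, a `gate_decisions` list, and an optional `overrides` list; those key names and the sample history are illustrative assumptions.

```python
from collections import defaultdict


def override_rates(runs: list) -> dict:
    """Per investigation type: share of runs blocked, and share of blocks overridden."""
    stats = defaultdict(lambda: {"runs": 0, "blocked": 0, "overridden": 0})
    for run in runs:
        s = stats[run["investigation_type"]]
        s["runs"] += 1
        decisions = run.get("gate_decisions", [])
        if decisions and not decisions[-1]["allow"]:
            s["blocked"] += 1
            if run.get("overrides"):
                s["overridden"] += 1
    return {
        kind: {
            "block_rate": s["blocked"] / s["runs"],
            "override_rate": s["overridden"] / s["blocked"] if s["blocked"] else 0.0,
        }
        for kind, s in stats.items()
    }


history = [
    {"investigation_type": "synthesis", "gate_decisions": [{"allow": True}]},
    {"investigation_type": "synthesis", "gate_decisions": [{"allow": False}],
     "overrides": [{"reviewer": "a"}]},
    {"investigation_type": "synthesis", "gate_decisions": [{"allow": False}]},
    {"investigation_type": "synthesis", "gate_decisions": [{"allow": True}]},
]
rates = override_rates(history)
```

A high override rate on one investigation type suggests the bar is set above what that type can realistically meet; a near-zero block rate suggests the bar may be too low to discriminate. Neither number settles the question without a review of output quality.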

Telemetry exposes structural risk; it does not fix it. The CONNECTOR_DEGRADED alert surfaces a shared dependency. It makes the dependency visible and measurable. It does not provision additional connectors, diversify the source pool, or resolve the upstream reliability issue. Operators who see the alert and interpret “we now have a dashboard for this problem” as “we have resolved this problem” have misread the instrumentation. Operational pressure to resolve the structural risk is a consequence of the visibility; the visibility is not the resolution.

PII masking is a hard prerequisite, not a later enhancement. If an agentic system processes internal, client, or proprietary data, a masking layer before any external connector call is a prerequisite for audit-trail work. The run record captures connector queries. A connector query containing unmasked PII is a liability attached to the audit trail, not a control. The masking layer must ship before the run-record layer does, because enabling audit before masking produces an audit trail that itself violates the data protection obligations the audit is meant to demonstrate compliance with.
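The ordering constraint, mask first and record second, can be sketched as follows. The regular expressions are deliberately simplistic stand-ins that catch only obvious emails and phone-like numbers; a real masking layer needs far broader detection. The function names, placeholder tokens, and record keys are assumptions.

```python
import re

# Simplistic illustrative patterns; not adequate for production PII detection.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_query(query: str):
    """Mask emails and phone-like numbers; return the masked query and a masking record."""
    masked, emails = EMAIL.subn("[EMAIL]", query)
    masked, phones = PHONE.subn("[PHONE]", masked)
    masking_record = {"emails_masked": emails, "phones_masked": phones}
    return masked, masking_record


def dispatch_masked(run_record: dict, query: str, call):
    """Mask first, then log and dispatch only the masked query."""
    masked, masking_record = mask_query(query)
    run_record["pii_masking_record"] = masking_record
    run_record.setdefault("connector_queries", []).append(masked)
    return call(masked)


run_record = {"input_sensitivity": "internal"}
seen = []  # stand-in for the external connector: it just collects what it is sent
dispatch_masked(
    run_record, "contracts for jane.doe@example.com or +1 415 555 0100", seen.append
)
```

Because the unmasked query never reaches the run record or the connector, the audit trail does not itself become a store of personal data.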

Connector single-point dependency is an architecture problem outside the governance layer. Multiple research domains sharing a single external search dependency means a single source availability event degrades confidence distributions across several investigation types simultaneously. The governance layer makes this visible; it does not fix the architecture. Resolving the dependency requires provisioning direct connectors for each source domain — infrastructure work that precedes governance work in the operational priority order, even though governance surfaced the need.

The regulatory mappings are inferential. Each regulatory mapping in Section 8 describes an obligation that applies to systems producing consequential outputs; it does not determine that any specific system is in scope. Whether a given deployment falls within EU AI Act Annex III high-risk categories, whether GDPR Article 22 applies to a specific automated output, or whether NIST AI RMF obligations bind a particular organization are fact-specific questions requiring verification against current text and the system's own classification. This paper does not draw those conclusions on a reader's behalf.

The promise is not that governance is easy (it is not) but that the instruments it requires are already available, the attach points already exist, and the gap between logging and enforcement is an engineering choice, not a fundamental barrier.

The honest promise
