Abstract
A codebase audit is an act of trust. A team acts on an audit only when it trusts the findings enough to redirect a substantial share of its engineering effort toward what the audit recommends. This paper analyses why most codebase audits never earn that trust, and what an audit must guarantee to deserve it.
We identify four failure modes. Three are long-standing and well documented in the software-engineering literature. Evidence decay: an audit rests on documentation, comments, and design notes that empirical studies show fall out of sync with code as a structural inevitability of how software is built [2, 3, 4, 9, 10]. Analytical silos: when security, architecture, test coverage, and documentation are each assessed separately, cross-cutting risk (the kind that lives in the seams between concerns) becomes structurally invisible, a limitation the technical-debt literature names directly [6]. The findings cliff: audits terminate in a list of problems and leave the translation into a funded, sequenced roadmap to an unplanned human step that frequently never occurs.
The fourth failure mode is new. As codebase audits are increasingly performed by autonomous large-language-model agents, the audit process itself can silently skip steps, drift from its objective, lose its place in a long context, or terminate early while still emitting a confident, well-formatted report. These failure patterns are now empirically catalogued in the multi-agent-systems literature [22, 23]. We call this the unaccountable auditor, and we show that the underlying problem is not novel: it is the latest instance of one with a two-thousand-year lineage, the oversight of a delegated agent, which was named by Juvenal, formalised by agency theory in economics [31], operationalised by auditing theory in accounting [32], and rediscovered, with new tools, by AI safety research under the name scalable oversight [27, 28, 29, 30].
From this analysis we derive five principles for a trustworthy audit — holism, evidence-freshness grading, roadmap termination, self-accountability, and continuity under load — and describe how Repo-Audit, an implemented codebase-health-assessment skill, is built around them. The paper's central argument is that an audit becomes worth acting on not when its analysis gets sharper, but when its process becomes inspectable: when the audit, like a well-instrumented production system, can be inspected, conformance-checked, and verified rather than merely trusted. Implementation mechanics and empirical evaluation are out of scope by design; no performance results are claimed.
Introduction
The trust problem in codebase auditing
Almost every engineering organisation eventually commissions an assessment of its own code. The occasion varies — a health check before a large refactor, an architecture review after an acquisition, a maturity assessment a customer has demanded, a due-diligence pass before an investment — but the artefact is recognisably the same: a structured report on the state of a codebase, its risks, and what should be done about them. The report is produced. It is often genuinely good. It is presented, praised, and filed. And then, with remarkable regularity, nothing changes.
This is the central failure of codebase auditing, and it is widely misdiagnosed. It is tempting to attribute the inertness of an audit to the quality of its analysis — to assume that a sharper scan, a better severity model, or a more experienced reviewer would have produced a report that landed. Usually it would not. The audit that changed nothing typically found real problems. The failure is not analytical. It is a failure of trust and actionability: the audit did not arrive in a state that a team could, or would, act on.
An audit is a diagnosis. A diagnosis is valuable only insofar as it is trusted enough, and actionable enough, to change behaviour. A codebase audit that a team does not trust — or cannot act on, or that has aged past the point of relevance before it is read — is not a cheap failure. It consumes analyst time, it consumes the attention of senior engineers in review meetings, and, most expensively, it produces a false sense of having addressed the problem. The organisation believes it now knows the state of its code. It has paid for the feeling of being informed without acquiring the capacity to act.
This paper is concerned with the gap between commissioning an audit and anything changing because of it. We argue that the gap has four distinct causes, that three of them are old and one is new, and that closing all four requires treating the audit not as a document but as an accountable process.
Contributions
This paper makes the following contributions:
- A four-failure-mode account of why codebase audits routinely fail to change anything — evidence decay, analytical silos, the findings cliff, and the unaccountable auditor (§2).
- A synthesis of the software-decay literature establishing that the first three failure modes are long-documented, structurally rooted, and — by the field's own assessment — unsolved (§3).
- An identification of the fourth failure mode as genuinely new, introduced by the shift to autonomous agentic analysis and grounded in the emerging empirical literature on multi-agent-system failure (§4).
- A cross-disciplinary genealogy of the “audit the auditor” problem, tracing it through classical philosophy, agency theory in economics, auditor-independence theory in accounting, and scalable-oversight research in AI safety (§6).
- Five design principles for a trustworthy audit (holism, evidence-freshness grading, roadmap termination, self-accountability, and continuity under load), each tied to a specific failure mode and to the evidence that motivates it (§5).
- An explicit scoping of what a trustworthy audit requires at the level of principle, separated from any particular implementation of those principles.
Scope and non-goals
This is a paper about why. It analyses a problem and argues for a set of design commitments that answer it. It is deliberately not a paper about how: the orchestration topology, the encoding of conformance rules, the schema of the process record, and the inventory of specialist analyses are all implementation concerns, documented separately, and out of scope here.
This separation is not merely an editorial choice; it is part of the argument. The claim of this paper is that the principles are what make an audit trustworthy. An implementation is one realisation of those principles, and a good implementation can be evaluated only against principles stated independently of it.
Two further non-goals. First, this paper reports no benchmarks. Repo-Audit is implemented, but a rigorous empirical evaluation is future work, and we make no quantitative performance claims. Second, this paper does not survey the commercial audit-tooling market.
The Problem: Why Codebase Audits Fail to Land
An audit is an act of trust
Begin with the decision an audit is meant to inform. An audit succeeds when a team, having read it, redirects effort: it funds a remediation initiative, reorders a roadmap, halts a planned feature to address a risk. Every one of those responses is costly, and a team will pay that cost only in proportion to how much it trusts the audit and how cleanly the audit hands it something to act on.
Trust, here, is not a vague sentiment. It is a specific, defensible judgement that the audit's conclusions are (a) correct, (b) current, (c) complete with respect to the risks that matter, and (d) the product of a process that was actually carried out as claimed. An audit that cannot support that judgement will be — rationally — discounted. The four failure modes below are four distinct ways an audit fails to support it. They are not a ranked list of bugs; they are four independent leaks in the same vessel, and an audit that plugs three of them still does not hold water.
Failure mode I — Evidence decay
A codebase is not a fixed object. It is a process with a high rate of change, and an audit is a measurement taken of a moving target. By the time the report is written, reviewed, and circulated, the system it describes has moved.
This would be a manageable problem if the audit's own evidentiary basis were current. It is not. An audit reasons over the codebase and over a surrounding layer of secondary artefacts — architecture documents, design notes, code comments, READMEs, prior audit findings, the institutional memory of which assumptions still hold. That secondary layer is precisely the material that decays fastest, and §3 will show that its decay is not negligence but a structural property of software development.
The operative failure is not that some evidence is stale. It is that an ordinary audit has no representation of its own freshness. It presents a conclusion drawn from an eighteen-month-old design document with exactly the same visual and rhetorical confidence as a conclusion drawn from the current commit. The reader is given no way to tell which findings rest on solid ground and which rest on something that should have been re-verified. Faced with that opacity, a careful reader does the rational thing and discounts the entire report — because a document that will not tell you which of its claims are shaky has effectively told you that all of them might be.
Failure mode II — Analytical silos
The conventional audit decomposes its subject by concern. Security is assessed by one lens, architecture by another, test coverage by a third, documentation and dependency health by others. Each concern yields a section. The report is the concatenation of the sections.
Decomposition is a reasonable way to divide the work. It is a poor way to deliver the conclusion, because the most consequential risks in a codebase do not reside within any single concern. They reside in the interactions between concerns — in the seams. An architectural decision that quietly defeats a security control is invisible to a security scan that does not model architecture and to an architecture review that does not model the threat. A cluster of duplicated logic is a minor code-quality note in isolation; it becomes a serious finding only when cross-referenced with the observation that one of the copies is on the critical path of next quarter's roadmap and another has no test coverage.
A concatenation of expert sections cannot surface cross-cutting risk. The limitation is structural, not a matter of analyst diligence: no single reader of a siloed report ever holds all the findings in view at once, so the synthesis that would reveal the seam risk is never performed by anyone. What the audit needs is not more sections but a stage at which the findings are deliberately reasoned about against each other.
Failure mode III — The findings cliff
Consider where the conventional audit stops. It stops at a findings list: a severity-ranked enumeration of problems, sometimes with remediation hints. The implicit contract is that the team will take this list into a planning meeting and convert it into a roadmap — decide which findings become funded initiatives, in what order, traded off against every other demand on engineering time.
That conversion is the hardest and most valuable step in the entire exercise, and the audit declines to do it. The result is a discontinuity we call the findings cliff: the audit walks the reader confidently to the edge of "here is what is wrong" and stops, leaving the descent to "here is what we will actually do, and why, and in what order" entirely to a separate, unscheduled, and usually unfunded human effort.
Empirically, that effort frequently does not happen. A findings list of even moderate size is daunting to prioritise; without a proposed structure it tends to be deferred, then superseded by the next quarter's feature pressure. The audit is not wrong. It is simply inert — it terminated one step short of the form in which it could have been acted on.
Failure mode IV — The unaccountable auditor
The three failure modes above are old. The fourth arrived recently, with a genuinely beneficial change in how audits are produced.
Codebase audits are increasingly performed not by human consultants but by autonomous software agents built on large language models. This is, on balance, a substantial improvement: it makes a thorough audit fast, inexpensive, and repeatable in a way no human-staffed engagement ever could be, and §4 documents how rapidly the underlying capability has matured. But automating the auditor introduces a failure mode that simply did not exist when the auditor was a person.
An autonomous process that runs for an extended period over a large codebase can fail silently and partially. It can skip an analysis it was supposed to perform. It can lose its place — drop the thread of what it has already done and what remains. It can exhaust its working memory partway through and never recover. It can drift from its original objective, or terminate early in the belief that it is finished. Section 4 shows that each of these is a documented, named failure pattern of long-running agentic systems. The decisive characteristic is this: every one of them can occur while the agent still produces a final report that is fluent, well-structured, confident, and indistinguishable in form from a report backed by sound work.
When a human consultant audits your code, you can interrogate the process: ask what was examined, what was skipped, what assumptions were made. When an autonomous agent audits your code, the process is, by default, a black box, and the polish of the output is no evidence whatsoever about the soundness of the work behind it. The reader is asked to trust not only the findings but the unobservable conduct of the auditor.
This is the quis custodiet problem — who watches the watchmen — and §6 will show it is one of the oldest problems in the theory of delegated work. When the auditor is a machine, it stops being a rhetorical flourish and becomes a concrete engineering requirement: an audit cannot be trusted unless its own process can be audited.
| # | Failure mode | What goes wrong | Status | Evidence |
|---|---|---|---|---|
| I | Evidence decay | Findings rest on artefacts that have aged out; the audit has no representation of its own freshness | Long-standing | §3.1, §3.4 |
| II | Analytical silos | Concerns assessed separately; cross-cutting risk in the seams is never synthesised | Long-standing | §3 |
| III | The findings cliff | Audit terminates at a problem list; the translation to a roadmap is unplanned and rarely happens | Long-standing | §3.5, §5.3 |
| IV | The unaccountable auditor | An autonomous auditor can fail silently and partially while still emitting a confident report | New (agentic AI) | §4.3 |
Background: The Decay of Software Artifacts
The first three failure modes are not novel observations. They are the practical consequence of a body of empirical software-engineering research, accumulated over four decades, on how software and its surrounding artefacts decay. This section establishes that the decay is structural and, by the field's own assessment, unsolved — which is why an audit that ignores it is building on sand.
Decay is structural, not incidental
The foundational result is Lehman's laws of software evolution [2]. Studying long-lived systems, Lehman observed that a program embedded in the real world (an E-type system) must be continually adapted or it becomes progressively less satisfactory; and that, as it evolves, its complexity increases unless deliberate work is done to reduce it. Decay, in this framing, is not a symptom of a badly run project. It is closer to a thermodynamic tendency: the natural direction of an evolving system, resisted only by sustained, intentional effort.
Cunningham's technical-debt metaphor [1] gave the field its enduring vocabulary for the cost of that tendency. Two decades later, Kruchten, Nord, and Ozkaya moved the metaphor toward a usable theory [6], and the Dagstuhl seminar on managing technical debt produced a consensus definition and research agenda [7]. Two findings from that maturation matter here. First, technical debt is mostly incurred unintentionally and discovered late, which is exactly the gap an audit exists to close. Second, and more pointedly for failure mode II, Kruchten et al. warn that tooling detects only what tooling is built to detect; the structural and architectural debt that spans components tends to evade per-concern analysis. The literature, in other words, names the silo problem directly.
Architectural erosion and drift
The artefact that decays most consequentially is the architecture. Perry and Wolf [3] gave the field the precise distinction: erosion, the active violation of architectural principles, and drift, a passive insensitivity to the architecture that accumulates as changes are made without reference to it. Either way, the architecture as documented and the architecture as running diverge.
Van Gurp and Bosch [4] showed, through a worked evolution example, that this divergence is effectively inevitable given how software is actually developed. De Silva and Balasubramaniam's survey [5] confirms the scale of the problem, cataloguing two decades of erosion-control research while making clear that no approach has solved it. The implication for auditing is direct. Any audit that takes a static architecture document as a description of the system is reasoning from a source that the literature predicts will be wrong, and wrong in ways the document itself cannot signal.
Documentation decay and code–comment inconsistency
The most direct threat to an audit's evidentiary basis is the decay of the documentation and comments the audit reads. Here the empirical record is unambiguous.
Fluri, Würsch, and Gall [9] examined whether code and comments co-evolve and found that they largely do not: only a minority of comment changes are triggered by the code changes around them, and newly added code is frequently left uncommented. Wen, Nagy, Bavota, and Lanza [10] scaled the question dramatically, mining roughly 1.3 billion abstract-syntax-tree-level changes across 1,500 systems, and concluded that in most cases code and comments do not co-evolve; co-evolution occurred only about a fifth of the time. Aghajani et al. [12] built an empirically grounded taxonomy of documentation problems, showing that documentation defects, including out-of-dateness, are systematic and categorisable rather than incidental. Potdar and Shihab [8] showed that even self-admitted technical debt, the debt developers explicitly flag in their own comments, is unreliable as a record.
One source adds a necessary nuance. Lethbridge, Singer, and Forward [11], studying how engineers actually use documentation, confirmed that documentation is not maintained as completely as managers assume, but also found that out-of-date documentation often remains useful. The lesson for auditing is not that stale evidence is worthless. It is that stale evidence has an unknown trustworthiness, and an audit that cannot distinguish fresh evidence from stale is discarding information its reader urgently needs. The correct response is not to ignore aged artefacts but to grade them.
What the literature says about audits and reviews
Two further results frame what a trustworthy assessment should look like. Bacchelli and Bird's study of modern code review [14] found that although practitioners expect review to find defects, its actual outcomes are dominated by knowledge transfer, and that the binding constraint on review quality is the reviewer's understanding of the change and its context. This is a caution about the human fallback: the meeting in which an audit's findings are supposed to be turned into action is itself an unreliable instrument.
Kitchenham, Dybå, and Jørgensen's programme of evidence-based software engineering [15] supplies the positive standard. Importing the paradigm of evidence-based medicine, they argued that engineering decisions should rest on systematically gathered, critically appraised, and aggregated evidence rather than on expert opinion alone. That is the bar this paper holds an audit to: a trustworthy audit is one whose conclusions are evidence-chained, meaning traceable to appraised evidence whose currency is known, rather than asserted.
The Auditor Problem: When Analysis Is Automated
The autonomous auditor is now feasible
The capability to have a software agent autonomously analyse and act on a repository has matured very fast. SWE-bench [16] introduced a benchmark of 2,294 real GitHub issues and reported that early strong models resolved only a low single-digit percentage, establishing both that repository-scale autonomous work is genuinely hard and that it had become measurable. SWE-agent [17] then showed that a purpose-built agent–computer interface lets a model autonomously navigate a repository, read and edit files, and run tests. The systematic review by Hou et al. [18], covering 395 papers on large language models for software engineering, documents how broadly these techniques have already been deployed across the software lifecycle. The autonomous codebase auditor is not a speculative artefact; it is a current one. The question is no longer whether to automate the auditor but how to make the automated auditor trustworthy.
Multi-agent orchestration and its named enemy
A thorough audit is naturally decomposed across multiple specialist agents, and the multi-agent-orchestration literature provides the substrate. AutoGen [19] defines a framework for building applications from multiple conversable, configurable agents. MetaGPT [20] encodes human standardised operating procedures into agent workflows. ChatDev [21] structures a full development pipeline as a chain of communicating agents.
What is striking is that these frameworks independently name the same enemy. MetaGPT motivates its SOP-encoding design explicitly as a defence against "logic inconsistencies due to cascading hallucinations caused by naively chaining LLMs" [20]. ChatDev builds in a dedicated "communicative de-hallucination" mechanism [21]. The recurring lesson of the multi-agent literature is that chaining language-model agents does not merely risk individual errors; it risks compounding them, with each agent inheriting and amplifying the drift of the last, and that any serious multi-agent system must include explicit machinery to resist that compounding.
How long-running agents fail
The mechanism of silent failure is now well characterised. Liu et al. [22] showed that language models do not use long contexts uniformly: performance is highest when relevant information sits at the very start or end of the context and degrades markedly when it sits in the middle, producing a characteristic U-shaped accuracy curve. This is the mechanistic basis for an auditor "losing its place": as the working context of a long audit fills, information deposited earlier becomes progressively less reliably available, even though it is nominally still present.
The most direct evidence comes from Cemri et al. [23], who studied why multi-agent language-model systems fail and produced MAST, an empirically grounded taxonomy derived from more than 200 annotated execution traces across seven frameworks. MAST identifies fourteen failure modes in three categories (system design issues, inter-agent misalignment, and task verification failures), and the specific modes read as a precise description of the unaccountable auditor: step repetition, loss of conversation history, premature or unaware termination, and task derailment. These are not hypothesised risks. They are observed, catalogued behaviours of exactly the kind of system that now performs codebase audits.
Why the audit process must be observable
The constructive response comes from outside the agent literature, in the discipline of process mining. Van der Aalst [24] established a rigorous practice built on a simple premise: a process leaves an event log, and from that log one can discover the process that actually ran, check its conformance against a reference model of the process that was supposed to run, and detect deviations between the two.
This is precisely the capability the unaccountable auditor lacks and precisely the capability it needs. If an audit emits a complete, structured record of its own conduct — every analysis dispatched, every decision taken and on what evidence, every operational boundary encountered — then that record is an event log, and the audit process becomes conformance-checkable. The question "did the auditor actually do the work it claims?" stops being unanswerable and becomes a deviation check against an explicit model of correct conduct. Section 5 builds this into the principle of self-accountability.
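The deviation check described above can be sketched in a few lines. This is a minimal illustration of log-versus-model conformance checking, not a description of any particular process-mining tool or of Repo-Audit's implementation; the step names, the `Event` shape, and the `status` values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    step: str    # name of the audit activity (hypothetical names)
    status: str  # "completed" or "skipped" (hypothetical status values)


# Hypothetical reference model: the steps an audit should run, in order.
REFERENCE_MODEL = [
    "scan_structure",
    "scan_security",
    "scan_tests",
    "synthesise",
    "write_report",
]


def check_conformance(log, model=REFERENCE_MODEL):
    """Return the deviations between what ran (log) and what should have run (model)."""
    deviations = []
    completed = [e.step for e in log if e.status == "completed"]
    # A step in the model that never completed is a deviation.
    for step in model:
        if step not in completed:
            deviations.append(f"missing step: {step}")
    # A completed step that the model does not know about is a deviation.
    for step in completed:
        if step not in model:
            deviations.append(f"unexpected step: {step}")
    # Completed steps must appear in the same relative order as in the model.
    positions = [model.index(s) for s in completed if s in model]
    if positions != sorted(positions):
        deviations.append("steps ran out of order")
    return deviations
```

A log in which `scan_tests` was skipped yields a single "missing step" deviation even if a `write_report` event is present, which is the point: the existence of a report says nothing about whether the steps behind it ran.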
Principles of a Trustworthy Audit
The preceding sections diagnose. This section prescribes. From the four failure modes and the literature behind them we derive five principles that, together, define a trustworthy audit. Repo-Audit is built around these five principles, and each subsection below states the principle, the failure mode it answers, the evidence that motivates it, and the corresponding commitment Repo-Audit makes, noting honestly how fully that commitment is realised today.
| Principle | Answers | Evidence | Repo-Audit's commitment | Realised today |
|---|---|---|---|---|
| Holism | Mode II | [6] | One coordinated assessment; findings synthesised against each other | Yes |
| Evidence-freshness grading | Mode I | [9, 10, 11] | Every cited source graded by age; stale evidence surfaced, not hidden | Partly — source-level grading, not per-finding propagation |
| Roadmap termination | Mode III | [15, 25, 26] | The audit ends in strategic initiative briefs grounded in its findings | Partly — briefs grounded in findings; the initiative slate is curated |
| Self-accountability | Mode IV | [23, 24] | An inspectable, conformance-checked record of the audit's orchestration | Partly — orchestration invariants checked, not analysis depth |
| Continuity under load | Mode IV (mechanism) | [22, 23] | Operational boundaries handled as planned, state-carrying handoffs | Yes |
Holism — synthesis over silos
The principle. An audit must be a single coordinated assessment, not a concatenation of independent ones, and it must include an explicit stage at which findings from different concerns are reasoned about against each other.
Why. Failure mode II is structural: cross-cutting risk lives in the seams between concerns, and Kruchten et al. [6] warn that per-concern tooling tends to miss it. The fix is not a better scanner for any one concern; it is the addition of a synthesis step that no siloed audit performs.
Repo-Audit's commitment. Repo-Audit coordinates its specialist analyses — spanning structure, security, test coverage, documentation, duplication, and architecture — as parts of one assessment, and then explicitly reasons across their findings. A cross-cutting conflict, such as an architectural choice that undermines a security control or a high-churn module with both poor coverage and a roadmap dependency, is treated as a first-class finding in its own right, not an observation that happens to fall between two sections.
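One simple way to picture a synthesis step is to group findings by the code unit they concern and flag any unit that attracts findings from more than one concern. The sketch below assumes this grouping heuristic for illustration only; the `Finding` fields, the `min_concerns` threshold, and the example concern names are hypothetical and are not taken from Repo-Audit.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    concern: str  # e.g. "security", "coverage", "architecture" (hypothetical labels)
    module: str   # the code unit the finding is about
    summary: str


def cross_cutting(findings, min_concerns=2):
    """Return modules whose findings span at least `min_concerns` distinct concerns."""
    by_module = defaultdict(list)
    for finding in findings:
        by_module[finding.module].append(finding)
    return {
        module: group
        for module, group in by_module.items()
        if len({f.concern for f in group}) >= min_concerns
    }
```

A duplicated-logic note on a module is minor on its own; the same module appearing here alongside a coverage gap and a roadmap dependency is what turns it into a seam finding. A real synthesis stage would need richer relations than a shared module name, so this shows only the shape of the step.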
Evidence-freshness grading
The principle. An audit must represent the currency of the evidence under each finding, and no finding may carry more confidence than its stalest supporting source.
Why. Section 3 established that documentation and comments decay as a structural matter [9, 10] and that the trustworthiness of an aged artefact is unknown rather than zero [11]. The damage of failure mode I is done not by the existence of stale evidence but by the audit's silence about it. An audit that grades freshness converts an invisible risk into a visible, actionable one.
Repo-Audit's commitment. Repo-Audit grades every source it cites by currency, classifying each as fresh, recent, aging, or stale, and it surfaces that grade rather than hiding it: a conclusion resting on an aged design note is marked as exactly that. The governing principle is that no finding should rest, unflagged, on evidence that has gone stale. Carrying a source's staleness all the way through to a per-finding confidence score is a design goal the current implementation only partly meets — source-level grading is in place, finding-level propagation is not yet.
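The rule that no finding may carry more confidence than its stalest supporting source can be made concrete with a small sketch. It illustrates the principle only: the day thresholds are invented for the example, and the finding-level propagation shown here is the design goal the text describes, not a claim about what the current implementation does.

```python
from datetime import date

# Hypothetical age thresholds in days; anything older than the last limit is "stale".
THRESHOLDS = [(30, "fresh"), (180, "recent"), (365, "aging")]
ORDER = ["fresh", "recent", "aging", "stale"]  # freshest to stalest


def grade_source(last_verified, today):
    """Grade a single source by the number of days since it was last verified."""
    age_days = (today - last_verified).days
    for limit, label in THRESHOLDS:
        if age_days <= limit:
            return label
    return "stale"


def grade_finding(source_dates, today):
    """A finding is graded by its stalest supporting source."""
    if not source_dates:
        return "stale"  # assumption: a finding with no dated evidence is treated as stale
    grades = [grade_source(d, today) for d in source_dates]
    return max(grades, key=ORDER.index)
```

Taking the maximum over `ORDER` positions means a finding supported by the current commit and by an eighteen-month-old design note is reported as stale, so the reader sees the weakest link rather than the strongest.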
Roadmap termination — closing the findings cliff
The principle. An audit must not terminate at a findings list. It must terminate at a proposed roadmap of initiatives, each one traceable to the specific findings that justify it.
Why. Failure mode III is the discontinuity between "what is wrong" and "what we will do." Ruhe and Saliu [25] frame release planning as the explicit selection and sequencing of work under value, resource, and dependency constraints, and Phaal et al. [26] establish roadmapping as a structured technique linking current position to future intent. Kitchenham et al. [15] supply the standard the roadmap must meet: it must be evidence-chained, each proposed initiative resting on appraised evidence rather than assertion.
Repo-Audit's commitment. Repo-Audit performs the crossing as part of the audit. Rather than stopping at a findings list, it produces a set of strategic initiative briefs, each one grounded in — and annotated with — the specific findings from the audit that bear on it. The translation from diagnosis to plan happens inside the audit, delivered in a state a team can act on immediately.
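Traceability from initiative to finding is a property that can be checked mechanically. The following sketch assumes a hypothetical `Brief` record that lists the identifiers of the findings it rests on; the field names and the identifier format are illustrative, not Repo-Audit's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Brief:
    title: str
    finding_ids: list = field(default_factory=list)  # findings that justify the initiative


def untraceable(briefs, known_findings):
    """Return titles of briefs that cite no finding, or cite a finding the audit never produced."""
    bad = []
    for brief in briefs:
        cites_nothing = not brief.finding_ids
        cites_unknown = any(fid not in known_findings for fid in brief.finding_ids)
        if cites_nothing or cites_unknown:
            bad.append(brief.title)
    return bad
```

An empty result means every proposed initiative is chained to at least one finding that actually exists in the audit; a non-empty one names the initiatives that are asserted rather than evidenced.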
Self-accountability — the audit you can audit
The principle. An audit must emit a complete, inspectable record of its own conduct, and that record must be checked against an explicit model of correct conduct.
Why. This is the direct answer to failure mode IV. Section 4 showed that autonomous auditors fail silently and partially; §4.4 showed that process mining [24] makes such failures detectable if the process emits an event log that can be conformance-checked. The principle simply requires the audit to produce that log and to check it.
Repo-Audit's commitment. Repo-Audit treats its own execution as a first-class subject of observation. It records what analyses were dispatched, what each concluded, where decisions were made and on what evidence, and what operational boundaries were encountered, and it checks that record against an explicit set of rules for how the audit must be conducted. These rules catch a real class of process deviation: an unlogged dispatch, a broken orchestration step, an unacknowledged boundary. They do not yet adjudicate the harder question of whether a dispatched analysis was substantively thorough. What the principle delivers today is the precondition for trust: the audit reports not only on the codebase but on its own conduct.
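The three deviations named above (an unlogged dispatch, a broken orchestration step, an unacknowledged boundary) can each be expressed as an invariant over a conduct record. The sketch below is one assumed encoding: the event types `dispatch`, `result`, `boundary`, and `boundary_ack`, and the dictionary shape of each event, are hypothetical and are not the paper's schema.

```python
def check_conduct(record):
    """Check a conduct record (a list of event dicts) against three orchestration invariants."""
    violations = []
    dispatched, concluded, open_boundaries = set(), set(), set()
    for event in record:
        kind = event["type"]
        if kind == "dispatch":
            dispatched.add(event["analysis"])
        elif kind == "result":
            # Invariant 1: a result must be preceded by a logged dispatch.
            if event["analysis"] not in dispatched:
                violations.append(f"unlogged dispatch: {event['analysis']}")
            concluded.add(event["analysis"])
        elif kind == "boundary":
            open_boundaries.add(event["id"])
        elif kind == "boundary_ack":
            open_boundaries.discard(event["id"])
    # Invariant 2: every dispatched analysis must record a result.
    for name in sorted(dispatched - concluded):
        violations.append(f"no result recorded: {name}")
    # Invariant 3: every operational boundary must be acknowledged.
    for boundary_id in sorted(open_boundaries):
        violations.append(f"unacknowledged boundary: {boundary_id}")
    return violations
```

Consistent with the limitation stated above, these checks say nothing about whether a dispatched analysis was thorough; they establish only that the orchestration happened as recorded.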
Figure: The trustworthy audit process. Coordinated specialist analyses → (evidence graded by freshness) → Synthesis: cross-cutting findings → Strategic initiative briefs. Throughout, the audit writes a conduct record (analyses run, decisions, boundaries), which is conformance-checked against explicit rules: if the record conforms, the result is a verdict a team can trust; if it deviates, the deviation is surfaced for review.
Continuity under load
The principle. An audit must treat its own operational limits — exhaustion of working memory, excessive elapsed time — as planned, state-preserving transitions, not as crashes or silent truncations.
Why. A genuine audit of a real codebase consumes time and working context, and §4 established that as context fills, earlier information becomes unreliable [22] and long-running agents terminate prematurely or lose their place [23]. An audit that hits a limit and either crashes or, worse, stops quietly while emitting a complete-looking report has failed in exactly the way that is hardest to detect.
Repo-Audit's commitment. Repo-Audit anticipates its boundaries. As it approaches a limit, it hands the audit off — carrying its accumulated state forward into a fresh session — and continues deliberately, so that the audit reaches its end on purpose rather than failing in place or truncating silently.
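A state-carrying handoff can be illustrated with a toy session loop that stops at a budget, serialises what it knows, and resumes in a fresh session. This is a sketch under stated assumptions: the `AuditState` fields, the step-count budget standing in for a context or time limit, and the JSON payload are all hypothetical choices made for the example.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class AuditState:
    pending: list                                  # analyses still to run
    done: list = field(default_factory=list)       # analyses already completed
    findings: list = field(default_factory=list)   # accumulated results
    handoffs: int = 0                              # boundaries crossed so far, kept for the record


def run_session(state: AuditState, budget: int) -> Optional[str]:
    """Run pending analyses until finished or the budget is reached.

    Returns a serialised handoff payload if work remains, or None when the audit is complete.
    """
    used = 0
    while state.pending:
        if used >= budget:
            state.handoffs += 1  # the boundary is recorded, not hidden
            return json.dumps(asdict(state))
        step = state.pending.pop(0)
        state.findings.append(f"finding from {step}")  # placeholder for real analysis
        state.done.append(step)
        used += 1
    return None


def resume(payload: str) -> AuditState:
    """Rebuild the audit state in a fresh session from a handoff payload."""
    return AuditState(**json.loads(payload))
```

The design point is that hitting the budget produces an explicit payload and an incremented `handoffs` counter, so a truncated run cannot be mistaken for a completed one.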
Holism, freshness grading, and roadmap termination make an audit useful. Self-accountability and continuity under load make it trustworthy. All five are non-negotiable.
Cross-Disciplinary Framing: Who Audits the Auditor
Sed quis custodiet ipsos custodes? — But who will watch the watchmen themselves? — Juvenal, Satires VI
The deepest claim of this paper is that failure mode IV — the unaccountable auditor — is not a new problem. It is the newest instance of one of the oldest problems in the theory of delegated work, and recognising its lineage is what allows the response to it to be principled rather than improvised.
The classical problem
The question is Juvenal's, and its framing — guardians set to watch over a household, and the unresolved problem of who then watches the guardians — echoes a concern already present in Plato's Republic about the guardians of the just city. The classical formulation already contains the essential structure: oversight cannot be fully self-enforced, because any overseer is itself a party that may fail or defect, and appointing an overseer of the overseer only relocates the problem. The classical texts pose the problem; they do not solve it. What later disciplines contribute is not a solution that closes the regress but a set of mechanisms that make the residual risk small and observable.
Economics: agency theory
Economics gave the problem its formal structure. Jensen and Meckling's theory of the firm [31] analyses the agency relationship: whenever a principal delegates work to an agent, the agent's interests and information do not perfectly align with the principal's, and the agent will not always act as the principal would wish. The divergence cannot be eliminated; it can only be managed, and managing it has a price: agency costs, comprising the principal's monitoring expenditure, the agent's bonding expenditure, and an irreducible residual loss.
This framing transfers exactly to the automated auditor. An autonomous audit agent is an agent in the precise economic sense: a delegate, acting on a principal's behalf, whose conduct the principal cannot directly observe. Its silent drift — the skipped analysis, the premature termination — is not a mysterious software defect; it is an agency cost. Agency theory's lesson is twofold and bracing: such costs are real and never zero, and the rational response is not to wish them away but to invest deliberately in monitoring that brings them down. Self-accountability (§5.4) is that monitoring investment.
Accounting: auditor independence
Accounting is the discipline that has thought longest and hardest about the specific case of an auditor as the delegated agent. DeAngelo's analysis of auditor independence [32] examines why the independence on which audit quality depends is structurally fragile: how the incentive structures surrounding an audit engagement can quietly compromise the auditor's objectivity even when no party intends it. The accounting tradition's response is institutional: independence requirements, the audit trail, conformance to explicit standards, and review of the audit itself.
The relevant transfer is the concept of the audit trail. Accounting long ago concluded that an audit which cannot be retraced cannot be trusted, and that the auditor must therefore leave an inspectable record of what was examined and concluded. The conduct record of §5.4 is an audit trail for an automated auditor, and the conformance check is the descendant of accounting's insistence that an audit be verifiable against an explicit standard rather than accepted on the auditor's authority.
AI safety: scalable oversight
The most recent discipline to confront the problem is AI safety, which has rediscovered it under the name scalable oversight, and with new urgency, because its agents may exceed human ability to check their work directly. Amodei et al. [27] named scalable oversight as a concrete safety problem: the difficulty of adequately overseeing an agent whose objective is too expensive to evaluate at every step. Irving et al. [28] proposed a mechanism: framing oversight as a debate between agents adjudicated by a weaker judge. Bowman et al. [29] turned scalable oversight into an empirical research programme. Saunders et al. [30] built a practical primitive, models trained to write natural-language critiques of other models' outputs, while honestly reporting its limit: critique ability lags task ability, so an overseer may simply be unable to catch the most sophisticated errors.
That honest limit matters for this paper. It confirms that the quis custodiet regress does not fully close: an audit's conduct can be checked against explicit rules, but the rules themselves are an artefact that must be designed, and no mechanism makes oversight perfect. The claim of §5 is not that self-accountability eliminates the risk of a bad audit. It is that self-accountability makes the risk observable and small — which, agency theory tells us, is the most any monitoring investment can achieve.
| Discipline | The problem's name there | Core contribution | Source |
|---|---|---|---|
| Classical philosophy | Quis custodiet ipsos custodes | Names the structural impossibility of fully self-enforced oversight | Juvenal; Plato |
| Economics | Agency costs | An auditor's undetected drift is an agency cost; monitoring is a rational, priced investment | [31] |
| Accounting | Auditor independence; the audit trail | An audit must leave an inspectable trail and be verifiable against an explicit standard | [32] |
| AI safety | Scalable oversight | Mechanisms — critique, debate, conformance — make oversight tractable, though never perfect | [27, 28, 29, 30] |
The synthesis is this. An automated codebase auditor is an agent in the exact sense economics defined; its silent failure is an agency cost in the exact sense economics priced; the remedy, an inspectable trail checked against an explicit standard, is the one accounting settled on long ago; and the formal study of how to oversee an agent that may outpace its overseer is the active frontier of AI safety. Repo-Audit's principle of self-accountability is not an invention. It is the application, to a new kind of auditor, of a remedy that economics, accounting, and AI safety each reached for a problem classical philosophy first named. That convergence is corroboration, though not proof, that the principle is sound.
Implications and Open Questions
What changes when audits become cheap and frequent
Automation does not merely make the existing audit faster; it changes what an audit is. A human-staffed audit is necessarily a rare event — expensive, scheduled, disruptive — and its rarity is itself a source of failure mode I, because a once-a-year snapshot is stale for most of the year it describes. An automated, accountable audit can be run often: before a refactor, at the start of a planning cycle, after a significant architectural change. Frequency is the most powerful single defence against evidence decay, because a finding's evidence cannot age far if the audit is re-run before it does.
This raises a genuine tension. If audits become cheap enough to run continuously, the distinction between a periodic audit and continuous monitoring begins to blur. We take the view that the two remain distinct in kind: monitoring answers "did anything just change for the worse?" against a fixed model, while an audit re-derives the model — re-examines whether the codebase's structure, risks, and roadmap readiness still hold as a whole.
The limits of automated judgement
A trustworthy audit narrows a decision; it does not make it. The five principles ensure the audit is holistic, current, actionable, verifiable, and complete, but the question of which initiatives to actually fund, against the full context of a business, remains a human decision, and should remain one. The distinction worth drawing is between detection and judgement. An automated auditor's detection can be made highly trustworthy through the principles of §5. Its judgement, that a given finding therefore warrants a funded initiative ahead of a competing one, is a weaker claim, dependent on context the audit does not hold. Repo-Audit's roadmap output is correctly understood as an evidence-chained proposal that makes the human decision defensible and fast, not as a substitute for it.
Open questions
- 01. The regress does not fully close. Self-accountability checks the audit's conduct against an explicit set of rules, but two gaps remain. First, the conduct record is produced by the audit process itself; a self-reported log is weaker evidence than an independently observed one. Second, who verifies that the rules themselves are the right ones? Section 6's evidence indicates that an overseer's reach is bounded, and the conformance rules are themselves a decaying artefact subject to the very failure modes this paper describes. Self-accountability narrows the trust gap; it does not close it.
- 02. Synthesis quality is hard to measure. The holism principle asserts that cross-cutting synthesis surfaces seam risks a siloed audit misses. We believe this, and the literature supports the mechanism, but there is no established benchmark for 'did the audit find the cross-cutting risks that mattered,' and constructing one is non-trivial precisely because the seam risks are the ones nobody enumerated in advance.
- 03. The audit/monitoring boundary. As argued in §7.1, cheap automated audits press on the distinction between periodic assessment and continuous monitoring. The right cadence for a freshness-aware audit, and whether, past some frequency, the audit should simply become a monitor, is unresolved.
- 04. No empirical evaluation yet. This paper is a problem analysis and design rationale. The central practical hypothesis (that audits built on the five principles are acted on more often than audits that are not) is plausible and motivates the design, but it is untested. A controlled evaluation is the most important item of future work.
- 05. Several principles are only partly realised. This paper states five principles and describes Repo-Audit as built around them; in candour, the system meets them to varying degrees today. Holism, and continuity as state-carrying handoff, are in place. Freshness grading is implemented at the level of sources but not yet propagated into a per-finding confidence score. The conformance layer checks orchestration invariants but does not yet adjudicate the substantive thoroughness of each analysis. The roadmap stage works from a curated set of recurring initiative themes rather than a slate fully emergent from each codebase's findings.
Conclusion
A codebase audit fails not when it overlooks a problem but when nothing changes after it lands — and that outcome is the rule, not the exception. This paper has argued that the inertness of the typical audit has four specific, separable causes. Three are long-documented properties of how software and its artefacts decay: the audit goes stale faster than it can be read, its siloed structure renders cross-cutting risk invisible, and it terminates one step short of the roadmap that would make it actionable. The fourth is new, and arrived with the automation of the auditor: an autonomous agent can fail silently and partially while still emitting a confident report, leaving the reader to trust an unobservable process.
The response is to stop treating an audit as a document and start treating it as an accountable process. Five principles follow: holism, so that findings are synthesised rather than concatenated; evidence-freshness grading, so that no conclusion outruns the currency of its evidence; roadmap termination, so that the audit crosses the gap from diagnosis to plan itself; self-accountability, so that the audit's own conduct leaves an inspectable, conformance-checked trail; and continuity under load, so that the audit reaches its end on purpose. Repo-Audit is an implemented audit skill built on these five.
The cross-disciplinary record is the reason to believe the design is sound rather than merely plausible. The problem of overseeing a delegated auditor was named in classical philosophy, priced by agency theory in economics, answered with the audit trail by accounting, and is now the active frontier of scalable-oversight research in AI safety. Across two thousand years, four disciplines confronted the same problem, and those that offered a remedy converged on the same one: an agent's work becomes trustworthy not when the agent is assumed reliable but when its conduct is made inspectable. An audit earns the right to redirect a team's effort only when the audit itself can be audited. That is the standard Repo-Audit is built to meet, and the standard this paper argues every automated audit should be held to.
Appendix A: Glossary
- Agency cost: In agency theory [31], the unavoidable cost of delegating work to an agent whose interests and information differ from the principal's: monitoring expenditure, bonding expenditure, and residual loss. An automated auditor's silent drift is an agency cost.
- Analytical silo: An audit structure in which each concern (security, architecture, testing, documentation) is assessed and reported separately, with no stage that synthesises findings across concerns.
- Conduct record: A complete, structured, inspectable log of an audit's own execution: the analyses it dispatched, the decisions it made, and the operational boundaries it encountered.
- Conformance check: A comparison of an audit's conduct record against an explicit model of correct conduct, used to detect deviations from the audit's prescribed orchestration. The term derives from process mining [24].
- Cross-cutting risk: A risk that exists only in the interaction between two or more concerns and is therefore invisible to any single-concern analysis. Also called seam risk.
- Evidence decay: The structural tendency of the artefacts an audit relies on (documentation, comments, design notes) to fall out of sync with the code over time [9, 10].
- Findings cliff: The discontinuity at which a conventional audit stops: it delivers a list of problems but not the roadmap that would make the problems actionable.
- Freshness grading: Grading every cited source by currency and surfacing that grade, so that no finding rests unflagged on evidence that has gone stale.
- Roadmap termination: A design commitment that an audit ends not in a findings list but in a set of strategic initiative briefs, each grounded in the findings that bear on it.
- Scalable oversight: In AI safety [27, 29], the problem of adequately supervising an agent whose work is too expensive, or too advanced, for an overseer to evaluate directly at every step.
- Self-accountability: The design commitment that an audit emits a conduct record and submits it to a conformance check, making the audit's own process inspectable.
- Unaccountable auditor: An autonomous audit process that can fail silently and partially (skipping steps, drifting, losing context, terminating early) while still emitting a confident, well-formed report.