The Robustness Illusion

KellerAI

Section 01

“Does not crash” is not “works correctly”

Most teams measure the health of a production system by one number: how often it falls over. A clean dashboard, no 500s, no pages overnight — the system is fine. But “the process kept running” and “the process did the right thing” are two different claims, and a great deal of software satisfies the first while quietly failing the second. The illusion is in mistaking the absence of a crash for the presence of correctness. Once you start reading code with that distinction in mind, a scattered set of habits resolves into a single, recognizable pattern: failure is happening, and the system has been built to keep that failure to itself.

Section 02

The four ways code buries a failure

Silent failure is not one mistake; it is a small family of them, and they sort neatly by how far each one pushes a problem away from where it could still be fixed. The first is the exception firewall: a broad except clause catches everything, writes a log line, and lets the function return as if nothing went wrong. The second is log-and-continue: the same move applied to a startup invariant, so the process boots in a broken state and fails confusingly half an hour later. The third is the Optional lie: a function whose return type uses one empty value to mean both “this record does not exist” and “the lookup failed,” collapsing two very different situations into one. The fourth is the deferred visible failure: the system tells the user that something broke but not what, wrapping a generic apology around an exception whose real cause is now lost.
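To make the first and third patterns concrete, here is a minimal Python sketch of the anti-pattern itself. Every name in it (USERS, fetch_email, find_email) is hypothetical and invented for illustration; passing None as the store is only a stand-in for an unreachable backend.

```python
import logging
from typing import Optional

logger = logging.getLogger("robustness_illusion_sketch")

# Hypothetical in-memory store used only for this illustration.
USERS = {"ada": "ada@example.com"}


def fetch_email(store: Optional[dict], name: str) -> str:
    """Raise ConnectionError on an outage, KeyError if the record is missing."""
    if store is None:  # stand-in for an unreachable backend
        raise ConnectionError("store unreachable")
    return store[name]


def find_email(store: Optional[dict], name: str) -> Optional[str]:
    """Anti-pattern: an exception firewall combined with the Optional lie."""
    try:
        return fetch_email(store, name)
    except Exception:  # catches the outage and the missing record alike
        logger.error("lookup failed for %s", name)
        return None  # the caller cannot tell which of the two happened


missing = find_email(USERS, "grace")  # the record genuinely does not exist
outage = find_email(None, "ada")      # the record exists, but the lookup failed
```

Both calls hand the caller the same None. Any code downstream that treats None as “no such user” will now act on an outage as if it were a fact about the data.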

Section 03

When the fallback bypasses authentication

The pattern stops being a hygiene problem at its sharpest point. In a real codebase, a function resolves how to clone a linked GitHub repository. When the data needed to build credentials is incomplete, the function logs an error and falls back to cloning the raw URL with no credentials at all. If the repository happens to be public, the clone succeeds, the workspace prepares, the dashboard shows a green check — and nothing records that the system just skipped its own authentication step. An error path was quietly repurposed as a fallback path, and the fallback was less restrictive than the path it replaced. The single log line was asked to do two jobs at once: be the audit trail and be the response. When one line does both, the response always wins and the audit is read by no one.
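The shape of that function can be sketched in a few lines of Python. This is not the code from the codebase described above; RepoLink, its token field, CredentialsUnavailable, and the credentialed URL format are all assumptions made for illustration. The sketch only contrasts a fallback that fails open with one that fails closed.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


@dataclass
class RepoLink:
    url: str
    token: Optional[str]  # hypothetical field; None models incomplete credential data


class CredentialsUnavailable(Exception):
    """Raised when credentials cannot be built; no unauthenticated fallback exists."""


def _with_token(link: RepoLink) -> str:
    # Illustrative URL format only: embed the token in the authority component.
    parts = urlsplit(link.url)
    netloc = f"x-access-token:{link.token}@{parts.netloc}"
    return urlunsplit(parts._replace(netloc=netloc))


def clone_url_fail_open(link: RepoLink) -> str:
    """The pattern described above: the error path doubles as a fallback."""
    if link.token is None:
        # A log line would go here, and then the raw URL is used anyway.
        return link.url
    return _with_token(link)


def clone_url_fail_closed(link: RepoLink) -> str:
    """Refuse to continue when the authentication step cannot complete."""
    if link.token is None:
        raise CredentialsUnavailable(f"cannot build credentials for {link.url}")
    return _with_token(link)
```

With incomplete data, the first function returns a URL that clones successfully for any public repository, while the second forces the caller to decide what an unauthenticated clone should mean.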

Section 04

Why good engineers write this code

Nobody writes a broad except clause because they enjoy hiding bugs. The patterns emerge from rational pressure. A visible 500 looks worse on the dashboards management watches than a stale field or an orphaned file nobody notices for weeks, so engineers are rewarded for making failures invisible. A demo cannot crash, so a try/except goes in the week before, and the “real” handling never arrives. A defensive pattern that once worked gets copied to the next similar-looking problem. And the easy type — the Optional — is one keystroke, while a type that carries the failure case honestly takes a library and an argument in code review. Each choice is reasonable on its own. The anti-pattern is what they add up to.

Section 05

Errors as values, not events

Other traditions made the opposite choice and are treated as models of reliability for it. Erlang's “let it crash” lets a process fail loudly so a supervisor, which knows how to recover, can act on it. Go, by convention, returns an error value from every function that can fail, so a caller who wants to ignore the error has to discard it explicitly. Rust wraps fallible results in a Result type that the compiler warns about when it is left unused, so an error cannot be dropped without a visible decision in the code. The shared insight is that errors should be values the calling code can inspect, classify, and act on, not events that happen off to the side. The goal is never to maximize failures; it is to maximize their visibility at the earliest point where the system still has the context to do something useful.
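The same idea can be approximated in Python without any library. The sketch below is one possible shape, not a standard API: Ok, Err, parse_ratio, and describe are hypothetical names, and the kind field is an assumed convention for carrying a classification alongside the error.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Ok:
    value: float


@dataclass(frozen=True)
class Err:
    kind: str    # machine-readable classification the caller can branch on
    detail: str  # human-readable explanation


Result = Union[Ok, Err]


def parse_ratio(text: str) -> Result:
    """Parse 'a/b' into a float, returning failures as values."""
    try:
        num, den = text.split("/")
        n, d = float(num), float(den)
    except ValueError:  # wrong number of parts, or a part that is not a number
        return Err("malformed", f"expected 'a/b', got {text!r}")
    if d == 0:
        return Err("zero_denominator", f"cannot divide by zero in {text!r}")
    return Ok(n / d)


def describe(result: Result) -> str:
    """The caller has to look at which case it received before using it."""
    if isinstance(result, Ok):
        return f"ratio={result.value}"
    return f"error[{result.kind}]: {result.detail}"
```

Because the failure arrives as an ordinary return value with a classification attached, the caller can inspect it and choose a response where the context still exists, instead of finding out later from a log.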

Section 06

The fix is small; the shift is cultural

None of this calls for an architectural rewrite. An exception firewall is fixed by recording a structured incident and then re-raising, so the caller still gets the error it needed. A broken startup invariant should refuse to boot, surfacing during deploy instead of mid-traffic. An Optional that hides a failure should be replaced with a type or a specific exception that keeps “not found” and “failed” distinct. A generic failure message should carry a classification so the user is told what actually went wrong. And a security check that cannot complete should fail closed, never fall through. Each remediation is a handful of lines. The hard part is the cultural shift: a code review that treats the absence of a structured failure record as a defect, not a style preference. “Does not crash” is the lowest bar in software. Production systems deserve a higher one.
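Two of these remediations, recording a structured incident before re-raising and refusing to boot on a broken invariant, might look like the following sketch. The incident fields, the function names, and the DATABASE_URL setting are assumptions chosen for illustration, not a prescribed schema.

```python
import json
import logging
from typing import Callable, Mapping

logger = logging.getLogger("incidents")

# Hypothetical list of settings the process cannot run without.
REQUIRED_SETTINGS = ("DATABASE_URL",)


def record_incident(operation: str, exc: BaseException) -> dict:
    """Write a structured failure record that can be queried later."""
    incident = {
        "operation": operation,
        "error_type": type(exc).__name__,
        "message": str(exc),
    }
    logger.error(json.dumps(incident))
    return incident


def sync_profile(fetch: Callable[[], dict]) -> dict:
    """Record the failure, then re-raise so the caller still gets the error."""
    try:
        return fetch()
    except Exception as exc:
        record_incident("sync_profile", exc)
        raise


def check_startup_invariants(env: Mapping[str, str]) -> None:
    """Refuse to boot when a required setting is absent or empty."""
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise RuntimeError(f"refusing to start: missing {', '.join(missing)}")
```

Calling check_startup_invariants(os.environ) at the top of the entry point would surface a missing setting during deploy. The bare raise in sync_profile keeps the original exception and traceback intact for whoever is positioned to handle it.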

Section 07