The Token You Didn't Count

KellerAI

Section 01

The Bill That Grew Without a Price Increase

On 16 April 2026, Anthropic shipped Claude Opus 4.7 at the same headline price as the model it replaced: five dollars per million input tokens, twenty-five dollars per million output tokens. The rate card did not move. The accompanying migration documentation noted that the new model's tokenizer produces, for the same text, between 1.0× and roughly 1.35× as many tokens as its predecessor's: up to about 35% more billable units for an unchanged prompt. A price that stayed flat and a bill that can rise by about a third are not a contradiction. They are the gap this paper is about.

The gap is not an Anthropic story; it is an industry story that happened to surface there first. In the same quarter, GitHub Copilot announced the retirement of its flat premium-request model, GPT-5.5 arrived at roughly double its predecessor's rate, Windsurf imposed usage quotas, and Anthropic moved enterprise pricing toward a per-seat-plus-usage hybrid. And the cost of the change is no longer hypothetical: by April 2026, Uber had exhausted its full-year AI budget — its chief operating officer conceding it was “hard to draw a line” from the spend to specific shipped features.

Anthropic did not raise prices. Your bill may still grow. When the tokenizer changes, the rate card you are watching is no longer the cost you are paying.

The load-bearing claim

Section 02

The Tokenizer Is Not the Rate Card

A rate card prices a token. A tokenizer decides how many tokens a given piece of text becomes. Watch only the first and you will miss a change in the second entirely, which is precisely what a same-price model upgrade can deliver. Opus 4.7 kept the rates of Opus 4.6 while shipping a redesigned tokenizer, and the model's own migration guide quantifies the effect as “up to ~35% more” tokens, varying by content.
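
The arithmetic can be sketched in a few lines of Python. The rates are the ones quoted in this paper; the request size, the `token_multiplier` parameter, and the choice to apply it equally to input and output tokens are assumptions made for illustration, not measured behavior.

```python
# Rates quoted in the text: USD per million tokens, unchanged by the upgrade.
INPUT_RATE_PER_M = 5.00
OUTPUT_RATE_PER_M = 25.00


def request_cost(input_tokens: int, output_tokens: int,
                 token_multiplier: float = 1.0) -> float:
    """Cost in USD of one request.

    token_multiplier is a hypothetical knob: 1.0 models the old tokenizer,
    1.35 models the documented ceiling. Applying it equally to input and
    output tokens is an assumption of this sketch.
    """
    billed_input = input_tokens * token_multiplier
    billed_output = output_tokens * token_multiplier
    return (billed_input * INPUT_RATE_PER_M
            + billed_output * OUTPUT_RATE_PER_M) / 1_000_000


# Same text, same rate card, two tokenizers (illustrative request size).
before = request_cost(20_000, 2_000)
after = request_cost(20_000, 2_000, token_multiplier=1.35)
print(f"before: ${before:.4f}  after: ${after:.4f}  "
      f"change: {after / before - 1:.0%}")
```

Neither rate constant changes between the two calls; the whole difference comes from the token count, which is the quantity a rate-card watcher never sees.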

Independent analysis confirms the mechanism and its variability. One FinOps vendor measured the same upgrade across three workload shapes and found per-request token growth ranging from roughly zero to the full 35%, with code, structured data, and non-English text clustering at the high end. A separate pricing comparison reached the same conclusion: the headline per-token price is no longer a reliable proxy for the cost of a request once the tokenizer is in play. The 35% figure is a ceiling, not a measured median — we treat it as an upper bound throughout, and the in-depth companion is explicit that no independent median has been published.

The operationally important detail is the absence of a signal. A rate change triggers a billing alert; a tokenizer change does not, because no price field moved. The only way to get a signal before migrating is to run a production sample through the new model's count_tokens endpoint and compare the token counts directly with the old model's. No standard billing dashboard performs that audit on its own.
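
A minimal sketch of such an audit follows. The names `tokenizer_audit`, `count_old`, and `count_new` are hypothetical: in practice each callable would wrap one request to the vendor's token-counting endpoint for one model, and the counts in the example are invented stand-ins rather than measurements.

```python
from statistics import median
from typing import Callable, Iterable


def tokenizer_audit(
    prompts: Iterable[str],
    count_old: Callable[[str], int],
    count_new: Callable[[str], int],
) -> dict:
    """Compare per-prompt token counts under two tokenizers.

    count_old and count_new are hypothetical callables supplied by the
    caller; each should return the token count of a prompt for one model.
    """
    ratios = []
    total_old = total_new = 0
    for prompt in prompts:
        old, new = count_old(prompt), count_new(prompt)
        if old == 0:
            continue  # skip empty prompts to avoid dividing by zero
        ratios.append(new / old)
        total_old += old
        total_new += new
    if not ratios:
        raise ValueError("no non-empty prompts in sample")
    return {
        "samples": len(ratios),
        "median_ratio": median(ratios),
        "max_ratio": max(ratios),
        "aggregate_ratio": total_new / total_old,
    }


# Invented counts standing in for real endpoint responses.
old_counts = {"plain prose": 1000, "json payload": 800, "source code": 500}
new_counts = {"plain prose": 1000, "json payload": 1040, "source code": 600}

report = tokenizer_audit(
    old_counts.keys(),
    lambda p: old_counts[p],
    lambda p: new_counts[p],
)
print(report)
```

Reporting the median, the maximum, and the aggregate ratio together is a design choice: the aggregate drives the bill, while the spread shows which workload shapes sit near the ceiling.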

A tokenizer change is a price increase that no alert system will catch. The rate card did not move, so nothing fired — and the bill rose anyway.

The mechanism

Section 03

The End of Flat Rate

The tokenizer is one of three cost shocks that converged in the first half of 2026, and the second is structural: the flat-rate era ended. GitHub Copilot announced that on 1 June 2026 it would retire its premium-request units in favor of consumption-based AI Credits priced at one cent each, cushioning the transition with promotional credits through the summer. In the same window, GPT-5.5 launched at a rate reported at roughly double its predecessor's, Windsurf introduced usage quotas, and Anthropic closed flat-rate access for third-party agent frameworks before moving enterprise pricing toward a per-seat-plus-usage hybrid.

What makes this a structural shift rather than a string of unrelated repricings is that no single vendor announced it as a category change. Each move was framed as a product decision; collectively they retired the predictable subscription as the dominant commercial model for AI tooling. A buyer who budgeted on a flat seat price in 2025 now budgets on consumption — a quantity that, as the tokenizer section showed, the buyer does not directly control and cannot reliably forecast.

The flat-rate era ended in the first half of 2026. No single vendor announced it. It happened — one product decision at a time.

The structural shift

Section 04

Agentic Amplification, and Three Things to Measure

The third shock is the one that turns a per-request increase of up to 35% into a budget event: agentic amplification. An autonomous agent does not make one model call; it loops, and every turn of the loop is billable regardless of whether the turn made progress. Uber's expansion to thousands of engineers on AI coding tools from late 2025 onward, and its exhaustion of a full-year budget by April, is the failure mode in the open: per-developer spend can vary by an order of magnitude, and an unproductive trajectory costs exactly what a productive one does. The same pressure shows up where organizations pull back: one large vendor reportedly cancelled most internal agent-coding licenses for a division rather than absorb open-ended usage.

Three measurements convert this from a surprise into a managed cost. None is exotic; each is the FinOps discipline already applied to cloud spend, pointed at the cost vectors this paper has named.

01 Effective cost per request, not price per token. Instrument the actual token count of production requests against the new model before and after a migration. The rate card is a constant; the token count is the variable, and only the product of the two is your bill. A count_tokens audit on a representative sample is the single signal a tokenizer change will otherwise hide.
02 Cost per completed task, not cost per call. In an agentic loop every turn is billable, and an unproductive path costs the same as a productive one. Measure the cost of a resolved ticket, a finished feature, or a completed task rather than the cost of an individual model call, because that is the unit that connects spend to value and surfaces the runaway loops that exhausted a full-year budget in four months.
03 Cache hit rate, before and after. Prompt caching is the highest-leverage offset available, with cache reads billed at a fraction of fresh input. A tokenizer change can invalidate cache prefixes, silently erasing that discount at the same moment token counts rise. Measure the cache hit rate across any migration so the two effects do not compound unseen.
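
The second and third measurements can be sketched over a log of model calls. The `Call` record, its field names, and the sample numbers are hypothetical; a real pipeline would populate them from whatever usage data the provider returns per request.

```python
from dataclasses import dataclass


@dataclass
class Call:
    task_id: str
    cost_usd: float           # billed cost of this model call
    cached_input_tokens: int  # input tokens served from the prompt cache
    fresh_input_tokens: int   # input tokens billed at the full input rate


def cost_per_completed_task(calls: list[Call], completed: set[str]) -> float:
    """Total spend, abandoned tasks included, divided by tasks that finished."""
    if not completed:
        raise ValueError("no completed tasks; cost per task is undefined")
    return sum(c.cost_usd for c in calls) / len(completed)


def cache_hit_rate(calls: list[Call]) -> float:
    """Share of all input tokens that were read from cache."""
    cached = sum(c.cached_input_tokens for c in calls)
    total = cached + sum(c.fresh_input_tokens for c in calls)
    return cached / total if total else 0.0


# Invented log: task A finishes in 3 calls, task B loops 12 times and is abandoned.
calls = (
    [Call("A", 0.10, 8_000, 2_000) for _ in range(3)]
    + [Call("B", 0.10, 0, 10_000) for _ in range(12)]
)
completed = {"A"}

per_call = sum(c.cost_usd for c in calls) / len(calls)
per_task = cost_per_completed_task(calls, completed)
hit_rate = cache_hit_rate(calls)
print(f"cost per call: ${per_call:.2f}")
print(f"cost per completed task: ${per_task:.2f}")
print(f"cache hit rate: {hit_rate:.0%}")
```

Charging abandoned work to the tasks that did complete is one possible convention, chosen here so that a runaway loop shows up in the per-task figure even though the per-call figure stays flat.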

The companion in-depth paper develops each cost vector against its evidence base (the vendor documentation, the incident record, and the FinOps research showing that nearly every surveyed organization now manages AI spend as a first-class concern) and proposes what a cost-observable system looks like. For the tokenizer mechanics, the flat-rate timeline, and the proposed instrumentation in full, read the in-depth companion.

Section 05

The Point

The FinOps Foundation's 2026 survey of more than a thousand organizations found that managing AI spend had moved from a minority concern to a near-universal one, and that granular token, model, and GPU monitoring topped the list of requested capabilities — because AI spend, in the words of that research, has become functionally hard to forecast. That is the industry naming the same gap from the other side: when the rate card no longer predicts the bill, the only thing that does is measurement.

The right unit of measurement is the cost of an outcome, not the cost of a call. Cost per resolved ticket, cost per completed agentic task, cost per shipped feature — these are the figures that survive a tokenizer redesign and a pricing-model change, because they are keyed to value delivered rather than to a rate card that no longer means what it used to. Standard dashboards report aggregate spend; effective cost per outcome is the artifact a governed AI program needs and most do not yet produce.

Three structurally different cost shocks (a tokenizer redesign, a platform-wide move to usage-based billing, and agentic loop amplification) all arrived in the first half of 2026, and they share one consequence: the headline price and the effective cost have decoupled. Cost observability is no longer a finance hygiene task. It is a governance artifact: the instrument that tells you, after you ship a model upgrade, what it actually costs.

The headline price and the effective cost have decoupled. Cost observability — cost per resolved ticket, per task, per feature — is the governance artifact that closes the gap.

The point