The Token You Didn't Count: Tokenizer Opacity, the End of Flat Rate, and Cost Observability as a Governance Artifact

KellerAI

Section 01

Abstract

On 16 April 2026, Anthropic released Claude Opus 4.7 at the same headline rate as the model it replaced — five dollars per million input tokens and twenty-five dollars per million output tokens — while its migration documentation noted that the model's new tokenizer produces between one and roughly 1.35× the tokens of its predecessor for the same text, up to about 35% more billable units for an unchanged prompt. ¹ A flat rate card and a bill that can rise by a third are not a contradiction. They are a decoupling: the headline price a buyer watches and the effective cost a buyer pays are no longer the same number.

This paper makes cost the headline rather than a caveat. Where a companion KellerAI paper treats a model upgrade as a change-management event — asking whether the upgrade is safe to ship — this paper asks the question that comes after deployment: once you have shipped it, do you know what it costs? ² The two questions are complementary and the boundary is deliberate. This paper owns the economic-governance lane: tokenizer opacity, the structural end of flat-rate pricing, agentic loop amplification, and the instrumentation that makes effective cost visible.

The central claim is a convergence claim. Three structurally different cost shocks — a tokenizer redesign that raises token counts without touching the rate card, a platform-wide migration to usage-based billing, and the amplification of per-request cost by autonomous agent loops — all arrived in the same two months of 2026, and they share one consequence. The bill is now a function of a quantity the buyer neither directly controls nor reliably forecasts. When the rate card stops predicting the bill, the only thing that does is measurement, and measurement of effective cost per outcome becomes a governance artifact rather than a finance hygiene task.

The paper proceeds in three movements. First, it establishes the three cost vectors against primary vendor documentation and incident reporting. Second, it argues from FinOps research that cost observability has crossed from optional to load-bearing, and proposes — explicitly as a design, not a product — what a cost-observable system looks like. Third, it situates the argument against a regulatory overlay, prescribes three checks before any model upgrade, and closes with an Honest Limits section that is candid about where the evidence is a ceiling rather than a median.

For the leadership-level version — the load-bearing claim, the worked examples, and the three-line checklist — read the companion brief, The Token You Didn't Count .

Anthropic did not raise prices. Your bill may still grow. The headline price and the effective cost have decoupled — and cost observability, not a rate card, is what now tells you the difference.

The thesis

Section 02

The Tokenizer Is Not the Rate Card

A token bill is the product of two numbers: the price of a token and the count of tokens a request becomes. The rate card prices the first. The tokenizer — the component that splits text into the units a model bills against — determines the second. Most cost discipline watches the rate card, because the rate card is the number a vendor publishes, announces, and changes with notice. The tokenizer is the number a vendor can change silently inside a model upgrade, and that is exactly what happened in April 2026.

Claude Opus 4.7 shipped at the identical per-token rate as Opus 4.6. ¹ The migration guidance accompanying the release stated that the new model's tokenizer produces “1x–1.35x tokens” for the same input — up to roughly 35% more billable units — with the precise figure varying by content. ¹ For a workload of unchanged prompts, that is a price increase in every sense that matters to a budget, delivered through a channel no rate alert monitors.

The variability is not noise; it is structural, and it determines who is exposed. An independent FinOps analysis measured the same upgrade across three representative workload shapes and found per-request token growth spanning from approximately zero to the full 35%, with code, structured data such as JSON, and non-English text clustering at the upper end and prose at the lower. ³ A separate pricing comparison reached the same structural conclusion: the headline per-token price has ceased to be a reliable proxy for the cost of a request once a tokenizer change is in play, and effective cost must be computed per workload rather than read off a table. ⁴ A team whose traffic is heavily code or structured data sits near the ceiling of the range; a team whose traffic is conversational English may barely notice. The same rate card produces materially different bills.

The operational consequence is the absence of a signal. A rate change is an event: it has an announcement, an effective date, and a billing line that moves, any of which a finance process can latch onto. A tokenizer change is a non-event in every system tuned to watch price, because no price field changed. The bill rises, and nothing fired. The only reliable pre-migration signal is to run a representative sample of production requests through the candidate model's count_tokens endpoint and compare the token counts against the incumbent — an explicit audit that no standard billing dashboard performs unprompted. ¹

It is worth stating the limit of the evidence here, because the rest of the paper leans on it. The 35% figure is Anthropic's stated upper bound, not a measured median across real workloads; no independent median has been published, and a responsible reading treats 35% as a ceiling and the realized figure as workload-dependent and, for most teams, lower. ³ The argument does not require the ceiling to be the norm. It requires only that the effect be real, invisible to rate-based monitoring, and variable enough that a team cannot assume its own exposure without measuring it — all three of which the evidence supports.

A tokenizer change is a price increase that no alert system will catch. The rate card did not move, so nothing fired — and the bill rose anyway, by an amount only a token-count audit could have predicted.

The mechanism

Section 03

The End of Flat Rate

The tokenizer is the subtle cost vector. The second is structural and unsubtle: across the first half of 2026, the flat-rate subscription ceased to be the dominant commercial model for AI tooling. The shift was distributed across vendors and framed, in each case, as an ordinary product decision — which is precisely why it was easy to miss as a category change.

GitHub Copilot is the clearest signal because it retired a flat model outright. The platform announced that on 1 June 2026 it would discontinue its premium-request units in favor of consumption-based AI Credits priced at one cent per credit, with promotional credit grants to cushion the transition through the summer months. ⁵ A premium request was a unit a buyer could count and cap; an AI Credit is a unit of consumption whose total depends on how much work the model does on a buyer's behalf. The change moves the budgeting variable from a quantity the buyer sets to a quantity the workload determines. ⁶

The same window saw the pattern repeat across the market. GPT-5.5 launched at a rate reported at roughly double its predecessor's, shifting the per-token economics of a widely used model upward in a single release. ⁷ Windsurf introduced usage quotas, converting an effectively unbounded allowance into a metered one. ⁸ Anthropic closed flat-rate access for third-party agent frameworks in early April and, by mid-April, was reported to be moving enterprise pricing toward a hybrid of a per-seat fee plus usage — a model that preserves the predictability of a seat while reintroducing consumption as the variable that scales with use. ⁹

The structural point is the absence of a single announcement. No vendor declared the flat-rate era over; each made a local decision that, in aggregate, retired the predictable subscription as the default. The consequence for a buyer is concrete. An organization that built its 2025 budget on a flat seat price was budgeting a constant. The same organization in 2026 budgets a consumption figure — a quantity that, as Section 2 established, it does not directly control and, as Section 4 will argue, agentic workloads make actively hard to forecast. Flat rate was not merely cheaper for heavy users; it was predictable for everyone. Its disappearance removes the predictability, which is the property a budget is built on.

The flat-rate era ended in the first half of 2026. No single vendor announced it. It happened — one product decision at a time — and it took the predictability of the budget with it.

The structural shift

Section 04

Agentic Amplification

The third cost vector is the multiplier that turns a per-request increase into a budget event. A single model call has a bounded cost. An autonomous agent does not make a single call; it runs a loop — plan, act, observe, revise — and every turn of that loop is a billable model invocation. Crucially, the loop is billed by consumption, not by outcome: an unproductive trajectory that wanders, retries, and backtracks costs exactly what a clean trajectory costs per turn, and frequently costs more in total because it takes more turns. The cost of agentic work is therefore not the cost of a task; it is the cost of however many turns the agent took to attempt the task, whether or not the attempt succeeded.

The failure mode is already in the open. Uber expanded its use of AI coding tools to thousands of engineers from late 2025 onward and exhausted its full-year 2026 AI budget by April — with the company's chief operating officer publicly conceding that it was “hard to draw a line” from the spend to specific shipped features. ¹⁰ That admission is the whole problem in one sentence. When spend cannot be attributed to outcomes, the organization is buying turns, not results, and it has no instrument to tell the difference. Industry analysis of agentic-coding spend finds that per-developer cost can vary by an order of magnitude for nominally similar work, because the variation lives in the loop behavior rather than in the headline rate. ¹¹

The pressure shows up symmetrically where organizations pull back. A large vendor reportedly cancelled most internal agent-coding licenses for an engineering division rather than absorb open-ended consumption, a decision that reads as a rational response to a cost it could not forecast or cap. ¹² And the trajectory is toward more amplification, not less: industry forecasts anticipate agent usage growing by an order of magnitude over the next several years, which compounds every per-request and flat-rate effect already described. ¹³ We note that this forecast is a vendor-cited projection rather than a measured trend, and treat it as directional.

The three vectors interact, and the interaction is the danger. A tokenizer change raises the cost of each turn; the end of flat rate removes the cap that would have bounded the total; and agentic amplification multiplies the number of turns. A team exposed to all three simultaneously — a code-heavy agentic workload on a newly repriced, newly metered platform — experiences not the sum of the three effects but their product. The convergence is what makes early 2026 a structural moment rather than a sequence of unrelated price notices.

Every loop turn is billable, and an unproductive path costs the same as a productive one. The cost of agentic work is the cost of the turns it took, not the value of the task it finished.

The multiplier

Section 05

Why Cost Observability Is a Governance Artifact

When the rate card no longer predicts the bill, the predictive burden falls on measurement. This is not a finance preference; it is a structural necessity created by the three cost vectors, and the industry has recognized it at scale. The FinOps Foundation's State of FinOps 2026 survey — drawing on more than a thousand organizations collectively managing tens of billions of dollars of cloud and AI spend — found that managing AI spend had moved from a minority practice to a near-universal one in the space of a year, and that it now ranks as the top priority for the discipline. ¹⁴ A concern that nearly every surveyed organization shares is no longer an optimization; it is a governance baseline.

The specific capability organizations are asking for names the problem precisely. The top tooling request in the same research is granular, per-token, per-model, and per-GPU monitoring — the ability to see consumption at the resolution where the cost vectors actually operate, rather than as an aggregate line on a monthly invoice. ¹⁵ The reason is stated plainly in the supporting analysis: AI spend has become functionally hard to forecast, because the quantity that drives it — tokens consumed, weighted by a tokenizer the buyer does not control, multiplied by loop turns the workload determines — is not knowable in advance from the rate card. ¹⁶

That this is a governance problem rather than a budgeting one becomes clear when the spend is contextualized. Survey data indicates that a large majority of organizations exceeded their cloud budgets in the relevant period, a pattern that predates AI but that AI's unforecastable consumption sharpens. ¹⁷ A budget overrun is a finance event. The inability to attribute spend to outcomes — Uber's “hard to draw a line” — is a governance failure, because it means the organization cannot answer whether its AI spend is producing value, cannot hold a team accountable for the cost of a feature, and cannot make an informed decision about whether to ship a model upgrade that quietly raises the cost of every request.

The connective tissue is cost per outcome. A token count is a measurement; a dollar figure is a measurement; neither answers the governance question on its own. The governance question is whether the cost was worth it, and that requires linking consumption to the unit of value the organization actually cares about — a resolved support ticket, a completed agentic task, a shipped feature. Cost per outcome is the artifact that connects tokens to value, and it is precisely the artifact that aggregate spend reporting cannot produce. ¹⁶ This is where cost observability joins the broader governance argument KellerAI has made elsewhere: just as a system must be able to validate and audit its own behavior, it must be able to audit its own economics, and the same discipline that demands a record of what an agent did demands a record of what the agent's actions cost.

A budget overrun is a finance event. Being unable to say whether your AI spend produced value is a governance failure. Cost per outcome is the artifact that tells the two apart.

The governance crossing

Section 06

What a Cost-Observable System Looks Like

The remainder of this section proposes a design. We are explicit that it is a design and not a product: it describes the layers a cost-observable system needs, drawn from what existing FinOps tooling already does and does not do, rather than endorsing any vendor. The gap it addresses is well documented — standard dashboards report aggregate spend, not effective cost per request, and treat the rate card as the cost rather than as one factor in it. ¹⁸ Some platforms have begun to close part of the gap; cost-management tools that integrate a provider's usage and cost API can surface consumption at a finer grain than an invoice. ¹⁹ The design below describes the full shape such a system would need to take to make effective cost per outcome a first-class figure.

Three layers compose into cost observability. Each addresses one of the places where standard reporting loses the signal.

01Per-request instrumentation. Capture the actual input and output token counts of every model invocation at the moment it happens, not the estimated or rate-card cost. This is the layer that makes a tokenizer change visible, because it records the count rather than inferring it from a price. Without it, an upgrade that raises token counts by a third is indistinguishable from one that did nothing, since the rate card is the same in both cases.
02Per-model normalization. Compute effective cost — token count multiplied by the model's actual rate — rather than reporting the rate card. When a workload switches models, or a model's tokenizer changes underneath it, only a normalized effective-cost figure is comparable across the boundary. This is the layer that lets an organization compare the cost of the same work before and after a migration on a like-for-like basis.
03Outcome linkage. Attach every unit of spend to the outcome it served — a ticket identifier, a task identifier, a feature identifier — so that cost per resolved ticket, cost per completed task, and cost per shipped feature can be computed directly. This is the layer that answers the governance question, and it is the one standard dashboards omit, because aggregating spend by model or by team is straightforward while attributing it to value requires the system to carry the outcome identifier through to the billing record.

Cost observability is hard for reasons the design must respect rather than wish away. Analyses of LLM cost measurement identify recurring obstacles: the non-determinism of model output makes per-request cost variable even for identical inputs; data gaps between the application and the billing record break attribution; allocating shared infrastructure across teams is genuinely ambiguous; and model-switching — routing a request to whichever model is cheapest or most capable at the moment — complicates any per-model normalization. ²⁰ A design that ignores these obstacles is a wish list; the three-layer structure above is offered as the minimum that survives them, not as a claim that the problem is easy.

The single highest-leverage offset deserves its own note, because it interacts with the tokenizer vector in a way that is easy to miss. Prompt caching bills repeated context at a fraction of fresh input — cache reads are dramatically cheaper than cache writes — and at scale the savings are large; one widely reported figure has a major agentic coding workload saving hundreds of millions of tokens per week through caching. ²¹ The interaction is the trap. A tokenizer change can alter how text is split, which can invalidate the cache prefixes a workload relied on, which silently removes the caching discount at the same moment token counts rise. The two effects compound, and a cost-observable system must measure the cache hit rate across a migration specifically so that a lost discount is not mistaken for a tokenizer effect, or worse, not noticed at all. We flag that the cache-prefix-invalidation mechanism is logically sound but inferred rather than confirmed in vendor documentation, and treat it as a hypothesis to test rather than a settled fact.

Per-request instrumentation makes the tokenizer visible; per-model normalization makes migrations comparable; outcome linkage answers the governance question. The third layer is the one dashboards omit.

The design discipline

Section 07

The Regulated Overlay

For a regulated organization, a silent tokenizer change is not only a budget surprise; it is a change to a system parameter that several frameworks already oblige the organization to monitor. The framing in this section is deliberate and inferential: these obligations attach to systems that incorporate a model, not to model pricing as a regulated object, and we present the mappings as obligations to verify against current text rather than as settled determinations. We also keep the overlay distinct from the quality-and-safety overlay a companion paper develops: cost is a performance dimension, not a safety property, and conflating the two would overclaim.

For financial institutions, the model-risk-management expectations of SR 11-7 require ongoing monitoring of a model in production, including monitoring of changes to its operational parameters. ²² A tokenizer redesign that changes the relationship between an input and its billable representation is precisely such an operational-parameter change, and a monitoring regime that watches only the rate card would miss it. The NIST AI Risk Management Framework's Measure and Manage functions treat cost as a dimension of performance that an organization is expected to measure and manage; a 35% jump in effective cost from a model change is, in that frame, a measurable deviation that the Manage function exists to catch. ²³

Management-system and change-control frameworks reach the same place by a different route. ISO/IEC 42001, the AI management-system standard, contemplates change controls over the AI systems an organization operates, and a model upgrade that materially alters cost is a change those controls are meant to govern. ²⁴ Under SOC 2, the Common Criteria for change management at CC8.1 address the authorization and tracking of changes to a system; an upgrade that alters the system's cost profile is a change the criterion expects to be managed rather than absorbed silently. ²⁵

The hardest mapping is the most interesting, and we leave it deliberately open. For systems in scope of the EU AI Act, the question is whether a tokenizer redesign constitutes a “substantial modification” under the change-management provisions of Articles such as 72 and 25 — a determination that would carry documentation and conformity obligations. ²⁶ We do not assert that it does. We assert that it is a question a regulated operator must document the analysis of, rather than assume away, because a change that alters the system's economics by up to a third without a corresponding change to the published rate is exactly the kind of quiet modification a change-management regime exists to surface. The obligation, in each of these frames, reduces to the same discipline this paper argues for on economic grounds: measure the effective cost, attribute it, and treat a material change as the event it is.

Section 08

Three Checks Before a Model Upgrade

The three cost vectors each map to a check, and the three checks compose into a pre-migration discipline. None is exotic; each is the FinOps practice already applied to cloud spend, pointed at the specific mechanisms this paper has named. We frame them as recommended discipline rather than as a product or a guarantee.

01Token-density audit. Before migrating, run a representative sample of production requests through the candidate model's count_tokens endpoint and compare the token counts against the incumbent. Measure effective cost — count multiplied by rate — not the per-token price, because the price is the part that did not change. This is the only audit that surfaces a tokenizer increase before it appears on a bill, and it is cheap relative to the spend it protects.
02Agentic amplification audit. For any agentic workload, measure cost per completed task rather than cost per call, on the candidate model, before committing. The Uber failure mode — a budget consumed by turns that did not draw a line to features — is invisible at the per-call level and obvious at the per-task level. An upgrade that is cheaper per call but takes more turns to complete a task can be more expensive per outcome, and only the per-task measurement reveals it.
03Caching offset measurement. Measure the cache hit rate before and after the upgrade, because a tokenizer change can invalidate the cache prefixes a workload depends on and silently erase the caching discount at the same moment token counts rise. Treat a drop in cache hit rate as a first-class migration finding, not a footnote, since it compounds the per-request increase rather than offsetting it.

The three checks share a premise with the system design in Section 6: each requires per-request instrumentation, because each measures a token count rather than reads a rate card. An organization that has built the three observability layers can run all three checks as a matter of course before any upgrade; an organization that has not will discover the cost after it ships, which is the situation this paper exists to prevent. The checks are the operational form of the thesis: the rate card no longer predicts the bill, so measure the bill the only way that works — by counting the tokens and attaching them to the outcome — before the upgrade rather than after.

Audit the token density, measure cost per completed task, and check the cache hit rate — before you ship. The rate card will tell you nothing; the token count will tell you everything.

The pre-migration discipline

Section 09

Honest Limits

A paper that argued for measurement while overstating its own evidence would fail its own discipline. Here is what this paper does not establish.

The 35% figure is a ceiling, not a median. It is Anthropic's stated upper bound for the Opus 4.7 tokenizer's effect, and no independent median across real workloads has been published. ¹ The one independent measurement we cite found a range from roughly zero to the full 35%, which confirms the mechanism and its variability but does not establish a typical value. ³ We have written “up to 35%” throughout deliberately, and the argument does not depend on the ceiling being the norm — only on the effect being real, invisible to rate-based monitoring, and variable enough to require measurement.

Several pricing data points are secondary and flagged for verification. The GPT-5.5 rate we cite comes from a secondary source rather than from the provider's own pricing page, and should be verified against the primary before being relied upon. ⁷ The Anthropic enterprise pricing pivot is attributed to reporting by a paywalled outlet, relayed through secondary coverage; we cite the secondary coverage and note the chain of attribution rather than presenting it as confirmed first-party guidance. ⁹ The illustrative figure of a large, uncapped overrun that circulated in early 2026 is unattributed in its original reporting, and we have deliberately declined to name a company or treat the figure as verified.

Two technical claims are inferred rather than confirmed. The forecast of order-of-magnitude growth in agent usage is a vendor-cited projection, and we treat it as directional rather than as measured fact. ¹³ The cache-prefix-invalidation mechanism — that a tokenizer change can break the cache prefixes a workload relies on — is logically sound given how prefix caching and tokenization interact, but we have not confirmed it in vendor documentation, and we present it as a hypothesis a migration audit should test rather than as an established behavior. ²¹

The proposed cost-observable system in Section 6 is a design, not a product. We have described the layers such a system needs and the obstacles a real implementation must survive, drawing on what existing tooling does and does not do, but we have not built it, benchmarked it, or endorsed any vendor as its embodiment. ¹⁸ A reader should treat it as a specification of the problem's requirements, against which to evaluate tools, rather than as a claim that the requirements are met by anything in particular today.

The regulatory mappings in Section 7 are inferential and fact-specific. Whether SR 11-7's ongoing-monitoring expectations, the NIST AI RMF's Manage function, ISO/IEC 42001's change controls, SOC 2's CC8.1, or — most open of all — the EU AI Act's substantial-modification provisions apply to a particular system, and whether a tokenizer change triggers any of them, is a question that turns on the system's classification and the current text of each framework. ²⁶ We have framed each mapping as an obligation to verify, not as a determination this paper can make on a reader's behalf, and we have kept the cost overlay distinct from the quality-and-safety overlay rather than claiming a unified compliance conclusion.

The contribution of this paper is narrower and more durable than any single figure in it. It is the diagnosis that three structurally different cost shocks converged in early 2026, that their convergence decoupled the headline price from the effective cost, and that the decoupling makes cost observability — effective cost per outcome, measured and attributed — a governance artifact rather than a finance convenience. Every specific number may be revised as primary sources firm up; the structural claim survives the revision, because it rests on the mechanism rather than on the magnitude.

For the short, leadership-level version of this argument — the load-bearing claim, the worked examples, and the three-line checklist — read the companion brief, The Token You Didn't Count .

Not that we know the median — we do not — but that the mechanism is real, the rate card cannot see it, and the only instrument that can is the one this paper argues you now need: cost per outcome, measured before you ship.

The honest promise

The Token You Didn't Count

Context

The Finding