Abstract
At the frontier tier, the benchmarks that procurement teams used to rank models have converged. On GPQA Diamond, the top cluster sits within a single point and the benchmark is described as nearing saturation. 1 When the scores stop separating vendors, the procurement signal that organized the last buying cycle — “which model scores best?” — loses its resolution. This paper makes a single structural argument: convergence does not remove switching cost, it relocates it into the agentic toolchain, the billing integration, and the institutional memory that no quality benchmark measures.
The argument has two halves. The first is empirical and well supported: the convergence on procurement-grade benchmarks is real, and the differentiation that remains has moved to agentic capability, tokenizer economics, and ecosystem depth. 6 The second is a causal inference, and we label it as one throughout: that convergence on the measured axis, combined with switching cost on the unmeasured axis, leaves procurement teams choosing on incomplete information. The central survey evidence is strong — nearly 90% of executives believed switching was fast, while a majority who tried found it was not — but the link from benchmark convergence to a specific bad decision is an inference, not a documented case study. 3
The paper proceeds in three movements. First, it establishes that the convergence is real and precisely scoped, and that differentiation has moved rather than disappeared. Second, it locates the instrument gap: quality-only evaluation programs measure what no longer discriminates, and switching cost has moved to an address the benchmark does not report. Third, it derives a procurement posture — agentic reliability on your own workflow, a switching-cost inventory before commitment, and the agentic toolchain treated as infrastructure — and closes with an explicit Honest Limits section. It recommends no vendor.
For the leadership-level version — the load-bearing claim, the survey gap, and the three-line checklist — read the companion brief, Capability Convergence and the Vendor Dependency Trap .
Quality parity does not eliminate switching cost. It makes switching cost invisible — until the contract is signed and the toolchain is woven in.
The Convergence Is Real
Begin with the evidence, because the rest of the argument depends on it being true and on being scoped exactly. On GPQA Diamond — a graduate-level science reasoning benchmark used as a frontier discriminator — the top cluster is separated by less than a point: Claude Mythos Preview at 94.5, Opus 4.7 Adaptive at 94.2, and GPT-5.5 at 93.6, with the benchmark itself described as nearing saturation. 1 The pattern repeats on MATH-500, where frontier-tier scores cluster near 96%. 2 A benchmark that no longer spreads its top contestants by a meaningful margin has stopped being a ranking instrument and become a threshold test: it tells you a model is at the frontier, not which frontier model to buy — a reading consistent with the published benchmark methodology and its own saturation note. 19
The release cadence reinforces the picture. A window of roughly thirty days in early 2026 saw a dense cluster of frontier releases — GPT-5.5, Opus 4.7, Gemini 3.5 Flash, DeepSeek V4 Pro, and Qwen 3.7 Max — arrive in close succession, the market structure of a field where multiple vendors reach comparable capability at nearly the same time. 4 21 And crucially, no single model dominates across the board. Leadership trades hands by task: a code-arena Elo around 1548 for one model, an operating-system agent benchmark led by another, a reasoning or abstraction test — GPQA, ARC-AGI-2 — led by a third. 5 20 The absence of a dominant model is itself a form of convergence: when the leader changes with the benchmark, no benchmark crowns a winner.
The scope of this claim must be stated precisely, because overstating it would be the easiest way to discredit the paper. The convergence is on the procurement-grade benchmarks — GPQA, MMLU, SWE-bench — that buying teams have historically used to compare frontier models. It is not a claim of general equivalence between models. Two models that score within a point on GPQA can behave very differently on a long-horizon agentic task, under a specific tokenizer, or inside a particular ecosystem — and the next section is about exactly those differences. The narrow claim is all the argument needs: the instruments procurement actually used have lost their discriminating power at the tier procurement actually shops in.
The procurement benchmarks no longer discriminate at the frontier tier. That is not general equivalence — it is the failure of a specific instrument at the specific job it was bought to do.
Where Differentiation Has Moved
Differentiation did not disappear when the quality benchmarks converged; it relocated to dimensions those benchmarks never captured. The clearest example is agentic capability. On computer-use evaluations, the spread is wide and real: OSWorld-Verified scores range from Holo3 at 82.6 and Claude Mythos at 79.6 down to a significant gap below the frontier — a spread that GPQA, measuring static reasoning, simply does not see. 6 Agentic capability now carries meaningful weight in composite benchmark scoring, which is to say the industry has begun pricing in a dimension that a single quality score elides. 6
The second dimension is economic. The tokenizer a vendor uses, and the pricing model it wraps around that tokenizer, determine the real cost of running a workload — and these vary by vendor in ways a quality score does not show. Two models at quality parity can differ substantially in cost-per-task once tokenization and pricing are accounted for. This is a switching-cost surface in its own right, and we develop it only as a cross-reference here, because the companion paper on hidden cost vectors carries the full treatment. 7
The third dimension is ecosystem depth, and it is the one most directly relevant to lock-in. The depth of a vendor's surrounding ecosystem — the tools, integrations, and platform features built around its models — is where switching cost accumulates fastest, because each integration is cheap to add and expensive to unwind. 8 Procurement has begun to recognize this implicitly: coding-tool evaluation has expanded to a 15-criteria matrix across some twenty platforms, a sign that buyers already sense the single quality score is no longer sufficient even if they have not named why. 9 23
The benchmark your procurement team relied on last year is now a noise floor. Differentiation moved to agentic reliability, tokenizer economics, and ecosystem depth — none of which the benchmark reports.
The Instrument Gap
This is the pivot of the paper, and the place where the argument crosses from the empirical to the inferential. We label the transition openly: a claim about benchmark scores is a fact about leaderboards; a claim about what that does to procurement decisions is an inference about behavior. The inference is this. If the benchmark that a procurement program runs measures only quality, and quality no longer discriminates at the frontier, then the program is measuring the one thing that has stopped varying and ignoring the things that have not. A team running a quality-only evaluation is, structurally, procuring on incomplete information — not because it is careless, but because its instrument was calibrated for a market that no longer exists.
The gap has a specific shape. The dimensions that now differentiate vendors — agentic reliability on the team's own workflow, the tokenizer economics of the team's own token mix, the depth of the ecosystem the team will actually build into — are precisely the dimensions a generic quality benchmark cannot measure, because they depend on the buyer's context rather than on the model in the abstract. 6 A leaderboard is a context-free measurement; procurement is an irreducibly context-bound decision. When the two were correlated — when a higher GPQA score reliably predicted a better outcome on your workload — the leaderboard was a serviceable proxy. As the correlation weakens under convergence, the proxy degrades into noise.
We are careful here about what we do and do not claim. We do not claim that any specific enterprise made a bad procurement decision because of this gap; we have no case study, and the Honest Limits section says so plainly. What we claim is structural: that an evaluation program built around an instrument that has lost its resolution will, on average, carry less information into the decision than its operators believe it does. This connects to two arguments KellerAI has made elsewhere — that the gap between evaluation and production is where models surprise their owners, and that the organizational capital a team builds around a model is not replicated by swapping in a better base model. 10 The instrument gap is where those two arguments meet: the eval measures the wrong thing, and the thing it fails to measure is exactly the thing that locks the team in.
A leaderboard is a context-free measurement. Procurement is a context-bound decision. As the two decouple, the instrument that used to serve as a proxy becomes a noise floor with a number on it.
Switching Cost Has Moved Address
This is the central claim, and it rests on the strongest survey evidence in the paper. In a survey of five hundred US executives, 89% believed they could switch AI vendors in under a month, and 41% believed they could do it in two to five days. 3 That is the belief. The reality, among executives who actually attempted a switch, was sharply different: only 42% said the switch went smoothly, while 58% reported it failed or was harder than expected, and 74% described themselves as reliant on or disrupted by their vendor relationship. 3 The 42%-versus-89% gap is the switching cost made visible: the distance between what executives believed about their freedom to move and what they encountered when they exercised it.
The mechanism behind the gap is a set of compounding lock-in surfaces. Five recur across the practitioner literature: model behavior tuned to a specific vendor, workflow automation wired to a specific API, training and reference data shaped for one model, the compliance evidence assembled around a deployment, and the operational expertise a team accumulates running it. 11 Each surface is individually modest and collectively decisive: none is expensive to add, all are expensive to unwind, and their sum is the switching cost that the quality benchmark never measured and the procurement team rarely inventoried. We label the link from these surfaces to the survey gap as an inference: the survey measures the felt difficulty, the literature names the surfaces, and the causal bridge between them is coherent rather than independently documented.
The character of the migration confirms the diagnosis. As one practitioner put it, switching vendors is “no longer just an API migration” — it involves context, workflows, and institutional memory, and most operators have not mapped any of it. 12 An API migration is a known quantity with a known cost; a migration of undocumented institutional memory is a cost you discover only by paying it. The tail of this distribution is the vendor that fails outright: the insolvency of one AI vendor left dependent customers exposed in ways an exit plan would have softened — an extreme case, labeled as such, but a reminder that the switching-cost surface and the continuity-risk surface are the same surface viewed from two angles. 13
Nearly ninety percent of executives believed they could switch in under a month. Fewer than half of those who tried say it went smoothly. The gap is the switching cost benchmarks cannot see.
Procurement Posture Under Convergence
If the old procurement instrument has lost its resolution, the response is a new instrument, not a new vendor. Three practices restore a procurement signal under convergence. They are recommendations derived by inference from the structure of the problem, not conclusions read off a controlled study, and we present them as such. None prescribes a vendor; each adds an axis the quality benchmark cannot supply.
- 01Measure agentic reliability on your own workflow. Because quality benchmarks no longer discriminate at the frontier, the discriminating signal must come from the buyer's context. Run a structured evaluation of agentic reliability — tool calling, multi-step execution, recovery from failure — on the tasks you actually run, and treat that result, not a GPQA delta, as the procurement signal. The public leaderboard tells you a model is at the frontier; only your own eval tells you which frontier model fits your workload.
- 02Inventory the switching-cost surfaces before commitment. Map the five lock-in surfaces — model behavior, workflow automation, training and reference data, compliance evidence, operational expertise — before signing, not after. Pre-register the expectation: at procurement time, write down what you expect each surface to cost on exit, so the eventual migration is measured against a recorded baseline rather than discovered by surprise. A team that has not inventoried these surfaces is procuring on the belief that switching is cheap.
- 03Treat the agentic toolchain as infrastructure. Billing integration, tokenizer economics, and ecosystem dependencies are infrastructure, not line items — and infrastructure is what you have to rebuild on exit. Price the workload under each vendor's tokenizer and pricing model, and weigh the depth of integration you are about to take on as a cost you will pay again if you leave. This is the cost vector the companion paper on hidden costs develops in full.
These two new axes — agentic reliability on your workflow and a switching-cost inventory before commitment — sit beside, not instead of, the quality threshold. The frontier baseline still matters; it has simply stopped being the variable. Existing market analyses point the same way: one maps vendor selection onto a trust-versus-flexibility plane rather than a single quality ranking, an implicit acknowledgment that the buying decision now has more than one dimension. 14 And procurement practice has already moved partway: the expansion of coding-tool evaluation to a 15-criteria matrix is buyers building, in effect, the multi-axis instrument this section argues for. 9
The inventory practice has a discipline borrowed from elsewhere in the KellerAI catalog: pre-register your expectations at procurement time, so the decision can be validated against a recorded baseline rather than rationalized after the fact. Writing down, before you sign, what you expect a switch to cost on each of the five surfaces converts an invisible cost into a recorded one — and a recorded cost is one a team can actually weigh. 15
The Agentic Toolchain as Infrastructure
The reason switching cost concentrates in the toolchain is that the toolchain is infrastructure, and infrastructure is what an organization builds around rather than swaps out. The agentic layer has developed its own protocol stack: multiple competing layers — the Model Context Protocol with its very large install base, agent-to-agent protocols under foundation stewardship, and other connection and coordination protocols — each of which a team wires into when it adopts a vendor's ecosystem. 16 22 The governance of that stack is its own subject, treated in the companion paper on the protocol stack; the point here is narrower. Every protocol integration is a thread that has to be cut and re-sewn on a switch, and the more threads, the higher the cost.
Billing integration is the second infrastructure surface, and it is vendor-specific by design. Vendors increasingly wrap their pricing in bespoke models — credit systems, dynamic pricing schemes, tiered plans — that a customer's finance and engineering functions build process around. 17 When the billing model is vendor-specific, the integration that consumes it is a switching-cost surface: the reconciliation logic, the budgeting assumptions, and the cost-attribution tooling all assume the incumbent vendor's billing shape, and all of it has to be rebuilt on exit. This is the same cost vector the hidden-cost companion treats in detail; here it appears as a lock-in surface rather than a pricing surprise. 7
The third surface is vendor-specific reasoning and orchestration infrastructure. A team that builds its agents around one vendor's reasoning model, its tool-use conventions, and its orchestration primitives accumulates organizational capital that is not portable — the prompt libraries, the evaluation harnesses, the operational runbooks, and the hard-won knowledge of how this particular model fails. This is the “thinking moat” viewed from the buyer's side: the organizational capital around a model that is not replicated by swapping in a better base model. 10 It is the most valuable thing a team builds and the least visible on any procurement scorecard, which is exactly why it is the surface where lock-in compounds most quietly.
Every protocol integration is a thread to cut and re-sew on a switch. Every vendor-specific billing model is process built around an incumbent. Infrastructure is, by definition, what you build around rather than swap.
Honest Limits
A paper that argued for a procurement posture while overstating its evidence would commit the very error it warns against. Here is what this paper does not establish.
There is no documented procurement case study behind the central causal claim. We have strong evidence that benchmarks have converged, and strong survey evidence that executives systematically underestimate switching cost — but we do not have a documented instance of a specific enterprise making a specific bad decision because its quality-only eval failed to capture lock-in. The chain from convergence to instrument gap to switching cost is coherent and grounded at each link, but the chain itself is an inference. We have written the argument so that removing any single claim does not collapse it: the convergence stands on benchmark data, the switching-cost gap stands on the survey, and the posture follows from either even without the other. 3
Benchmark saturation is not general equivalence, and we have been careful to scope the convergence claim to the procurement-grade benchmarks — GPQA, MMLU, SWE-bench — rather than to model capability in general. The agentic and computer-use evidence shows differentiation is alive and wide; a reader who concluded from this paper that frontier models are interchangeable would have read the opposite of its argument. 6 One input deserves a specific caution: where a benchmark figure rests on a model in preview rather than general availability, the general-availability anchors — Opus 4.7 Adaptive and GPT-5.5 — carry the convergence claim on their own, and the preview entry is corroborating rather than load-bearing. 1
The central survey is from a vendor, and we treat it as directional. The 89%-versus-42% figures come from an orchestration-platform company with a commercial interest in the salience of vendor lock-in; the magnitude may be inflated by the framing of the questions or the composition of the sample. 3 26 We rely on the direction of the finding — that belief about switching ease substantially exceeds realized switching ease — rather than on the precise percentages, and the argument survives a considerable discount to the magnitude. Several supporting sources are practitioner blogs and market-analysis sites rather than peer-reviewed work; we have used them for the structure of the problem (the five surfaces, the protocol layers, the procurement matrix) rather than for contested quantitative claims. 11 18 24
This paper recommends no vendor and prescribes no product. The posture in Section 6 describes a measurement discipline — eval on your own workflow, inventory the surfaces, price the integration — not a choice between models, and the pricing and evaluation specifics it gestures at live in the companion papers on hidden cost and the evaluation-to-production gap rather than here. 7 10 The contribution is a single structural observation: when quality stops separating vendors, the instruments that measured quality stop working, and the switching cost that accumulates in the gap is invisible until it is not. A reader who accepts only the convergence and only the survey can still adopt every recommendation in Section 6, which is itself summarized in the companion brief. 25
For the short, leadership-level version of this argument — the load-bearing claim, the survey gap, and the Eval / Inventory / Price checklist — read the companion brief, Capability Convergence and the Vendor Dependency Trap .
Not that any single team made the wrong call — we cannot show that — but that an instrument calibrated for a market that no longer exists carries less information than the team relying on it believes.
The Point
Convergence at the frontier is good news read correctly. It means more than one vendor can now do the procurement-grade work well, a baseline high enough that the choice of model is no longer the dominant risk in a deployment. The risk has moved, and the failure mode this paper names is the failure to move with it: continuing to measure benchmark quality — the axis that converged — while the axis that now determines the cost of a decision, switching cost, goes unmeasured.
The remedy is a measurement, not a vendor. Run your own agentic evaluation on your own workflow, because the public benchmark has stopped discriminating at the tier you shop in. Inventory the five lock-in surfaces before you sign, because the switching cost is their sum and it is invisible until you try to leave. And treat the agentic toolchain — protocols, billing, orchestration — as the infrastructure it is, priced as a cost you will pay again on exit. None of this prescribes a model; all of it restores a procurement signal the quality benchmark can no longer carry.
The single sentence the whole paper reduces to is this: when quality stops separating vendors, the procurement instruments that measured quality stop working — and the switching cost that accumulates in the gap is invisible until the contract is signed and the toolchain is woven in. Convergence is not the trap. Measuring the axis that converged, while ignoring the one that did not, is the trap.
Convergence is not the trap. Measuring the axis that converged, while ignoring the one that did not, is the trap.