The Stakes
At the frontier tier, the models a procurement team is choosing between no longer separate cleanly on the benchmarks that used to rank them. On GPQA Diamond, a graduate-level reasoning test, the top cluster sits inside a single point: Claude Mythos Preview at 94.5, Opus 4.7 Adaptive at 94.2, and GPT-5.5 at 93.6, with the benchmark described as nearing saturation. MATH-500 scores cluster near 96% across the same tier. When the scores converge, the procurement question that organized the last buying cycle — “which model scores best?” — loses its resolution.
The convergence does not make switching cost go away. It relocates it. Quality parity moves the cost out of the model and into the agentic toolchain, the billing integration, and the undocumented workflows a team builds around whichever vendor it adopted first — none of which a quality benchmark measures. The result is a gap between what executives believe about their freedom to move and what they find when they try. In a survey of five hundred US executives, 89% believed they could switch AI vendors in under a month and 41% expected to do it in two to five days; among those who actually attempted a switch, only 42% said it went smoothly, and 58% found it failed or was harder than expected.
Quality parity does not eliminate switching cost. It makes switching cost invisible — until the contract is signed and the toolchain is woven in.
Why the Benchmarks Stopped Helping
The convergence is real, and it is scoped. A roughly thirty-day window in early 2026 saw a cluster of frontier releases — GPT-5.5, Opus 4.7, Gemini 3.5 Flash, DeepSeek V4 Pro, and Qwen 3.7 Max — arrive in close succession, and no single model dominates across the board. Leadership trades hands by task: one model leads on a code-arena Elo, another on an operating-system agent benchmark, another on a reasoning or abstraction test. The precise claim is narrow and worth stating exactly: on the procurement-grade benchmarks — GPQA, MMLU, SWE-bench — the frontier tier no longer discriminates. That is not the same as general equivalence between models.
Differentiation has not vanished; it has moved to dimensions the procurement benchmark never captured. On computer-use evaluations, the spread is wide and real: OSWorld-Verified scores range from Holo3 at 82.6 and Claude Mythos at 79.6 down to a significant gap below the frontier. Agentic capability now carries meaningful weight in composite benchmark scoring, and the tokenizer economics that determine the real cost of a workload vary by vendor in ways a quality score does not show — a thread the companion paper on hidden cost vectors develops.
The benchmark your procurement team relied on last year is now a noise floor. The differentiation moved to dimensions that benchmark never measured.
Where the Switching Cost Lives
When quality stops being the variable, the cost of changing vendors accumulates in five compounding surfaces that a benchmark cannot see: model behavior tuned to a specific vendor, workflow automation wired to a specific API, training and reference data shaped for one model, the compliance evidence assembled around a deployment, and the operational expertise a team builds operating it. Each surface is cheap to add and expensive to unwind, and none of them appears on the scorecard that drove the original decision.
The character of the migration has changed accordingly. As one practitioner put it, vendor migration is “no longer just an API migration” — it now involves context, workflows, and institutional memory, and most operators have not mapped any of it. The survey evidence corroborates the felt difficulty: 74% of the executives surveyed described themselves as reliant on or disrupted by their AI vendor relationship. And the extreme case is a reminder of the tail risk: the insolvency of one AI vendor left dependent customers exposed in ways an exit plan would have softened.
Nearly ninety percent of executives believed they could switch. Fewer than half who tried say it went smoothly. The gap is the switching cost the benchmark could not measure.
The Checklist: Eval, Inventory, Price
Because the old procurement instrument has stopped discriminating, the response is a new instrument, not a new vendor. Three practices restore a procurement signal under convergence. None prescribes a model; each recovers information the quality benchmark no longer provides.
- 01Run your own agentic eval. Quality benchmarks no longer discriminate at the frontier, so the discriminating signal has to come from your workload. Measure agentic reliability on the tasks you actually run — tool calling, multi-step execution, recovery from failure — rather than reading a GPQA delta as a procurement signal it can no longer carry.
- 02Inventory the five lock-in surfaces before you sign. Map model behavior, workflow automation, training and reference data, compliance evidence, and operational expertise before the commitment, not after. The switching cost is the sum of these surfaces; a team that has not inventoried them is procuring on the belief that switching is cheap — the belief the survey shows is usually wrong.
- 03Treat billing integration as infrastructure. Vendor-specific billing and credit models are a lock-in surface, not a line item. Price the workload under each vendor's tokenizer and pricing model, and treat the integration as infrastructure you will have to rebuild on exit — the cost vector the companion paper on hidden costs develops in full.
The companion in-depth paper develops each practice against its evidence base — the benchmark cluster, the market structure, and the survey on switching cost — and addresses the limits of the argument honestly. For the full structural analysis of where differentiation moved and how to procure under convergence, read the in-depth companion .
The Point
Convergence at the frontier is not a crisis. It is the result of a high and rising baseline: more than one vendor can now do the procurement-grade work well, which is a better problem than the one the industry had two years ago. The crisis, if there is one, is narrower and more specific. It is that teams are still measuring the old axis — benchmark quality — while the variable that actually determines the cost of a decision has moved to an axis they are not measuring at all.
The argument here is deliberately bounded. Benchmark saturation on GPQA, MMLU, and SWE-bench is not general equivalence; agentic and computer-use capability still spread widely. The causal link from convergence to lock-in is coherent and grounded in survey evidence, but it is an inference rather than a documented procurement case study, and the central survey is itself from a vendor in the orchestration market and should be read as directional. This paper recommends no vendor. It recommends a measurement: run your own agentic eval, inventory the lock-in surfaces, and price the integration before quality parity convinces you the choice does not matter.
Convergence is not the problem. Measuring the axis that converged, while ignoring the one that did not, is the problem.