TL;DR: Benchmark contamination is real and measurable. Scale AI’s GSM1k study showed the hardest-hit models dropping up to 13 percentage points on a freshly written set. But the deeper failure is that capability evaluation has only ever measured correlation with a test distribution. Harder benchmarks reset the clock. They don’t introduce verification, and verification is what’s actually missing.
If you’ve been reading model-release blog posts for a while, the table on page one starts looking familiar. Classic benchmarks near the top, newer harder ones below, every number a hair better than the last generation’s. The explanation everyone reaches for is the saturation story: older benchmarks got too easy, build harder ones, repeat. HLE, LiveCodeBench, FrontierMath, MMLU-Pro all live inside that story.
Most of it is fine, honestly. I don’t want to spend a whole post complaining about a habit that does buy time. The thing is, the more I sit with the recent leaderboards next to the GSM1k study from a couple of years back, the more I think the saturation story leaves out the piece that actually keeps the cycle running. Which is what I want to walk through here.
The story everyone tells
Let me lay out the standard argument properly first, because the part of it that’s right is doing real work.
It runs roughly like this. Classic benchmarks like MMLU and GSM8k saturated. Frontier scores on MMLU now cluster in an 88-94% band, narrow enough that the ranking differences inside it are mostly noise. GSM8k is functionally solved, with frontier models sitting around 99%. HumanEval is in the same neighborhood. The fix everyone reaches for is to design harder, more current evaluations. Humanity’s Last Exam (HLE) holds 2,500 graduate-and-beyond questions across 100+ subjects, with human experts averaging around 90% in their own domains. LiveCodeBench keeps collecting new contest problems and scores each model only on the ones released after its training cutoff. Run those, get a clean signal, swap them when they saturate too.
The steelman is real. Saturation does mean something. Contamination-resistant designs do produce harder signals. The community has bought itself two productive years this way, which isn’t nothing.
Where it stops working
HLE was designed in 2025 to stump frontier reasoning, and by late May 2026 several frontier models are already sitting in the mid-40s on the Artificial Analysis leaderboard. Which, side note: that was fast. Human experts still average around 90% in their own domains, but the gap that looked enormous a year ago is visibly closing.
The “headroom” was never really a property of the benchmark. It was just the gap between current models and the ceiling. Difficulty buys you time. It doesn’t buy you a different kind of measurement, and the cycle keeps quietly asking for one.
What GSM1k actually showed
If you want one piece of evidence that this is structural and not just “we picked bad benchmarks,” it’s the GSM1k study. Scale AI wrote 1,250 new grade-school math problems matched in style and difficulty to GSM8k, then re-ran a wide model set. The abstract has the headline: the hardest-hit model dropped 13 percentage points on the new set. That’s the number that travels.
The one I keep going back to though, and I’ll admit I skimmed past it the first time, is the one a sentence or two later: a Spearman’s r² of 0.32 (a rank correlation of roughly 0.57) between a model’s probability of generating GSM8k samples and the gap between its GSM8k and GSM1k scores. Mistral and Phi families showed consistent overfitting across versions and sizes. Llama2 and the contemporary frontier models did not.
Plain reading: the more readily a model could regurgitate GSM8k, the more its score tended to fall when it moved from GSM8k to a fresh set of equivalent difficulty. The 13 points is the headline. The 0.32 is the thing that says something about what the score actually is.
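For anyone who wants to see what that statistic is mechanically, here is a minimal sketch of a Spearman rank correlation in plain Python. The six data points are made up for illustration and are not from the GSM1k paper; the names `log_likelihood` and `gap_points` are mine. The only number taken from the study is the 0.32 at the end.

```python
import math

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman(xs, ys):
    """Spearman correlation is Pearson correlation computed on ranks."""
    return pearson(ranks(xs), ranks(ys))

# HYPOTHETICAL data, one entry per model (not the paper's numbers):
# mean log-likelihood the model assigns to GSM8k test items,
# and its GSM8k-minus-GSM1k accuracy gap in percentage points.
log_likelihood = [-2.9, -2.4, -2.6, -1.8, -2.1, -1.5]
gap_points = [0.5, 3.0, -1.0, 6.0, 2.0, 9.0]

rho = spearman(log_likelihood, gap_points)
print(f"toy rho = {rho:.3f}, toy rho^2 = {rho ** 2:.3f}")

# The study reports r^2 = 0.32; the implied rank correlation is its square root.
print(f"reported r^2 = 0.32 -> r = {math.sqrt(0.32):.2f}")  # about 0.57
```

The toy data is deliberately cleaner than anything real. The point is only that the statistic compares orderings, not raw values: it asks whether the models that find GSM8k most familiar are also the ones that lose the most on the rewrite.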
Why frontier-models-survived isn’t reassuring
The reading most people take from GSM1k, that frontier models held up, gets put down as a relief. I had to read it twice before I stopped reading it that way, so I get the instinct. But I don’t think the relief is earned, and the reason is a little subtle.
Frontier models holding up on rebuilt grade-school math doesn’t mean they weren’t trained on GSM8k. It means their underlying capability already exceeded the GSM8k ceiling, so whatever memorization existed couldn’t lift the score any further. Above the ceiling, memorization and competence converge to the same number. So “no crash” is closer to “this benchmark stopped being informative for the models you actually care about” than to “this benchmark is sound.” Which, if you squint, is just the saturation argument again, dressed differently.
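A toy model makes the ceiling argument concrete. This is my own simplification, not anything from the GSM1k paper: assume an item on the original benchmark is answered correctly if the model can genuinely solve it or has memorized it, with the two treated as independent, while a fresh item can only be solved. The names `competence` and `memorized` are hypothetical parameters.

```python
def expected_scores(competence, memorized):
    """Toy model: returns (original_score, fresh_score) as fractions.

    competence: probability the model genuinely solves an item.
    memorized:  probability an original-benchmark item was memorized.
    Assumes the two are independent, which real data need not satisfy.
    """
    original = competence + (1 - competence) * memorized
    fresh = competence  # nothing on the fresh set can be memorized
    return original, fresh

def contamination_gap(competence, memorized):
    """Original-minus-fresh gap; algebraically (1 - competence) * memorized."""
    original, fresh = expected_scores(competence, memorized)
    return original - fresh

# Same amount of memorization, three different levels of competence.
for competence in (0.50, 0.80, 0.99):
    gap = contamination_gap(competence, memorized=0.30)
    print(f"competence={competence:.2f}  gap={100 * gap:.1f} points")
```

Under these assumptions the gap is (1 - competence) × memorized, so it shrinks toward zero as competence approaches the ceiling no matter how much was memorized. A small gap at the top is therefore compatible with heavy contamination.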
The general shape, if I had to name it in one line: once a benchmark saturates, the score loses the ability to tell memorization apart from competence at the top, and you can’t recover that separation by staring at the same score harder.
What we’ve actually been measuring
This is the part I sat with the longest, because it’s not obvious until you put it next to a few other things, and then it kind of becomes the only thing you can see.
Benchmark scores have always been correlation, not verification. You measure how often a model produces the gold answer on a held-out distribution, and that correlates with capability as long as the items weren’t seen, the items are independent, and ranking differences exceed noise. When any of those conditions breaks (contamination, near-duplicates, saturation noise), the correlation degrades quietly. The number on the chart keeps climbing.
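The third condition, ranking differences exceeding noise, is the easiest one to sanity-check yourself. Here is a rough sketch using a binomial standard error. It assumes items are independent and treats the two models’ runs as independent samples, which is a simplification (a paired test on the shared items would be sharper), and it only covers sampling noise, not label errors or prompt sensitivity. The item count and accuracies are hypothetical.

```python
import math

def score_standard_error(accuracy, n_items):
    """Binomial standard error of an accuracy measured on n_items items."""
    return math.sqrt(accuracy * (1 - accuracy) / n_items)

def difference_exceeds_noise(acc_a, acc_b, n_items, z=1.96):
    """True if the score difference is larger than z standard errors.

    z=1.96 corresponds to a conventional 95% threshold; it is a
    convention, not a property of any particular benchmark.
    """
    se_diff = math.sqrt(
        score_standard_error(acc_a, n_items) ** 2
        + score_standard_error(acc_b, n_items) ** 2
    )
    return abs(acc_a - acc_b) > z * se_diff

# HYPOTHETICAL benchmark with 500 items.
print(difference_exceeds_noise(0.91, 0.92, n_items=500))  # False
print(difference_exceeds_noise(0.85, 0.92, n_items=500))  # True
```

A one-point lead on a few hundred items sits well inside the noise under these assumptions, which is the situation a saturated leaderboard puts you in.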
We never actually had a way to confirm a model learned a thing. Only a way to confirm it has seen enough of the thing-shaped distribution. I think the blog has been bumping into this shape from a couple of directions: for agents in “No evidence, no completion,” where a confident agent report isn’t the same as a confirmed task; and for protocols in the MCP security post, where “the protocol allows it” got mistaken for “it’s safe.” Benchmarks turn out to be another instance of the same thing. A convention treated as evidence, until the convention breaks.
Why “build a harder one” doesn’t fix it
Harder benchmarks address the symptom (saturation), not the disease. They give you a higher ceiling and more discrimination at the top, and they don’t introduce verification. The moment a harder benchmark is public it enters the data stream that trains the next generation. LiveCodeBench-style time-slicing helps a lot (paper), because problems published after the cutoff are by construction unseen, but only relative to a given model’s training cutoff. Once a slice is published, it is a static benchmark for every model trained after it.
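The time-slicing idea is simple enough to sketch. This is not LiveCodeBench’s actual code, just a minimal illustration of the principle; `Problem`, `unseen_slice`, and the dates are all hypothetical, and it assumes you trust the reported training cutoff.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date

def unseen_slice(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]

# HYPOTHETICAL problem pool.
problems = [
    Problem("p1", date(2024, 1, 10)),
    Problem("p2", date(2024, 6, 5)),
    Problem("p3", date(2025, 2, 20)),
]

# A model with an earlier cutoff gets a larger clean slice than a later one.
older_model = unseen_slice(problems, training_cutoff=date(2024, 3, 1))
newer_model = unseen_slice(problems, training_cutoff=date(2025, 1, 1))

print([p.problem_id for p in older_model])  # ['p2', 'p3']
print([p.problem_id for p in newer_model])  # ['p3']
```

The same pool yields a smaller clean slice for the later-trained model, because everything published before its cutoff has to be treated as potentially seen. The protection comes from the date comparison, not from the problems themselves.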
The reframe I’d push, if anything: capability evaluation probably isn’t one artifact you build, score against, and ship. It’s an ongoing protocol with verification baked in. Nothing widely deployed has that yet. Time-sliced benchmarks and private holdouts are the closest analogues, and they’re both partial answers at best.
How to read a 2026 leaderboard
Mostly: look at the absolute number last.
The questions I’ve found more useful, in roughly the order I run them:
- When were the items released relative to the model’s training cutoff? If they’re older, the score is suspect by default.
- If there’s a private split, what’s the gap to the public number? A wide gap is contamination smoke.
- How does a model’s score behave between a saturated benchmark and a contamination-resistant one? A model near the ceiling on MMLU but flat on LiveCodeBench is telling you something about where its lift came from.
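If it helps to have the first two questions as something you can run over a spreadsheet of scores, here is a rough sketch. Everything in it is my own scaffolding: `ScoreReport`, `triage`, and especially the 5-point `gap_threshold`, which is an arbitrary placeholder and not an established cutoff. The third question needs scores from two benchmarks side by side, so it is left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class ScoreReport:
    benchmark: str
    items_released: date
    training_cutoff: date
    public_score: float                    # accuracy as a fraction, 0.0 to 1.0
    private_score: Optional[float] = None  # None if no private split exists

def triage(report: ScoreReport, gap_threshold: float = 0.05) -> List[str]:
    """Return warning flags for one score; an empty list means none raised.

    gap_threshold is an ARBITRARY illustrative value, not a standard.
    """
    flags = []
    # Question 1: could the items have been in the training data?
    if report.items_released <= report.training_cutoff:
        flags.append("items predate training cutoff: suspect by default")
    # Question 2: does the private split tell a different story?
    if report.private_score is not None:
        if report.public_score - report.private_score > gap_threshold:
            flags.append("wide public-private gap: possible contamination")
    return flags

# HYPOTHETICAL report.
report = ScoreReport(
    benchmark="some-public-benchmark",
    items_released=date(2021, 1, 1),
    training_cutoff=date(2024, 6, 1),
    public_score=0.92,
    private_score=0.80,
)
for flag in triage(report):
    print(flag)
```

An empty flag list is not a clean bill of health. It only means these two cheap checks found nothing, which is consistent with the larger point that the score is one lossy signal among several.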
The other habit I’ve half-developed (still working on it, honestly) is to stop letting a single score describe a model’s capability for me. Two models at 92 and 99 on the same saturated benchmark might be indistinguishable on your actual task, or wildly apart. The benchmark won’t tell you which. You have to point them at the task and see, which is annoying, but I haven’t found a shortcut.
What honest evaluation would even look like
The closest analogy I keep coming back to is how good engineering treats correctness claims: tests written by people who aren’t the implementers, on cases the implementation didn’t get to peek at, with the reasoning checked, not just the final answer. None of that is anywhere near production-ready at frontier scale, and the labs all know it. So I’m not pretending there’s a simple drop-in fix.
The honest near-term answer is a little uncomfortable. Benchmarks aren’t going away. They’re still the cheapest way the field has to compare notes, and they’re useful as long as you don’t load too much on them. If a score stops being a capability claim and starts being one of several lossy signals you weigh against the actual task in front of you, the leaderboard goes from misleading to just lossy. Lossy is something you can live with. Just don’t forget that’s what it is.
Read next
- No evidence, no completion: a verification principle for AI agents — the same governance argument applied to agent task completion.
- MCP security isn’t a protocol bug. It’s a governance problem. — the convention-vs-evidence gap, on the protocol side.
- Why AI agents fail without governance — context on why verification stops being optional once models act.
- Benchmark 飽和的真正問題:不在測量,在驗證 (Chinese companion) — independent companion piece in Traditional Chinese.