What is benchmark contamination?

Benchmark contamination means a model's training data contains the test items themselves, or near-duplicates of them. Scores then reflect memorization rather than reasoning. Scale AI's GSM1k study wrote 1,205 new grade-school math problems matched in style and difficulty to GSM8k, and observed accuracy drops of up to 8 percentage points on the new set, with the worst-affected model families showing systematic overfitting.

Why didn't frontier models crash on GSM1k?

Their underlying capability already exceeded the GSM8k ceiling, so any memorization of individual items added little to their final score. That means contamination moved their scores less, not that they were uncontaminated or that the benchmark is sound.

Aren't contamination-resistant benchmarks the fix?

They delay the same problem. LiveCodeBench pulls fresh weekly contest problems from LeetCode, Codeforces, and AtCoder, which works for newly released models. For any frozen checkpoint, today's LiveCodeBench will eventually become a static benchmark too.

How should you read a 2026 leaderboard?

Care less about absolute scores. Care about whether problems post-date the training cutoff, whether a private split exists, and how a model's score gap behaves between saturated benchmarks and contamination-resistant ones.

LLM Benchmark Saturation Is a Verification Problem

TL;DR: Benchmark contamination is real and measurable. Scale AI’s GSM1k study showed accuracy drops of up to 8 percentage points on a rebuilt set, concentrated in the model families that had overfitted. But the deeper failure is that capability evaluation has only ever measured correlation with a test distribution. Harder benchmarks reset the clock. They don’t introduce verification, and verification is what’s actually missing.

If you’ve been reading model-release blog posts for a while, the table on page one starts looking familiar. Classic benchmarks near the top, newer harder ones below, every number a hair better than the last generation’s. The explanation everyone reaches for is the saturation story: older benchmarks got too easy, build harder ones, repeat. HLE, LiveCodeBench, FrontierMath, MMLU-Pro all live inside that story.

Most of it is fine, honestly. I don’t want to spend a whole post complaining about a habit that does buy time. The thing is, the more I sit with the recent leaderboards next to the GSM1k study from a couple of years back, the more I think the saturation story leaves out the piece that actually keeps the cycle running. Which is what I want to walk through here.

The story everyone tells

Let me lay out the standard argument properly first, because the part of it that’s right is doing real work.

It runs roughly like this. Classic benchmarks like MMLU and GSM8k saturated. Frontier scores on MMLU now cluster in an 88-94% band, narrow enough that the ranking differences inside it are mostly noise. GSM8k is functionally solved: the top model on public leaderboards sits near 99% and the rest of the frontier clusters in the mid-to-high 90s. HumanEval is in the same neighborhood. The fix everyone reaches for is to design harder, more current evaluations. Humanity’s Last Exam (HLE) holds 2,500 graduate-and-beyond questions across 100+ subjects, each with an answer an expert can verify but a search engine can’t retrieve. LiveCodeBench keeps collecting dated contest problems, so each model can be scored only on problems released after its training cutoff. Run those, get a clean signal, swap them when they saturate too.

The steelman is real. Saturation does mean something. Contamination-resistant designs do produce harder signals. The community has bought itself two productive years this way, which isn’t nothing.

Where it stops working

HLE was designed in 2025 to stump frontier reasoning, and by late May 2026 several frontier models are already scoring in the mid-40s (percent) on the Artificial Analysis leaderboard. HLE publishes no human-expert baseline to measure that against, but the headroom that looked enormous a year ago is visibly closing.

The “headroom” was never really a property of the benchmark. It was just the gap between current models and the ceiling. Difficulty buys you time. It doesn’t buy you a different kind of measurement, and the cycle keeps quietly asking for one.

What GSM1k actually showed

If you want one piece of evidence that this is structural and not just “we picked bad benchmarks,” it’s the GSM1k study. Scale AI rebuilt 1,205 grade-school math problems matched in style and difficulty to GSM8k, then re-ran a wide model set. The abstract has the headline: accuracy drops of up to 8 percentage points on the new set. That’s the number that travels. (An early preprint said 13; a later revision brought it down to 8, and the higher figure still circulates.)

A sentence or two later, there’s a Spearman r² of 0.36 between a model’s probability of generating GSM8k samples and its GSM1k-vs-GSM8k gap. Mistral and Phi families showed consistent overfitting across versions and sizes. Llama2 and the contemporary frontier models did not.

Plain reading: the more a model could regurgitate GSM8k, the better it looked on GSM8k and the worse it looked on a fresh set of equivalent difficulty. The 8 points is the headline. The 0.36 is the thing that says something about what the score actually is.
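To make the 0.36 a bit more concrete, here is a minimal sketch of how a rank correlation like that gets computed. The per-model numbers are invented purely for illustration (they are not from the GSM1k paper), and `ranks`, `pearson`, and `spearman_rho` are my own helper names, not the authors' code.

```python
from math import sqrt

def ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def spearman_rho(xs, ys):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# HYPOTHETICAL per-model data, one entry per model:
# a proxy for how readily the model generates GSM8k items,
# and its GSM8k-minus-GSM1k accuracy gap in percentage points.
generation_score = [-1.9, -1.4, -1.1, -0.8, -0.6, -0.5]
gap_points = [0.5, 3.0, 1.0, 6.0, 2.5, 8.0]

rho = spearman_rho(generation_score, gap_points)
print(f"rho = {rho:.2f}, rho^2 = {rho * rho:.2f}")
```

Because only the ranks matter, the statistic asks whether models that regurgitate more also tend to have bigger gaps, not whether the relationship is linear.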

Why frontier-models-survived isn’t reassuring

The reading most people take from GSM1k, that frontier models held up, usually gets filed as a relief. But I don’t think the relief is earned, and the reason is a little subtle.

Frontier models holding up on rebuilt grade-school math doesn’t mean they weren’t trained on GSM8k. It means their underlying capability already exceeded the GSM8k ceiling, so whatever memorization existed couldn’t lift the score any further. Above the ceiling, memorization and competence converge to the same number. So “no crash” is closer to “this benchmark stopped being informative for the models you actually care about” than to “this benchmark is sound.” Which, if you squint, is just the saturation argument again, dressed differently.

Once a benchmark saturates, the score loses the ability to tell memorization apart from competence at the top, and you can’t recover that separation by staring at the same score harder.
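The ceiling effect is easy to see in a toy model. This is a sketch under a strong simplifying assumption of mine, not anything from the GSM1k paper: an item is answered correctly if the model can genuinely solve it, or, failing that, if it has memorized it. The function names and the 0.30 memorization rate are made up for illustration.

```python
def observed_accuracy(capability, memorized_fraction):
    """Toy model: correct if solvable, or else if memorized. Inputs in [0, 1]."""
    return capability + (1.0 - capability) * memorized_fraction

def contamination_gap(capability, memorized_fraction):
    """Score on the contaminated set minus score on a fresh, equally hard set."""
    fresh = observed_accuracy(capability, 0.0)
    public = observed_accuracy(capability, memorized_fraction)
    return public - fresh

# Same HYPOTHETICAL amount of memorization, different underlying capability.
for capability in (0.60, 0.80, 0.97):
    gap = contamination_gap(capability, memorized_fraction=0.30)
    print(f"capability={capability:.2f}  gap={gap * 100:.1f} points")
```

In this model the gap is `(1 - capability) * memorized_fraction`, so it shrinks toward zero as capability approaches the ceiling even though the amount of memorization never changed. A small gap at the top tells you nothing about how much was memorized.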

What we’ve actually been measuring

Benchmark scores have always been correlation, not verification. You measure how often a model produces the gold answer on a held-out distribution, and that correlates with capability as long as the items weren’t seen, the items are independent, and ranking differences exceed noise. When any of those conditions breaks (contamination, near-duplicates, saturation noise), the correlation degrades quietly. The number on the chart keeps climbing.
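The "ranking differences exceed noise" condition is the one you can check on the back of an envelope. Below is a rough sketch using a two-proportion z-test. It assumes items are independent and treats the two models' runs as independent samples, which is a simplification since both models usually answer the same items. The function names and the 1.96 threshold (roughly 95% confidence) are my choices.

```python
from math import sqrt, inf

def accuracy_standard_error(accuracy, n_items):
    """Binomial standard error of an accuracy measured on n independent items."""
    return sqrt(accuracy * (1.0 - accuracy) / n_items)

def difference_exceeds_noise(acc_a, acc_b, n_items, z_threshold=1.96):
    """Return (is_significant, z) for the gap between two accuracies."""
    se_diff = sqrt(
        accuracy_standard_error(acc_a, n_items) ** 2
        + accuracy_standard_error(acc_b, n_items) ** 2
    )
    if se_diff == 0.0:
        # Both scores are exactly 0 or 1, so there is no sampling spread to compare to.
        return acc_a != acc_b, inf
    z = abs(acc_a - acc_b) / se_diff
    return z > z_threshold, z

# 1,205 is the GSM1k size mentioned above; the accuracies are made-up examples.
for a, b in [(0.92, 0.94), (0.92, 0.99)]:
    significant, z = difference_exceeds_noise(a, b, n_items=1205)
    print(f"{a:.2f} vs {b:.2f}: z={z:.2f}, beyond noise: {significant}")
```

On a set that size, a two-point gap in the low 90s sits right at the edge of what sampling noise alone can produce, which is why rankings inside a saturated band deserve so little weight.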

We never actually had a way to confirm a model learned a thing. Only a way to confirm it has seen enough of the thing-shaped distribution. I think the blog has been bumping into this shape from a couple of directions: for agents in “No evidence, no completion,” where a confident agent report isn’t the same as a confirmed task; and for protocols in “MCP security,” where “the protocol allows it” got mistaken for “it’s safe.” Benchmarks turn out to be another instance of the same thing.

Why “build a harder one” doesn’t fix it

Harder benchmarks address the symptom (saturation), not the disease. They give you a higher ceiling and more discrimination at the top, and they don’t introduce verification. The moment a harder benchmark is public it enters the data stream that trains the next generation. LiveCodeBench-style time-slicing helps a lot (paper), because problems published after the cutoff are by construction unseen — but only for newly trained models. For any frozen checkpoint, today’s time-slice eventually becomes a static benchmark too.
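Time-slicing itself is a small idea, and a sketch makes its limits visible. This is not LiveCodeBench's actual API. The `Problem` class, the `unseen_problems` helper, and the dates are hypothetical, and I'm assuming every problem carries a trustworthy release date.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date  # date the problem was first published

def unseen_problems(problems, training_cutoff, buffer_days=0):
    """Keep problems released strictly after the cutoff plus an optional safety buffer."""
    earliest = training_cutoff + timedelta(days=buffer_days)
    return [p for p in problems if p.released > earliest]

# HYPOTHETICAL problem pool.
pool = [
    Problem("a", date(2025, 1, 10)),
    Problem("b", date(2025, 6, 2)),
    Problem("c", date(2026, 2, 20)),
]

older_model = unseen_problems(pool, training_cutoff=date(2024, 12, 31))
newer_model = unseen_problems(pool, training_cutoff=date(2025, 12, 31))
print(len(older_model), len(newer_model))  # prints: 3 1
```

The same public pool gives three clean problems for the earlier cutoff and one for the later cutoff. The filter only stays useful while new problems keep arriving, and it says nothing about whether the remaining answers were reasoned out or pattern-matched.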

The reframe I’d push, if anything: capability evaluation probably isn’t one artifact you build, score against, and ship. It’s an ongoing protocol with verification baked in. Nothing widely deployed has that yet. Time-sliced benchmarks and private holdouts are the closest analogues, and they’re both partial answers at best.

How to read a 2026 leaderboard

Mostly: look at the absolute number last.

The questions I’ve found more useful, in roughly the order I run them: when were the items released relative to the model’s training cutoff? If they’re older, the score is suspect by default. If there’s a private split, what’s the gap to the public number? A wide gap is contamination smoke. How does a model’s score behave between a saturated benchmark and a contamination-resistant one? A model near the ceiling on MMLU but flat on LiveCodeBench is telling you something about where its lift came from.
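If it helps, the first two of those checks can be written down as a tiny triage function. Treat this strictly as a sketch: the `LeaderboardEntry` fields, the flag wording, and the 3-point gap threshold are all assumptions I picked for illustration, not a standard. The third check (saturated versus contamination-resistant) needs scores from two benchmarks, so I've left it out.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class LeaderboardEntry:
    model: str
    training_cutoff: date
    newest_item_released: date             # release date of the newest benchmark item
    public_score: float                    # percent, 0-100
    private_score: Optional[float] = None  # percent, if a private split exists

def red_flags(entry, max_private_gap=3.0):
    """Return human-readable warnings for one entry. The threshold is arbitrary."""
    flags = []
    if entry.newest_item_released <= entry.training_cutoff:
        flags.append("all items predate the training cutoff")
    if entry.private_score is None:
        flags.append("no private split to compare against")
    elif entry.public_score - entry.private_score > max_private_gap:
        flags.append("public score well above private score")
    return flags

# HYPOTHETICAL entry.
entry = LeaderboardEntry(
    model="example-model",
    training_cutoff=date(2025, 6, 1),
    newest_item_released=date(2024, 1, 1),
    public_score=95.0,
    private_score=88.0,
)
for flag in red_flags(entry):
    print(flag)
```

An empty list doesn't mean the score is trustworthy. It only means none of the cheap checks fired.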

The other habit I’ve half-developed (still working on it, honestly) is to stop letting a single score describe a model’s capability for me. Two models at 92 and 99 on the same saturated benchmark might be indistinguishable on your actual task, or wildly apart. The benchmark won’t tell you which. You have to point them at the task and see, which is annoying, but I haven’t found a shortcut.

What honest evaluation would even look like

The closest analogy I keep coming back to is how good engineering treats correctness claims: tests written by people who didn’t write the implementation, on cases the implementation didn’t get to peek at, with the reasoning checked, not just the final answer. None of that is anywhere near production-ready at frontier scale, and the labs all know it. So I’m not pretending there’s a simple drop-in fix.

The honest near-term answer is a little uncomfortable. Benchmarks aren’t going away. They’re still the cheapest way the field has to compare notes, and they’re useful as long as you don’t load too much on them. If a score stops being a capability claim and starts being one of several lossy signals you weigh against the actual task in front of you, the leaderboard goes from misleading to just lossy — which you can work with, as long as you remember that’s all it is.

