Benchmarks on KbWen Blog

The Real Problem with Benchmark Saturation: Not Measurement, but Verification

KbWen — Mon, 01 Jun 2026 10:00:00 +0800

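TL;DR: The GSM1k study shows that a large share of benchmark saturation is contamination, not real capability gains. But the part worth thinking about more than contamination is this: we have never had a way to verify that a model “learned a thing,” only a way to measure whether it “answers correctly on this distribution.” Each new, harder benchmark leaves the governance side exactly where it was.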

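I open the blog post for every new model release, and they almost all carry the same table. GSM8k at 99%, MMLU at 92%, HumanEval pushing close to 100. After a while it starts to feel like a ritual: each generation looks a little better than the last.

But put that same model back into real work, hand it an internal codebase that never made it to GitHub or meeting notes in a format it hasn’t seen, and it still makes the kind of mistake that makes you sigh. That gap isn’t news anymore. The odd part is that every round of release blogs still writes the scores up as state of the art, which leaves me a bit ??? each time. Over the last few release cycles I’ve slowly ground that uneasy feeling into a firmer idea: saturation is probably not a measurement problem, it’s a verification problem we never solved. What follows is how that idea took shape. It’s a bit roundabout, so bear with me while I think out loud.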

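What saturation actually means

“Saturated” just means the score has nowhere left to climb. On MMLU, frontier models land in a narrow 88% to 94% band, and who is ahead inside that band is very likely just noise. On GSM8k, frontier models already sit at around 99%, and another 0.5 percentage points doesn’t make much of a story. The capability gains are real. The problem is only that the benchmark is no longer tracking the thing it was built to track.

When a ruler tops out, you don’t see the ruler break. You only see the score keep rising. At that height the markings have quietly come unhooked from the capability behind them. The intuitive next step is “build a longer ruler,” and there’s nothing wrong with that intuition, except that this road runs into a structural problem further down. We’ll circle back to it.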

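Evidence we can actually use: GSM1k

In 2024 Scale AI ran the GSM1k study: they wrote 1,250 new grade-school math problems matched in difficulty to GSM8k, then re-ran the evaluation. The number in the abstract is clean: the worst-performing model families dropped 13 percentage points on GSM1k. That is the 13 that shows up in every summary.

The number I like better is a later one that gets cited far less. Honestly, I skipped it on my first read and only caught it when I went back: there is a Spearman correlation, r² = 0.32, between a model’s probability of generating GSM8k samples and its score gap between GSM1k and GSM8k.

In plain language: the better a model remembers the original GSM8k problems, the better it looks on GSM8k and the worse it does on fresh problems of the same difficulty. The Mistral and Phi families were called out, with traces of overfitting in nearly every version; Llama2 and the frontier models of the time were fine. 13 is the headline number. 0.32 is the one that says what the score is actually measuring.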

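The frontier not crashing doesn’t really mean the benchmark is fine

A lot of people read that line about the frontier models not crashing and relax. I relaxed too at first, and only a few minutes later felt something was off. The reason is slightly roundabout, so I’ll try to spell it out.

Frontier models holding up on GSM1k doesn’t necessarily mean they never saw GSM8k. The more accurate reading is that their capability ceiling is already above the ceiling of this problem set, so whether any single problem was memorized contributes nothing at the margin to the final score. At that height, contamination and capability converge to the same score.

So what “didn’t crash” really says is “this benchmark no longer discriminates among frontier models.” That is close to the saturation conclusion itself, and not the same thing as “the benchmark isn’t contaminated.” To line this up with an angle I’ve written about before: in 《大語言模型 LLM：其實做的事情比你想像中更單純》 (“Large Language Models: What They Do Is Simpler Than You Think”) I argued that what a model does is next-token prediction, with no separate internal “I have learned this” state. From the outside we only see whether the output is right; we can’t step inside to see how it got there. So a score that didn’t crash doesn’t mean we know how the model produced its answers.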

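Does releasing an even harder one fix it?

That is the industry’s current reflex. Humanity’s Last Exam (HLE) has 2,500 questions across 100+ subjects. On the Artificial Analysis leaderboard, as of late May 2026, frontier models have already reached the 40s, while human experts land at around 90%. It looks like there’s plenty of headroom left, but that headroom is being eaten at a visibly fast pace. Yes, that really was quick.

LiveCodeBench takes a different route: it collects new problems every week from LeetCode, Codeforces, and AtCoder and slices them by release date (see the LiveCodeBench paper). That is closer to the shape of verification than a static benchmark, but what it really does is push the clock back. For any frozen model, today’s LiveCodeBench eventually turns into a static problem set too.

Harder and newer are the same move under one framing: delay. The structural problem underneath isn’t undone by either, and that’s what I want to pull apart next. I worry about getting abstract from here on, so I’ll go slowly.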

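The structural problem

We have never had a way to verify that a model “learned a thing,” only a way to measure whether it “answers correctly on this distribution.” The two converge when the benchmark is uncontaminated, the items are unseen, and ranking gaps are larger than noise, so most of the time we don’t need to tell them apart. But as soon as any one of those conditions fails, the correlation degrades silently while the number keeps climbing. And we keep citing that number.

I discussed this shape before in the MCP post, under the line that “it’s by design” is not a way to dodge responsibility. Benchmarks are the other face of the same shape: treating “high score” as “strong capability” is like treating “the protocol allows it” as “the behavior is safe.” Both use a convenient convention as if it were evidence. The convention is handy while it holds. Once contamination, prompt injection, or autonomous agent behavior shows up, you look back and find that no layer in the stack was actually verifying anything. The agent version, from the “No evidence, no completion” post, is that a confident report is not confirmed work. The benchmark version is the same: a high score does not by itself mean the capability was verified.

Even writing this I find it a bit roundabout, but it’s the most honest way I can put it for now.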

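So should you still look at leaderboards in 2026?

Yes, just spend less time on the absolute numbers.

Here is what I’ve come to care about more, roughly in the order I run through it. Were the items released after the model’s cut-off? Is there a private split, and how big is the gap between public and private? How far apart is the same model on an already-saturated benchmark versus a contamination-resistant benchmark? The pattern where the former hits the ceiling and the latter can’t keep up is far more useful than the top row of the leaderboard.

The other habit I’ve half-formed (still forming, honestly) is to stop describing what a model “can do” with a single score. A 99 model and a 92 model may be far apart on the thing you need done today, or not differ at all, and the benchmark won’t tell you which. You still have to run it against the task in front of you. There’s no shortcut, which is a little annoying, but that’s where things stand.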

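Where this leaves me

Benchmarks aren’t a bad thing. They give research a common yardstick and give communication its lowest-cost medium, and I’m not denying any of that. It’s more that they’ve been used as a “capability proxy” so smoothly, for so long, that we forgot a benchmark was only ever one slice of a distribution.

The GSM1k paper is two years old now, and the industry’s standard response to saturation is still “release a harder one.” The direction isn’t wrong, but however you walk this road it loops back to the same place. The question that stayed with me after going around once is: how do I convince myself that a model “knows” something and didn’t just “answer correctly”? I don’t have a full answer, and this post won’t pretend to. But I’ll carry the question into the next release blog, and at least the 99% in the table header won’t win me over on sight.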

LLM Benchmark Saturation Isn't a Measurement Problem

KbWen — Mon, 01 Jun 2026 10:00:00 +0800

TL;DR: Benchmark contamination is real and measurable. Scale AI’s GSM1k study showed the worst-performing models dropping 13 percentage points on a rebuilt set. But the deeper failure is that capability evaluation has only ever measured correlation with a test distribution. Harder benchmarks reset the clock. They don’t introduce verification, and verification is what’s actually missing.

If you’ve been reading model-release blog posts for a while, the table on page one starts looking familiar. Classic benchmarks near the top, newer harder ones below, every number a hair better than the last generation’s. The explanation everyone reaches for is the saturation story: older benchmarks got too easy, build harder ones, repeat. HLE, LiveCodeBench, FrontierMath, MMLU-Pro all live inside that story.

Most of it is fine, honestly. I don’t want to spend a whole post complaining about a habit that does buy time. The thing is, the more I sit with the recent leaderboards next to the GSM1k study from a couple of years back, the more I think the saturation story leaves out the piece that actually keeps the cycle running. Which is what I want to walk through here.

The story everyone tells

Let me lay out the standard argument properly first, because the part of it that’s right is doing real work.

It runs roughly like this. Classic benchmarks like MMLU and GSM8k saturated. Frontier scores on MMLU now cluster in an 88-94% band, narrow enough that the ranking differences inside it are mostly noise. GSM8k is functionally solved, with frontier models sitting around 99%. HumanEval is in the same neighborhood. The fix everyone reaches for is to design harder, more current evaluations. Humanity’s Last Exam (HLE) holds 2,500 graduate-and-beyond questions across 100+ subjects, with human experts averaging around 90%. LiveCodeBench collects new contest problems weekly and slices them by release date, so a model can be scored on problems published after its training cutoff. Run those, get a clean signal, swap them when they saturate too.

The steelman is real. Saturation does mean something. Contamination-resistant designs do produce harder signals. The community has bought itself two productive years this way, which isn’t nothing.

Where it stops working

HLE was designed in 2025 to stump frontier reasoning, and by late May 2026 several frontier models are already sitting in the mid-40s on the Artificial Analysis leaderboard. Which, side note: that was fast. Human experts still average around 90% in their own domains, but the gap that looked enormous a year ago is visibly closing.

The “headroom” was never really a property of the benchmark. It was just the gap between current models and the ceiling. Difficulty buys you time. It doesn’t buy you a different kind of measurement, and the cycle keeps quietly asking for one.

What GSM1k actually showed

If you want one piece of evidence that this is structural and not just “we picked bad benchmarks,” it’s the GSM1k study. Scale AI rebuilt 1,250 grade-school math problems matched in style and difficulty to GSM8k, then re-ran a wide model set. The abstract has the headline: the worst-performing model dropped 13 percentage points on the new set. That’s the number that travels.

The one I keep going back to though, and I’ll admit I skimmed past it the first time, is the one a sentence or two later: a Spearman correlation of r² = 0.32 between a model’s probability of generating GSM8k samples and its GSM1k-vs-GSM8k gap. Mistral and Phi families showed consistent overfitting across versions and sizes. Llama2 and the contemporary frontier models did not.

Plain reading: the more a model could regurgitate GSM8k, the better it looked on GSM8k and the worse it looked on a fresh set of equivalent difficulty. The 13 points is the headline. The 0.32 is the thing that says something about what the score actually is.
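For anyone who wants to see what kind of statistic that 0.32 is, here is a minimal sketch of a Spearman rank correlation written from scratch. The five data points are invented for illustration and are not taken from the GSM1k paper, and the function names are my own. The only point is the mechanics: rank both columns, correlate the ranks, then square.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical models, one entry each (invented numbers, not from the paper):
# how readily the model reproduces GSM8k text, and its GSM8k-minus-GSM1k gap.
loglik = [-2.1, -1.4, -1.9, -0.8, -1.1]
gap_points = [0.5, 4.0, 6.5, 9.0, 2.0]

rho = spearman(loglik, gap_points)
print(f"rho = {rho:.2f}, rho^2 = {rho ** 2:.2f}")  # rho = 0.60, rho^2 = 0.36
```

A positive rho here reads the same way as in the study: models that reproduce the old set more readily tend to show a larger drop on the fresh one.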

Why frontier-models-survived isn’t reassuring

The reading most people take from GSM1k, that frontier models held up, gets put down as a relief. I had to read it twice before I stopped reading it that way, so I get the instinct. But I don’t think the relief is earned, and the reason is a little subtle.

Frontier models holding up on rebuilt grade-school math doesn’t mean they weren’t trained on GSM8k. It means their underlying capability already exceeded the GSM8k ceiling, so whatever memorization existed couldn’t lift the score any further. Above the ceiling, memorization and competence converge to the same number. So “no crash” is closer to “this benchmark stopped being informative for the models you actually care about” than to “this benchmark is sound.” Which, if you squint, is just the saturation argument again, dressed differently.

The general shape, if I had to name it in one line: once a benchmark saturates, the score loses the ability to tell memorization apart from competence at the top, and you can’t recover that separation by staring at the same score harder.
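A toy model makes the convergence easier to see. Assume, purely for illustration, that a model gets an item right if it can genuinely solve it (probability `skill`) or, failing that, if it has memorized the answer (probability `memorized`). Both knobs are hypothetical and this is not how any lab models contamination, but the arithmetic shows why the gap between the original set and a fresh set shrinks toward zero as skill approaches the ceiling.

```python
def expected_score(skill: float, memorized: float) -> float:
    """Toy model: correct if solved, or else if memorized.

    Both arguments are hypothetical probabilities in [0, 1].
    """
    return skill + (1.0 - skill) * memorized

def contamination_gap(skill: float, memorized: float) -> float:
    """Score on the original set minus score on a fresh set (nothing memorized)."""
    return expected_score(skill, memorized) - expected_score(skill, 0.0)

# Same amount of memorization, three different levels of real skill.
for skill in (0.60, 0.90, 0.99):
    gap = contamination_gap(skill, memorized=0.30)
    print(f"skill={skill:.2f}  original={expected_score(skill, 0.30):.3f}  gap={gap:.3f}")
```

With memorization held fixed at 0.30, the gap is 12 points at 60% skill and about 0.3 points at 99% skill. The memorization is still there in the last case. The score just has no room left to show it.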

What we’ve actually been measuring

This is the part I sat with the longest, because it’s not obvious until you put it next to a few other things, and then it kind of becomes the only thing you can see.

Benchmark scores have always been correlation, not verification. You measure how often a model produces the gold answer on a held-out distribution, and that correlates with capability as long as the items weren’t seen, the items are independent, and ranking differences exceed noise. When any of those conditions breaks (contamination, near-duplicates, saturation noise), the correlation degrades quietly. The number on the chart keeps climbing.
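The “ranking differences exceed noise” condition can be checked on the back of an envelope. The sketch below uses a Wilson score interval for a single accuracy number and a deliberately crude rule (two scores count as distinguishable only if their intervals don’t overlap). The test-set size of 1,319 is my assumption for the GSM8k test split, and the function names are hypothetical. It covers sampling error only, not prompt-format or label noise.

```python
from math import sqrt

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for an accuracy of correct/total."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

def distinguishable(correct_a: int, correct_b: int, total: int) -> bool:
    """Crude, conservative check: True only if the two intervals do not overlap."""
    lo_a, hi_a = wilson_interval(correct_a, total)
    lo_b, hi_b = wilson_interval(correct_b, total)
    return hi_a < lo_b or hi_b < lo_a

N = 1319  # assumed size of the GSM8k test split

print(distinguishable(1306, 1312, N))  # about 99.0% vs 99.5%: False
print(distinguishable(1100, 1306, N))  # about 83.4% vs 99.0%: True
```

Under these assumptions a half-point difference near 99% sits inside sampling error on its own, before contamination even enters the picture.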

We never actually had a way to confirm a model learned a thing. Only a way to confirm it has seen enough of the thing-shaped distribution. I think the blog has been bumping into this shape from a couple of directions: for agents in “No evidence, no completion,” where a confident agent report isn’t the same as a confirmed task; and for protocols in the MCP security post, where “the protocol allows it” got mistaken for “it’s safe.” Benchmarks turn out to be another instance of the same thing. A convention treated as evidence, until the convention breaks.

Why “build a harder one” doesn’t fix it

Harder benchmarks address the symptom (saturation), not the disease. They give you a higher ceiling and more discrimination at the top, and they don’t introduce verification. The moment a harder benchmark is public it enters the data stream that trains the next generation. LiveCodeBench-style time-slicing helps a lot (see the LiveCodeBench paper), because problems published after the cutoff are by construction unseen, but only for newly trained models. For any frozen checkpoint, today’s time-slice eventually becomes a static benchmark too.
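Here is a minimal sketch of what time-slicing amounts to, assuming each problem carries a release date. The `Problem` class, the problem IDs, and the dates are all hypothetical, and this is not LiveCodeBench’s actual code. It only shows that the “unseen” slice is defined relative to a cutoff, so the same pool yields less and less for models trained later.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Problem:
    problem_id: str   # hypothetical identifier
    released: date    # date the problem became public

def unseen_slice(problems, training_cutoff: date):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]

pool = [
    Problem("contest-a", date(2025, 11, 3)),
    Problem("contest-b", date(2026, 2, 14)),
    Problem("contest-c", date(2026, 5, 9)),
]

earlier_cutoff = date(2026, 1, 31)  # a checkpoint trained before most of the pool
later_cutoff = date(2026, 6, 30)    # a checkpoint trained after the whole pool

print([p.problem_id for p in unseen_slice(pool, earlier_cutoff)])  # ['contest-b', 'contest-c']
print([p.problem_id for p in unseen_slice(pool, later_cutoff)])    # []
```

The second call returns nothing: without a steady supply of new problems, the pool is just another public dataset to any model trained after it.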

The reframe I’d push, if anything: capability evaluation probably isn’t one artifact you build, score against, and ship. It’s an ongoing protocol with verification baked in. Nothing widely deployed has that yet. Time-sliced benchmarks and private holdouts are the closest analogues, and they’re both partial answers at best.

How to read a 2026 leaderboard

Mostly: look at the absolute number last.

The questions I’ve found more useful, in roughly the order I run them: when were the items released relative to the model’s training cutoff? If they’re older, the score is suspect by default. If there’s a private split, what’s the gap to the public number? A wide gap is contamination smoke. How does a model’s score behave between a saturated benchmark and a contamination-resistant one? A model near the ceiling on MMLU but flat on LiveCodeBench is telling you something about where its lift came from.
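The first two questions are mechanical enough to write down as a checklist. This is only a sketch: the `ReportedScore` fields, the benchmark names, and the 3-point gap threshold are all hypothetical choices of mine, not a standard. The third question is left out because it needs a baseline model to compare against.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class ReportedScore:
    benchmark: str
    public_score: float                    # percentage points, 0-100
    items_released: date                   # when the test items became public
    private_score: Optional[float] = None  # score on a private split, if reported

def reading_notes(score: ReportedScore, training_cutoff: date,
                  max_gap: float = 3.0) -> List[str]:
    """Return caveats to attach to a score before reading the absolute number."""
    notes = []
    if score.items_released <= training_cutoff:
        notes.append("items predate the training cutoff")
    if score.private_score is None:
        notes.append("no private split reported")
    elif score.public_score - score.private_score > max_gap:
        notes.append("public score well above private score")
    return notes

cutoff = date(2026, 1, 31)
old = ReportedScore("saturated-set", 99.0, date(2021, 10, 1))
fresh = ReportedScore("time-sliced-set", 61.0, date(2026, 4, 1), private_score=60.2)

print(reading_notes(old, cutoff))    # two caveats
print(reading_notes(fresh, cutoff))  # no caveats
```

An empty list doesn’t certify anything. It only means the score cleared the cheapest checks.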

The other habit I’ve half-developed (still working on it, honestly) is to stop letting a single score describe a model’s capability for me. Two models at 92 and 99 on the same saturated benchmark might be indistinguishable on your actual task, or wildly apart. The benchmark won’t tell you which. You have to point them at the task and see, which is annoying, but I haven’t found a shortcut.

What honest evaluation would even look like

The closest analogy I keep coming back to is how good engineering treats correctness claims: tests written by people who aren’t the implementation, on cases the implementation didn’t get to peek at, with the reasoning checked, not just the final answer. None of that is anywhere near production-ready at frontier scale, and the labs all know it. So I’m not pretending there’s a simple drop-in fix.

The honest near-term answer is a little uncomfortable. Benchmarks aren’t going away. They’re still the cheapest way the field has to compare notes, and they’re useful as long as you don’t load too much on them. If a score stops being a capability claim and starts being one of several lossy signals you weigh against the actual task in front of you, the leaderboard goes from misleading to just lossy. Lossy is something you can live with. Just don’t forget that’s what it is.