Evaluation

LLM Benchmark Saturation Is a Verification Problem

GSM8K sits at 99%, MMLU is in the 88-94% noise band, and HLE is already in the mid-40s as of mid-2026. Each round of harder benchmarks looks like progress, but the structural problem has not changed: we have never verified what a model actually learned, only measured whether it has seen the material before. We measure correlation with a test distribution and call it capability.