LLM Benchmark Saturation Is a Verification Problem
GSM8K sits at 99%, MMLU is stuck in an 88-94% band where score differences are mostly noise, and HLE (Humanity's Last Exam) is already in the mid-40s by mid-2026. Each round of harder benchmarks looks like progress, but the field never solved the underlying problem: we measure correlation with a test distribution and call it capability.