Benchmarks are theater

GPT-5 came out eleven days ago. The benchmark sheet was impressive in the way these sheets always are now: record scores across MMLU, HumanEval, the reasoning suites, the math olympiad tests. The discourse lasted about forty-eight hours before settling into the same verdict it always settles into: okay, but is it actually better for what I’m doing?

The answer, for most people, was: kind of. Maybe. Hard to tell.

That’s where we are with frontier models in the summer of 2025. Every release is SOTA on every benchmark that existed before the release, and none of it tells you much about whether the thing will help you debug a gnarly async bug at 11pm or write a migration script you’d actually trust to run. The gap between the leaderboard number and the thing being useful has never been wider, and nobody wants to say it plainly because the whole ecosystem runs on hype cycles.

what the benchmarks are measuring

MMLU has been saturated for over a year. The models that “crush” it are trained on datasets that include or closely resemble the eval. HumanEval is a collection of toy Python puzzles that no working engineer would mistake for their actual job. The reasoning benchmarks are harder to game but still measure a narrow slice: structured deduction on well-posed problems, not the half-formed ambiguous mess you hand a model in real work.

The benchmark game has a structure. Labs need to announce a new model. A new model needs a story. The story needs numbers. So the numbers go up every time, by construction. You can always find a benchmark where the new model outperforms the last one. If the existing benchmarks are saturated, you add new ones where you know you score well, and you release all the numbers together so the headline reads “best across the board.”

Grok 4 did this in July. Opus 4.1 did this earlier this month. GPT-5 did it. They’re all doing it. It’s not fraud exactly. The models are genuinely getting better at the things the benchmarks measure. It’s just that “better at benchmarks” and “better for building” have been decoupling for a while now and almost nobody in the announcement posts will say so.

what actually matters

The useful delta, when there is one, shows up in texture rather than scores. Claude Code got meaningfully better at holding context across a long multi-file edit somewhere between the Claude 3.5 and Claude 4 models, and the evidence wasn’t a benchmark: you’d notice it stop dropping the thread on a big refactor. o3 is genuinely good at certain kinds of multi-step reasoning in a way earlier models weren’t, and you feel it in cases where the old model would confidently go sideways. GPT-5 seems faster and less prone to the particular failure mode where it restates the question at length before doing anything useful.

None of that is in the benchmark sheet. You find it by using the thing on real work for a week and noticing whether your annoyance level went down.

The honest evaluation is: load up the model, point it at the actual task you’re stuck on, and see if the output is better than what you got before. That’s it. That’s the whole test. It takes twenty minutes and it’s more predictive than any academic eval you’ll read about on launch day.
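If you want that twenty-minute test to leave a record, a few lines of Python are enough. What follows is a minimal sketch, not a real eval harness: `ModelFn`, `run_side_by_side`, and `tally` are made-up names, and each model is assumed to be wrapped in a plain function that takes a prompt and returns text (how you call the actual API is up to you). It runs the same tasks through each model, times them, and counts which model you ended up preferring.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical wrapper type: any function that takes a prompt and returns text.
# Wiring it to a real model API is left to the reader.
ModelFn = Callable[[str], str]


@dataclass
class Trial:
    task: str
    model: str
    output: str
    seconds: float


def run_side_by_side(tasks: List[str], models: Dict[str, ModelFn]) -> List[Trial]:
    """Run every task through every model, recording output and wall-clock time."""
    trials = []
    for task in tasks:
        for name, fn in models.items():
            start = time.perf_counter()
            output = fn(task)
            trials.append(Trial(task, name, output, time.perf_counter() - start))
    return trials


def tally(preferences: Dict[str, str]) -> Dict[str, int]:
    """Count wins per model, given your own task -> preferred-model judgments."""
    counts: Dict[str, int] = {}
    for winner in preferences.values():
        counts[winner] = counts.get(winner, 0) + 1
    return counts


if __name__ == "__main__":
    # Stand-in functions so the sketch runs without any API key.
    models = {"old": lambda p: p.upper(), "new": lambda p: p + "!"}
    for t in run_side_by_side(["fix the async bug", "write the migration"], models):
        print(f"[{t.model}] {t.task!r} -> {t.output!r} ({t.seconds:.4f}s)")
    # The judgment stays manual: read the outputs, then record which you preferred.
    print(tally({"fix the async bug": "new", "write the migration": "new"}))
```

The judging step is deliberately manual. The point is to look at the outputs on your own tasks, not to invent another score.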

The benchmark releases will keep coming. Each one will set new records. The “is this it?” discourse will run its forty-eight hour cycle and collapse. And then the people who actually build things will go back to their editors and figure out, tool by tool, which model is worth the API cost for which kind of work. Same as always.

The leaderboard doesn’t tell you that. Real work does: was the thing faster, less wrong, less annoying? That answer doesn’t fit on a launch post.