Chapter 16 · Evaluation · 8 min
How do we know a model is better?
MMLU, HumanEval, LMSYS Arena. Why measuring LLM intelligence is hard — and why no single benchmark is enough.
How do you know if a model is good?
It seems like a simple question. The answer is complicated.
For a sorting algorithm, it's easy: does it sort correctly? How fast? For a language model, "good" can mean: accurate, honest, helpful, harmless, funny, concise, creative… and these qualities don't always point in the same direction.
Evaluating LLMs is a research field in its own right. Every benchmark captures something true and misses something important.
Automated benchmarks
MMLU — Breadth of knowledge
MMLU (Massive Multitask Language Understanding) covers 57 subjects: medicine, law, chemistry, history, mathematics, philosophy… The questions are multiple choice with four options, scored automatically.
Estimated expert-level human performance: ~90%. The best current models reach 88–89%.
What it measures: the breadth of knowledge stored in the parameters.
What it misses: the ability to reason about genuinely new situations, to acknowledge uncertainty, to detect a poorly framed question.
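To make the protocol concrete, here is a minimal sketch of how a four-option benchmark like MMLU can be scored. The prompt format is typical but illustrative, and ask_model is a placeholder for whatever model API you call; it is expected to return a single letter.

```python
def format_prompt(question: str, choices: list[str]) -> str:
    # Typical MMLU-style prompt: the question, four lettered options, then "Answer:"
    options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    return f"{question}\n{options}\nAnswer:"

def mmlu_accuracy(items: list[dict], ask_model) -> float:
    # Each item: {"question": str, "choices": [str, str, str, str], "answer": "A".."D"}
    # ask_model stands in for your model call and should return a letter.
    correct = 0
    for item in items:
        prediction = ask_model(format_prompt(item["question"], item["choices"]))
        correct += prediction.strip().upper().startswith(item["answer"])
    return correct / len(items)
```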
HumanEval — Code
164 Python programming problems. The model generates a function, automated unit tests verify it works. The standard metric is pass@k: generate k candidate solutions per problem (often k=1 or k=10), and count a success if at least one passes. pass@1 measures reliability, pass@10 measures raw capability.
What it measures: the ability to produce functional code on well-defined problems.
What it misses: the reality of development — understanding a bug in a 50,000-line codebase, refactoring, documenting.
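In practice, pass@k is not computed by sampling exactly k solutions once. The evaluator draws n ≥ k samples per problem, counts how many pass the tests, and applies the unbiased estimator from the original HumanEval paper. A minimal version:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples generated, c passed the tests.

    Probability that at least one of k samples drawn from the n passes:
    1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # too few failures to fill a k-subset, so at least one success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples for a problem, 37 of them pass the unit tests
print(pass_at_k(n=200, c=37, k=1))   # ≈ 0.185
print(pass_at_k(n=200, c=37, k=10))  # ≈ 0.88 (larger k is far more forgiving)
```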
MATH & GSM8K — Mathematics
MATH: 12,500 competition-level math problems, written in LaTeX. GSM8K: 8,500 grade-school word problems that take a few steps of arithmetic.
What it measures: multi-step mathematical reasoning.
What it misses: mathematical creativity, formal proof, discovery.
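Grading GSM8K usually comes down to a string comparison on the final number: gold solutions end with "#### <answer>", and the model's last number is extracted and compared. A minimal sketch; the regex is deliberately crude, and real harnesses are fussier about formats:

```python
import re

def final_number(text: str) -> str | None:
    """GSM8K gold answers end with '#### <number>'; for model output, take the last number."""
    if "####" in text:
        text = text.split("####")[-1]
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return numbers[-1].replace(",", "") if numbers else None

def gsm8k_correct(model_output: str, gold_answer: str) -> bool:
    # Exact match on the final numeric value, the usual (and admittedly crude) grading rule.
    return final_number(model_output) == final_number(gold_answer)

print(gsm8k_correct("She sells 16 - 5 = 11 muffins, so $11.", "...reasoning... #### 11"))  # True
```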
The human benchmark: LMSYS Arena
Arena is different. Anonymous humans pose any question to two models (shown without their names), read both responses, and vote for the one they prefer. An Elo-style rating emerges from thousands of these duels.
It's the only benchmark that measures what humans actually prefer — in all their subjectivity. Ideal length, tone, format, humor, perceived honesty.
What it measures: overall human preference.
What it misses: factual accuracy (humans don't always know which answer is correct), specialized tasks, reproducibility.
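The rating itself follows the classic Elo update from chess: each duel nudges the winner up and the loser down by an amount that depends on how surprising the result was. (The current leaderboard is actually computed with a Bradley-Terry fit over all battles rather than this online rule, but the rule conveys the intuition.)

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """One Elo update after a duel: score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Example: a 1200-rated model beats a 1250-rated one and gains about 18 points
print(elo_update(1200, 1250, score_a=1.0))  # (≈1218.3, ≈1231.7)
```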
Explore the radar
Here are five major models compared across six benchmarks. Click on a model to see its detailed scores, or on a benchmark to understand what it evaluates.
Each axis is a benchmark. Models have different profiles — strong in code, weak in long-form reasoning, or the reverse. No single radar gives the verdict: you have to combine objective benchmarks with human preferences.
What the radar reveals
Look carefully at the patterns:
No dominant model. Claude 3.5 Sonnet leads on HumanEval and BBH. GPT-4o dominates Arena and MATH. Llama 3.1 70B is competitive but behind proprietary models on almost everything.
Arena and academic benchmarks don't correlate perfectly. A model can be excellent on MMLU and average on Arena — and vice versa. Humans appreciate something different from academic accuracy.
Benchmarks are saturating. MMLU was hard in 2020 (GPT-3: 43%). By 2024, all major models are between 82 and 89%. Differentiation comes from elsewhere.
The fundamental problems of evaluation
Data contamination
If training data contains benchmark answers, the model has "cheated" without knowing it. This is a serious problem with public datasets like MMLU.
The solution: private benchmarks, regularly updated, whose questions don't circulate online. Hard to maintain at scale.
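One standard detection heuristic is n-gram overlap: if long word sequences from a benchmark item appear verbatim in the training corpus, the item is flagged as contaminated (13-grams are a common choice of window). A rough sketch; real decontamination pipelines work on tokenized, deduplicated corpora at much larger scale:

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    # Word-level n-grams; 13 words is a commonly used window for contamination checks.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, training_docs, n: int = 13) -> bool:
    """Flag the item if any of its n-grams appears verbatim in a training document."""
    item_grams = ngrams(benchmark_item, n)
    return bool(item_grams) and any(item_grams & ngrams(doc, n) for doc in training_docs)
```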
Benchmark hacking
Some labs optimize their models on benchmarks rather than for the capabilities they're supposed to measure. A model can learn to recognize the format of an MMLU question without truly understanding the material.
This is Goodhart's problem: when a measure becomes a target, it ceases to be a good measure.
The human preference question
Arena suffers from a bias: humans tend to prefer long, formatted responses (bullet points, headers, examples) even when a short answer would be more useful. Models that optimize for Arena become verbose.
What no benchmark measures
- The ability to detect an ambiguous question and ask for clarification.
- Honesty: knowing how to say "I don't know" instead of making things up.
- Consistency over long conversations.
- Causal reasoning in genuinely new situations.
- Adaptation to the user's context.
These qualities are difficult to measure automatically — and yet they're often the ones that matter most in practice.
Toward new evaluation paradigms
Research is exploring several directions:
LLM-as-a-judge: use a powerful LLM to evaluate another's responses. Scalable, but circular: the judge's biases contaminate the evaluation (a minimal sketch follows below).
Adversarial benchmarks: humans actively try to trick models. Measures robustness, not just capabilities under normal conditions.
Real-task evaluation:
- SWE-Bench — real GitHub bugs to fix in existing codebases. The model gets a repo, a bug description, and must produce a patch that passes the tests. Much harder than HumanEval.
- GAIA — multi-step questions requiring reasoning, web search, file manipulation. Measures agentic capability.
- GPQA (Graduate-Level Google-Proof QA) — physics, chemistry, biology questions at PhD level, designed so you can't answer them via Google search. Distinguishes models that reason from ones that retrieve.
- ARC-AGI — abstract visual puzzles, designed to measure general reasoning over novel concepts. No model passed a human-level threshold until late 2024.
- Humanity's Last Exam — questions at the level of the world's best researchers, in domains where classical benchmarks are saturated.
Continuous automated evaluation: systems that continuously generate new questions to track model evolution.
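To make the LLM-as-a-judge idea above concrete, here is a minimal pairwise sketch. call_judge is a hypothetical stand-in for whatever judge model you call, and the prompt is illustrative; judging twice with the answers swapped is a standard way to reduce position bias.

```python
JUDGE_PROMPT = """You are grading two answers to the same question.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Reply with a single letter: A if A is better, B if B is better, T for a tie."""

def pairwise_verdict(question: str, ans_1: str, ans_2: str, call_judge) -> str:
    """Judge twice with the answers swapped; if the two verdicts disagree, call it a tie."""
    first = call_judge(JUDGE_PROMPT.format(question=question, answer_a=ans_1, answer_b=ans_2)).strip().upper()
    second = call_judge(JUDGE_PROMPT.format(question=question, answer_a=ans_2, answer_b=ans_1)).strip().upper()
    second_unswapped = {"A": "B", "B": "A"}.get(second, "T")  # map the swapped verdict back
    return first if first == second_unswapped else "T"
```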
The golden rule
No single benchmark will tell you whether a model is suited to your use case.
The best evaluation is always the same: build a dataset of your own real use cases, evaluate models on it, and compare on what matters to you — not on what matters for leaderboards.
Benchmarks are proxies. The only real test is your problem.
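In practice, that dataset can start as nothing more than a JSONL file of prompts plus a check per case. A minimal harness sketch; run_model and the must_contain rule are placeholders to replace with your own model call and your own definition of correct:

```python
import json

def evaluate(cases_path: str, run_model) -> float:
    """cases_path: JSONL file, one case per line, e.g. {"prompt": "...", "must_contain": "..."}."""
    passed = total = 0
    with open(cases_path, encoding="utf-8") as f:
        for line in f:
            case = json.loads(line)
            output = run_model(case["prompt"])
            # Swap this check for whatever "correct" means for your task:
            # exact match, unit tests, a rubric score, or human review.
            passed += case["must_contain"].lower() in output.lower()
            total += 1
    return passed / total
```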