Hallucinations Aren’t Magic: Why LLMs Make Confident Mistakes and How Benchmarks Encourage Them
Hallucinations as a statistical inevitability
Large language models (LLMs) often produce confident but incorrect outputs that look plausible. Recent OpenAI research explains these hallucinations as natural consequences of how we train generative models. Even with clean data, the cross-entropy objective used in pretraining creates statistical pressures that lead to errors similar to misclassifications in supervised learning.
Framing the problem with a simple task
The researchers reduce the phenomenon to a supervised binary classification problem called Is-It-Valid (IIV): decide whether a candidate output is valid or erroneous. They prove a striking bound: the generative error rate of an LLM is at least twice its IIV misclassification rate. So whenever validity is hard to classify, generation must also err, and hallucinations arise for the same fundamental reasons supervised models make mistakes: epistemic uncertainty, limited model capacity, distribution shift, or noisy data.
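Stated schematically (this paraphrases the bound and omits the additive correction terms of the full theorem; the err symbols are shorthand introduced here, not the paper's notation):

\[
\mathrm{err}_{\text{generative}} \;\ge\; 2 \cdot \mathrm{err}_{\text{IIV}}
\]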
Why rare facts are harder to trust
A key driver is the singleton rate, the share of facts that appear only once in the training corpus. By analogy with Good–Turing missing-mass estimation, if 20% of facts are singletons, the model should be expected to hallucinate on at least 20% of queries about such facts. This explains the gap in behavior between well-attested facts (for example, Einstein’s birthday) and obscure facts that show up only rarely in the training data.
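As a rough illustration of the Good–Turing intuition, here is a toy sketch (not the paper's estimator; the corpus and fact counts below are invented purely for illustration):

```python
from collections import Counter

# Toy "training corpus" of atomic facts, e.g. (person, birthday) pairs.
# Counts are invented for illustration only.
facts = [
    ("Einstein", "1879-03-14"),  # well-attested: appears many times
] * 50 + [
    ("Person_A", "1912-07-02"),  # singletons: each appears exactly once
    ("Person_B", "1968-11-23"),
    ("Person_C", "1905-01-30"),
]

counts = Counter(facts)
n_observations = sum(counts.values())
n_singletons = sum(1 for c in counts.values() if c == 1)

# Good-Turing missing-mass estimate: the fraction of observations that are
# singletons approximates the probability mass of facts the model has not
# effectively learned, which serves as a lower bound on its hallucination
# rate for this kind of arbitrary fact.
singleton_rate = n_singletons / n_observations
print(f"singleton rate ≈ {singleton_rate:.2%} -> lower bound on hallucination rate")
```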
Model family limitations cause systematic errors
Hallucinations also stem from representational limitations of the model family. Classic examples include n-gram models producing ungrammatical sequences, or subword-tokenized models miscounting letters because individual characters are hidden inside tokens. When the model class cannot express a pattern, systematic errors appear even when the training data contains the necessary information.
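To see why character-level questions are awkward for subword models, here is a small sketch using the tiktoken library (assumed to be installed; the exact token split depends on the encoding and is shown only for illustration):

```python
import tiktoken

# A GPT-style byte-pair encoding; the model consumes integer token IDs,
# not individual characters.
enc = tiktoken.get_encoding("cl100k_base")

word = "strawberry"
token_ids = enc.encode(word)
pieces = [enc.decode([t]) for t in token_ids]

print("token IDs:", token_ids)
print("subword pieces:", pieces)
# Counting the letter 'r' requires reasoning across piece boundaries,
# which the token sequence never makes explicit.
print("actual 'r' count:", word.count("r"))
```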
Post-training helps but doesn’t solve overconfidence
Post-training techniques such as RLHF, DPO, and RLAIF can mitigate harmful or biased outputs, reducing some forms of hallucination. However, overconfident hallucinations persist because of how we evaluate models. If benchmarks reward confident answers over calibrated uncertainty, models will learn to bluff.
How evaluation and leaderboards encourage guessing
Most popular benchmarks use binary scoring: correct answers get full credit, abstentions get none, and incorrect answers are not penalized more than abstentions. Under this regime, guessing maximizes scores even when the model is unsure. A model that truthfully abstains can score worse than one that confidently guesses, producing systemic incentives to prioritize confidence over calibration.
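A quick back-of-the-envelope calculation makes the incentive explicit (a minimal sketch; the confidence values are arbitrary):

```python
def expected_binary_score(p_correct: float) -> float:
    """Expected score under binary grading: 1 for a correct answer,
    0 for a wrong answer, 0 for abstaining."""
    return p_correct * 1.0 + (1.0 - p_correct) * 0.0

for p in (0.9, 0.5, 0.1):
    guess_ev = expected_binary_score(p)
    abstain_ev = 0.0
    print(f"confidence {p:.0%}: guess EV = {guess_ev:.2f}, abstain EV = {abstain_ev:.2f}")
# Guessing is never worse than abstaining, even at 10% confidence,
# so a score-maximizing model always answers.
```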
Concrete evaluation reforms to reduce hallucinations
The paper argues for socio-technical fixes to evaluation design. Benchmarks should set explicit confidence targets and score accordingly. For example: answer only if you are more than 75% confident; mistakes lose 3 points, correct answers earn 1, and ‘I don’t know’ earns 0. With that penalty, guessing below the 75% target loses points in expectation, so the stated threshold and the scoring rule pull in the same direction. This kind of scoring mirrors older exam formats that penalized blind guessing: it encourages models to abstain when uncertain, improving calibration while still rewarding correct answers.
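Under this rubric the earlier calculation flips whenever confidence falls below the target. A sketch, assuming the penalty is set to t/(1−t) so that the confidence target t is exactly the break-even point:

```python
def expected_score(p_correct: float, target: float = 0.75) -> float:
    """Expected score when answering: +1 if correct, -target/(1-target)
    if wrong. Abstaining ('I don't know') scores 0."""
    penalty = target / (1.0 - target)  # 3 points for a 75% target
    return p_correct * 1.0 - (1.0 - p_correct) * penalty

for p in (0.9, 0.75, 0.6):
    answer_ev = expected_score(p)
    print(f"confidence {p:.0%}: answer EV = {answer_ev:+.2f}, abstain EV = +0.00")
# Above the 75% target, answering pays; below it, abstaining has the
# better expected score, so calibrated uncertainty is rewarded.
```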
Broader implications for model development
Reframing hallucinations as predictable outcomes of training objectives and misaligned evaluation shifts responsibility from mysterious model behavior to the way we assess and reward models. The findings suggest that pretraining makes some hallucinations inevitable, and current post-training plus benchmarking practices can reinforce them. Adjusting benchmarks to reward honest uncertainty and penalize confident errors can realign incentives and improve trustworthiness.
Further resources
The original paper expands on the proofs and empirical evidence summarized here. The authors also provide a GitHub page with tutorials, code, and notebooks for reproducing the experiments and testing alternative evaluation schemes.