
Signal vs Noise: Boosting LLM Decision Reliability with SNR

Ai2 introduces an SNR framework to quantify benchmark reliability for LLMs and shows practical interventions, such as subtask filtering, checkpoint averaging, and BPB metrics, that improve decision accuracy and scaling-law predictions.

Why evaluation reliability matters

Evaluating large language models (LLMs) is expensive and time-consuming. As models scale, choosing the right benchmarks and metrics becomes crucial not just for leaderboard positions, but for making development decisions that actually transfer from small experiments to production-scale models. Recent research from the Allen Institute for AI (Ai2) reframes benchmarking around two core concepts: signal and noise, and their ratio, the signal-to-noise ratio (SNR). That perspective yields practical ways to reduce uncertainty and improve evaluation reliability.

What signal and noise mean in practice

Signal

Signal captures a benchmark's ability to separate stronger models from weaker ones. Concretely, it measures how widely model scores spread on a task. High signal means scores are well dispersed, making it easier to rank models and pick winners. Low signal produces tightly clustered scores that make meaningful comparisons difficult.

Noise

Noise is the variability in benchmark scores caused by randomness in training: different initializations, data ordering, or fluctuations between checkpoints in the same run. High noise undermines reliability because repeated runs can give inconsistent rankings even with identical configurations.

The importance of SNR

Ai2's key point is that neither signal nor noise alone determines a benchmark's usefulness; their ratio does. A high signal-to-noise ratio (SNR) indicates that a benchmark consistently produces trustworthy comparisons and is suitable for making development choices that will generalize to larger scales.

How SNR affects common development decisions

Two scenarios where benchmark reliability is critical:

  • Decision accuracy: Training multiple small-scale models with different recipes and selecting the best candidate to scale up. The central question is whether the small-scale ranking predicts large-scale performance.
  • Scaling law prediction: Fitting a scaling law on small models to forecast a much larger model's performance.

Ai2 shows that a benchmark's SNR predicts both its decision accuracy (R^2 = 0.626) and its scaling-law prediction error (R^2 = 0.426). Low-signal or high-noise benchmarks increase the risk that small-scale findings will not hold at production scale.
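
One natural way to operationalize decision accuracy (an assumption here, since the article does not spell out the exact computation) is the fraction of model pairs whose small-scale ordering matches their large-scale ordering. A minimal sketch with hypothetical scores:

```python
from itertools import combinations

def decision_accuracy(small_scores, large_scores):
    """Fraction of model pairs whose small-scale ordering matches their
    large-scale ordering (higher scores are better at both scales)."""
    pairs = list(combinations(range(len(small_scores)), 2))
    agree = sum(
        (small_scores[i] > small_scores[j]) == (large_scores[i] > large_scores[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical benchmark scores for five training recipes, measured at
# small scale and again after scaling up.
small = [0.42, 0.45, 0.47, 0.50, 0.52]
large = [0.55, 0.58, 0.57, 0.63, 0.66]
print(f"decision accuracy = {decision_accuracy(small, large):.2f}")  # 0.90: one of ten pairs flips order at scale
```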

Measuring signal and noise

Practical definitions used by Ai2:

  • Signal: the maximum difference (dispersion) in scores between any two models in a comparable population, normalized by the mean score.
  • Noise: estimated as the relative standard deviation of scores across the final checkpoints of a single model's training.

SNR can be computed as Relative Dispersion (Signal) divided by Relative Standard Deviation (Noise). Checkpoint-to-checkpoint noise correlates well with other randomness sources like initialization and data order, making it a practical proxy for overall noise.
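
A minimal sketch of these definitions, assuming a set of benchmark scores for comparable models and one model's scores over its last few checkpoints (all numbers below are hypothetical):

```python
import numpy as np

def relative_dispersion(model_scores):
    """Signal: largest gap between any two models' scores, normalized by the mean score."""
    s = np.asarray(model_scores, dtype=float)
    return (s.max() - s.min()) / s.mean()

def relative_std(checkpoint_scores):
    """Noise: relative standard deviation across one model's final checkpoints."""
    s = np.asarray(checkpoint_scores, dtype=float)
    return s.std(ddof=1) / s.mean()

def snr(model_scores, checkpoint_scores):
    """Signal-to-noise ratio: relative dispersion divided by relative standard deviation."""
    return relative_dispersion(model_scores) / relative_std(checkpoint_scores)

# Hypothetical accuracies for a population of comparable models, plus one
# model's accuracy over its last five checkpoints.
model_scores = [0.41, 0.47, 0.52, 0.58, 0.63]
final_checkpoints = [0.51, 0.52, 0.50, 0.53, 0.52]
print(f"SNR = {snr(model_scores, final_checkpoints):.1f}")  # about 19 for these hypothetical numbers
```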

Interventions that raise SNR

Ai2 tests several practical methods to increase SNR and thus evaluation reliability:

  1. Filtering subtasks by SNR. Multi-task benchmarks are often averages over many subtasks. Selecting a subset of high-SNR subtasks instead of using the full set can dramatically improve SNR and decision accuracy. For example, using the top 16 of 57 MMLU subtasks produced higher SNR and better predictions than the full suite, and also filtered out subtasks with high labeling error. A small selection sketch follows.
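
A minimal sketch of such a filter, assuming per-subtask SNR values have already been estimated; the helper name and the numbers are illustrative, not Ai2's code:

```python
def top_snr_subtasks(subtask_snr, k=16):
    """Keep the k subtasks with the highest SNR; report the benchmark's
    macro-average over these subtasks only."""
    ranked = sorted(subtask_snr, key=subtask_snr.get, reverse=True)
    return ranked[:k]

# Hypothetical per-subtask SNR estimates for a multi-task benchmark.
subtask_snr = {
    "abstract_algebra": 0.8,
    "astronomy": 4.3,
    "marketing": 6.0,
    "virology": 1.1,
}
print(top_snr_subtasks(subtask_snr, k=2))  # ['marketing', 'astronomy']
```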

  2. Averaging checkpoint scores. Averaging scores across multiple final checkpoints, or using an exponential moving average during training, reduces transient noise. This approach consistently increased decision accuracy (an example improvement of 2.4% is reported) and lowered scaling-law prediction errors across benchmarks. A sketch of both variants follows.
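
A minimal sketch of the two smoothing variants, assuming a list of per-checkpoint benchmark scores; the function names and smoothing constant are illustrative, not Ai2's exact procedure:

```python
import numpy as np

def last_k_average(checkpoint_scores, k=5):
    """Average the benchmark score over the last k checkpoints instead of
    trusting the single final checkpoint."""
    return float(np.mean(checkpoint_scores[-k:]))

def ema(checkpoint_scores, alpha=0.3):
    """Exponential moving average of scores over successive checkpoints."""
    smoothed = checkpoint_scores[0]
    for score in checkpoint_scores[1:]:
        smoothed = alpha * score + (1 - alpha) * smoothed
    return smoothed

# Hypothetical per-checkpoint scores near the end of training.
scores = [0.49, 0.52, 0.50, 0.53, 0.51, 0.52]
print(round(last_k_average(scores), 3), round(ema(scores), 3))
```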

  3. Using continuous metrics like bits-per-byte (BPB). Classification metrics such as accuracy discard information present in continuous model outputs. Switching to a continuous metric like BPB (related to perplexity) can greatly boost SNR on generative tasks. Ai2 reports SNR improvements such as GSM8K rising from 1.2 to 7.0 and MBPP from 2.0 to 41.8, with corresponding decision accuracy gains (MBPP from 68% to 93%, Minerva MATH from 51% to 90%). A BPB sketch follows.
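
Read this way, BPB converts the model's summed negative log-likelihood over the reference text into bits and normalizes by the text's length in UTF-8 bytes, which keeps the metric comparable across tokenizers. A minimal, illustrative sketch:

```python
import math

def bits_per_byte(total_nll_nats, total_utf8_bytes):
    """Convert a summed negative log-likelihood (in nats) over the reference
    answers into bits, then normalize by the answers' length in bytes."""
    return total_nll_nats / (math.log(2) * total_utf8_bytes)

# Hypothetical totals accumulated over a benchmark's reference answers.
print(f"BPB = {bits_per_byte(total_nll_nats=5400.0, total_utf8_bytes=3200):.2f}")  # about 2.43
```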

Practical guidelines for practitioners

  • Prefer benchmarks and subtasks with high SNR when making development choices intended to transfer to larger scales.
  • Quality beats quantity: a smaller set of high-SNR subtasks can outperform a larger but noisier benchmark.
  • Smooth out randomness by averaging checkpoints or using EMA during training to reduce noise-driven errors.
  • Use continuous evaluation metrics like BPB for generative and difficult tasks to improve SNR and ranking stability.

Ai2 supplements these findings with a public dataset of roughly 900,000 evaluations across 465 open models, providing a rich resource for further research and practical benchmarking improvements.
