
REST Framework: Pushing Large Reasoning Models to Their Limits with Multi-Problem Stress Testing

REST is a new evaluation framework that tests large reasoning models with multiple problems simultaneously, revealing performance drops and providing deeper insights into their real-world multitasking abilities.

Limitations of Current Evaluation Benchmarks for Large Reasoning Models

Large Reasoning Models (LRMs) have shown remarkable capabilities in domains such as mathematics, coding, and scientific reasoning. However, traditional benchmarks like GSM8K and MATH primarily focus on evaluating these models with single questions at a time. This approach has two major drawbacks. First, it leads to benchmark saturation where many top models achieve near-perfect scores, making it difficult to differentiate improvements. Second, single-question testing fails to simulate real-world scenarios where models must handle multiple problems simultaneously, reflecting the cognitive load and dynamic reasoning required in practical applications.

Introducing REST: A Multi-Problem Stress-Testing Framework

REST (Reasoning Evaluation through Simultaneous Testing) is a novel framework designed to evaluate LRMs by presenting multiple questions simultaneously within a single prompt. This method increases the stress level on LRMs, pushing their reasoning abilities beyond isolated problem-solving. REST reconstructs existing benchmarks by concatenating multiple problems, allowing control over how many questions are tested at once. It evaluates critical skills such as contextual priority allocation, resistance to cross-problem interference, and dynamic cognitive load management. The framework has been validated on 34 advanced LRMs ranging from 1.5 billion to 671 billion parameters, across seven diverse benchmarks from easy to challenging.
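The core mechanism can be sketched in a few lines. This is a minimal illustration of concatenating several benchmark problems into one prompt, assuming a simple numbered template; the paper's exact instruction wording and formatting may differ.

```python
# Hypothetical sketch of REST-style prompt construction: pack several
# benchmark questions into a single prompt. The template wording is an
# assumption for illustration, not the framework's exact format.
def build_rest_prompt(problems: list[str]) -> str:
    """Concatenate multiple problems into one multi-question prompt."""
    header = (
        f"Solve the following {len(problems)} problems. "
        "Answer each one separately, labeled by its number.\n\n"
    )
    body = "\n\n".join(
        f"Problem {i + 1}: {p}" for i, p in enumerate(problems)
    )
    return header + body

# A "stress level" of 3 packs three questions into one prompt.
questions = [
    "What is 7 * 8?",
    "Simplify 12/18.",
    "What is the sum of the first 10 positive integers?",
]
prompt = build_rest_prompt(questions)
```

Varying the number of problems per prompt is what lets REST dial the cognitive load up or down on the same underlying benchmark.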

Key Findings from REST Evaluations

REST reveals significant insights about LRM capabilities:

  1. Performance Drops Under Multi-Problem Stress: State-of-the-art models like DeepSeek-R1 suffer accuracy declines of up to 30% on challenging benchmarks such as AIME24 when tested with REST, highlighting limitations in multitasking ability.

  2. Enhanced Differentiation Between Similar Models: REST amplifies performance gaps that single-question tests mask. For example, on MATH500, R1-7B’s accuracy falls to 66.75% under REST while R1-32B maintains 88.97%, a gap of more than 22 percentage points.

  3. Post-Training Does Not Guarantee Robustness: Models fine-tuned with reinforcement learning or supervised tuning on single problems often lose their advantages in multi-problem scenarios, suggesting a need to rethink training strategies.

  4. Benefits of Long2Short Training: Models trained with long2short techniques, which promote concise and efficient reasoning, maintain better accuracy under REST stress tests, indicating a promising direction for improving multi-problem reasoning.

Realistic Reasoning Challenges Simulated by REST

By presenting multiple problems simultaneously, REST simulates real-world cognitive demands requiring models to prioritize context, avoid overthinking, and resist interference. It also identifies common failure modes such as question omission, summary errors, and reasoning mistakes that single-question evaluations fail to capture.

Evaluation Setup and Benchmark Scope

REST tested 34 models from 1.5B to 671B parameters on benchmarks including:

  • Simple: GSM8K
  • Medium: MATH500, AMC23
  • Challenging: AIME24, AIME25, GPQA Diamond, LiveCodeBench

Model generation parameters follow official guidelines, with output token limits of up to 32K. The OpenCompass toolkit ensures standardized and reproducible evaluation.
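Scoring a multi-problem response reduces to extracting an answer per question and counting matches. Here is a hedged sketch under an assumed "Answer N: ..." output format; the actual evaluation relies on the OpenCompass toolkit and each benchmark's own answer-extraction rules.

```python
import re

# Hypothetical per-question scoring for a multi-problem response.
# The "Answer N: ..." format is an assumption for illustration only.
def score_response(response: str, gold: list[str]) -> float:
    """Return the fraction of questions answered correctly.

    Unanswered questions (question omission, a failure mode REST
    surfaces) count as wrong.
    """
    # Map question number -> extracted answer string.
    answers = dict(re.findall(r"Answer (\d+):\s*(\S+)", response))
    correct = sum(
        answers.get(str(i + 1)) == g for i, g in enumerate(gold)
    )
    return correct / len(gold)

response = "Answer 1: 56\nAnswer 3: 55"  # question 2 omitted
score = score_response(response, ["56", "2/3", "55"])  # 2 of 3 correct
```

Counting omissions as errors is what makes multi-problem accuracy strictly harder than averaging single-question scores, since a model can silently drop questions under load.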

REST’s Impact on Future LRM Development

REST revitalizes existing benchmarks without costly replacements, reflects realistic multi-task demands, and guides model improvements by emphasizing training methods like Long2Short. It sets a new standard for robust, application-relevant evaluation of large reasoning models.

For more information, check out the Paper, Project Page, and Code released by the research team.
