
OMEGA Benchmark: Testing the Creative Limits of AI in Math Reasoning

OMEGA is a novel benchmark designed to probe the reasoning limits of large language models in mathematics, focusing on exploratory, compositional, and transformational generalization.

Challenges in Mathematical Reasoning for Large Language Models

Large language models (LLMs) like DeepSeek-R1 have demonstrated strong capabilities in Olympiad-level mathematics through long chain-of-thought (CoT) reasoning. However, these models often fall back on memorized techniques, such as rote algebraic manipulation or defaulting to coordinate geometry, which limits their ability to solve problems that demand genuine mathematical creativity. In addition, current math datasets mix a wide range of topics and difficulty levels, making it hard to isolate specific reasoning skills and to analyze how reinforcement learning (RL) develops them.

Limitations of Existing Math Benchmarks

Existing benchmarks target out-of-distribution and compositional generalization, i.e., handling unseen test distributions and combining learned skills. Popular datasets include human-written problems (GSM8K, MinervaMath), exam collections (AIME, OlympiadBench), and scraped corpora (NuminaMath, BigMath). Despite their variety, these datasets are either not challenging enough or too heterogeneous to support fine-grained analysis of individual reasoning skills.

Introducing OMEGA: A Controlled Benchmark for Reasoning Skills

A team of researchers from the University of California, Ai2, the University of Washington, and dmodel.ai developed OMEGA, a benchmark that evaluates three dimensions of out-of-distribution generalization inspired by Boden’s creativity typology: exploratory, compositional, and transformational reasoning. OMEGA uses matched training and test pairs built from engineered templates that control problem diversity, complexity, and the reasoning strategies required. It comprises 40 templated problem generators spanning six mathematical domains: arithmetic, algebra, combinatorics, number theory, geometry, and logic & puzzles.
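
To make the template-based design concrete, the sketch below shows how a single problem generator with an explicit complexity knob could produce matched train/test pairs. This is a minimal illustration, not OMEGA's actual code: the Problem dataclass, the arithmetic-chain template, and the matched_pair helper are assumptions introduced for this example.

```python
# Illustrative sketch (not OMEGA's implementation): a templated problem
# generator whose difficulty is set by a single complexity parameter, so
# matched train/test sets can be drawn at chosen complexity levels.
import random
from dataclasses import dataclass

@dataclass
class Problem:
    domain: str
    complexity: int
    question: str
    answer: str

def arithmetic_chain(complexity: int, rng: random.Random) -> Problem:
    """One templated generator: a chain of `complexity` operations, applied left to right."""
    value = rng.randint(2, 9)
    expr = str(value)
    for _ in range(complexity):
        op, operand = rng.choice("+-*"), rng.randint(2, 9)
        value = value + operand if op == "+" else value - operand if op == "-" else value * operand
        expr += f" {op} {operand}"
    return Problem(
        domain="arithmetic",
        complexity=complexity,
        question=f"Evaluate {expr}, applying the operations strictly left to right.",
        answer=str(value),
    )

def matched_pair(train_complexity: int, test_complexity: int, n: int, seed: int = 0):
    """Matched train/test sets from the same template, differing only in complexity."""
    rng = random.Random(seed)
    train = [arithmetic_chain(train_complexity, rng) for _ in range(n)]
    test = [arithmetic_chain(test_complexity, rng) for _ in range(n)]
    return train, test
```

Sweeping the complexity parameter of a single template, rather than mixing many sources, is what lets the benchmark attribute failures to a specific reasoning skill.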

Evaluation of Leading Models and Reinforcement Learning

The study evaluated four state-of-the-art models (DeepSeek-R1, Claude-3.7-Sonnet, OpenAI-o3-mini, and OpenAI-o4-mini) across a range of complexity levels. Reinforcement learning experiments applied the GRPO algorithm to 1,000 training problems using the Qwen2.5-7B-Instruct and Qwen2.5-Math-7B models (a minimal configuration sketch follows the list below), in three generalization settings:

  • Exploratory generalization involved training on lower-complexity problems and testing on higher-complexity instances of the same problem family.
  • Compositional generalization assessed the ability to combine skills learned in isolation.
  • Transformational generalization tested performance on problems requiring novel, unconventional solution strategies.
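
The sketch referenced above shows how such a GRPO run could be wired up with Hugging Face's TRL library, rewarding exact-match answers on templated problems. The model name matches the paper, but the reward function, dataset schema, and hyperparameters are illustrative assumptions rather than the authors' training recipe.

```python
# Hedged sketch (not the authors' training script): GRPO fine-tuning via TRL,
# with a simple exact-match reward on generated answers.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# In practice these would come from the templated generators; one toy example here.
train_problems = [
    {"question": "Evaluate 3 + 4 * 2, applying the operations strictly left to right.",
     "answer": "14"},
]

dataset = Dataset.from_list(
    [{"prompt": p["question"], "answer": p["answer"]} for p in train_problems]
)

def exact_match_reward(completions, answer, **kwargs):
    """Reward 1.0 if the gold answer string appears in the completion, else 0.0."""
    return [1.0 if gold in completion else 0.0
            for completion, gold in zip(completions, answer)]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",   # one of the two base models used in the study
    reward_funcs=exact_match_reward,
    args=GRPOConfig(output_dir="omega-grpo",
                    num_generations=8,           # assumed group size per prompt
                    max_completion_length=1024), # assumed generation budget
    train_dataset=dataset,
)
trainer.train()
```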

Key Findings on Model Performance

LLMs showed declining accuracy as problem complexity increased, often spending excessive tokens on verification even after reaching a correct solution early in the chain of thought. Reinforcement learning improved generalization from low- to medium-complexity problems, especially in-domain, by reinforcing familiar reasoning patterns. For example, in the Zebra Logic domain, RL training boosted accuracy from 30% to 91% on in-domain problems and significantly improved out-of-distribution results, all without supervised fine-tuning.

Insights and Future Directions

The research shows that RL fine-tuning enhances performance on in-distribution and exploratory tasks but offers limited benefit for compositional reasoning and fails to elicit genuinely novel reasoning patterns. This points to a fundamental limitation: RL broadens problem-solving within known patterns but does not produce the creative leaps that transformational reasoning requires. Future research may explore curriculum scaffolding and meta-reasoning controllers to overcome these hurdles.

For more details, see the paper, project page, and GitHub repository.
