Enigmata Toolkit Revolutionizes Puzzle Reasoning in Large Language Models with Advanced Reinforcement Learning
Enigmata introduces a comprehensive toolkit and training strategies that substantially improve large language models' puzzle-reasoning abilities through reinforcement learning with verifiable rewards.
Challenges in Puzzle Reasoning for Large Reasoning Models
Large Reasoning Models (LRMs), derived from Large Language Models (LLMs) and fine-tuned with reinforcement learning (RL), have shown impressive results on complex reasoning tasks in mathematics, STEM, and coding. Despite these advances, LRMs still struggle with puzzle tasks that require purely logical reasoning, tasks humans typically find straightforward. Existing work focuses mainly on building evaluation benchmarks and lacks both the effective training recipes and the diverse puzzle datasets that modern LLMs need to improve in this area.
The Role of Reinforcement Learning with Verifiable Rewards (RLVR)
Reinforcement Learning with Verifiable Rewards (RLVR) has become a pivotal technique for enhancing reasoning abilities by providing direct, objective rewards based on verifiable answers instead of relying on reward models. Puzzles are ideally suited for RLVR because their solutions can be objectively verified. However, previous RLVR research has largely overlooked puzzles as a rich source of effective reward signals.
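To make this concrete, here is a minimal sketch of a verifiable reward for a single puzzle type, using Sudoku as the example; the function names and signatures below are illustrative assumptions, not Enigmata's actual API. The key property is that correctness is decided by deterministic rules rather than a learned reward model.

```python
def is_valid_sudoku(solution: str) -> bool:
    """Rule-based check: every row, column, and 3x3 box must contain the
    digits 1-9 exactly once. Expects nine whitespace-separated rows of
    nine digits (clue-consistency checks omitted for brevity)."""
    grid = [[int(c) for c in row] for row in solution.split()]
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False
    units = (
        grid                                              # rows
        + [list(col) for col in zip(*grid)]               # columns
        + [[grid[r + i][c + j] for i in range(3) for j in range(3)]
           for r in (0, 3, 6) for c in (0, 3, 6)]         # 3x3 boxes
    )
    return all(sorted(unit) == list(range(1, 10)) for unit in units)


def reward(model_output: str) -> float:
    """Binary verifiable reward: 1.0 if the answer checks out, else 0.0.
    No learned reward model is involved, so the signal is hard to game."""
    try:
        return 1.0 if is_valid_sudoku(model_output) else 0.0
    except ValueError:  # malformed output (non-digit characters) earns nothing
        return 0.0
```

Because the reward is computed by code rather than by another model, it can be applied to unlimited generated puzzles at training time without human grading.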
Introducing Enigmata: A Comprehensive Puzzle Reasoning Toolkit
A collaborative research team from ByteDance Seed, Fudan University, Tsinghua University, Nanjing University, and Shanghai Jiao Tong University developed Enigmata, the first comprehensive toolkit designed to improve LLMs' puzzle-reasoning capabilities. Enigmata includes 36 tasks spanning seven categories: Crypto, Arithmetic, Logic, Grid, Graph, Search, and Sequential Puzzles. Each task features an auto-generator that produces unlimited examples at adjustable difficulty, paired with a rule-based verifier for automatic evaluation.
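The sketch below illustrates this generator/verifier pairing with a toy arithmetic task; the class and method names (ArithmeticTask, generate, verify) are assumptions for illustration, not the toolkit's actual interface.

```python
import random
from dataclasses import dataclass

@dataclass
class PuzzleInstance:
    prompt: str
    answer: str

class ArithmeticTask:
    def generate(self, difficulty: int, seed=None) -> PuzzleInstance:
        """Auto-generator: higher difficulty means more operands and
        larger values, yielding unlimited instances at a chosen level."""
        rng = random.Random(seed)
        terms = [rng.randint(1, 10 ** difficulty) for _ in range(2 + difficulty)]
        return PuzzleInstance(prompt=" + ".join(map(str, terms)) + " = ?",
                              answer=str(sum(terms)))

    def verify(self, instance: PuzzleInstance, model_output: str) -> bool:
        """Rule-based verifier: exact match against the ground truth,
        so scoring needs no human labels or reward model."""
        return model_output.strip() == instance.answer

# Mint a fresh instance and score a candidate answer.
task = ArithmeticTask()
inst = task.generate(difficulty=2, seed=0)
print(inst.prompt)                     # e.g. "50 + 98 + 54 + 6 = ?"
print(task.verify(inst, inst.answer))  # True
```

Exposing the same two operations for every task is what lets puzzles of any category be minted on demand during RL training and verified automatically.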
Enigmata Dataset and Evaluation
Enigmata-Data is scalable, diverse, and publicly available, overcoming the limitations of previous puzzle datasets. It is constructed through a three-phase pipeline: task collection and design, auto-generator and verifier development, and sliding difficulty control. The Enigmata-Eval benchmark samples up to 50 puzzle instances per difficulty level for each task, for a total of 4,758 puzzle instances for rigorous evaluation.
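A sketch of how an Enigmata-Eval-style split could be assembled follows: sample a fixed budget of instances per (task, difficulty) cell. The `task.enumerate` method and the difficulty labels are assumptions for illustration; cells with fewer than 50 available instances would explain why the benchmark totals 4,758 rather than 36 tasks x 3 levels x 50 = 5,400.

```python
import random

PER_CELL = 50
DIFFICULTIES = ("easy", "medium", "hard")

def build_eval(tasks, seed=42):
    """Assemble an evaluation set with at most PER_CELL instances
    per task per difficulty level, sampled reproducibly."""
    rng = random.Random(seed)
    benchmark = []
    for task in tasks:                    # e.g. 36 tasks across 7 categories
        for level in DIFFICULTIES:
            pool = task.enumerate(level)  # all instances at this level
            k = min(PER_CELL, len(pool))  # cap at what the task can offer
            benchmark.extend(rng.sample(pool, k))
    return benchmark
```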
Breakthrough Performance of Enigmata-Trained Models
Models trained on Enigmata data with a multi-task RLVR strategy achieve state-of-the-art performance on benchmarks such as AIME, BeyondAIME, and GPQA, with gains most pronounced in large models such as Seed1.5-Thinking. The team's 32B-parameter model outperforms many public models on Enigmata-Eval and excels on the challenging ARC-AGI benchmark, surpassing prominent reasoning models including Gemini 2.5 Pro, o3-mini, and o1.
Strengths and Insights from Enigmata Training
The Enigmata-trained models excel particularly in Crypto, Arithmetic, and Logic tasks, indicating strong rule-based reasoning capabilities. They also show competitive results in search-oriented tasks requiring strategic planning. However, spatial and sequential puzzles remain more challenging, highlighting areas for future improvement.
Broader Impact and Future Directions
The Enigmata framework not only advances puzzle reasoning but also benefits broader reasoning domains when integrated into larger models. Its open-source nature and comprehensive design provide a valuable foundation for the research community to push forward the development of reasoning models, bridging logical puzzle solving with more general reasoning skills in LLMs.
For more details, see the Paper, GitHub Page, and Project Page.