ReasonFlux-PRM: Revolutionizing Chain-of-Thought Evaluation in Large Language Models
ReasonFlux-PRM is a new trajectory-aware reward model that evaluates both reasoning steps and final answers in large language models, significantly improving their reasoning capabilities and training outcomes.
The Importance of Chain-of-Thought Reasoning in LLMs
Large language models (LLMs) increasingly rely on chain-of-thought reasoning to tackle complex tasks such as mathematical and scientific problems. Instead of producing an answer immediately, these models generate intermediate reasoning steps that mimic logical thinking, improving accuracy and making errors easier to identify.
Drawbacks of Existing Reward Models
Most existing Process Reward Models (PRMs) score only the final, polished response, neglecting the intermediate reasoning trajectory that produced it. Advanced models such as DeepSeek-R1, however, emit detailed reasoning trajectories before their final responses. Current PRMs cannot reliably evaluate these trajectories, which results in unreliable supervision and poorer performance when smaller models are trained on such data.
Challenges with Current PRMs
Existing PRMs are optimized for clean, structured outputs rather than the often lengthy and disorganized reasoning chains from advanced LLMs. Even state-of-the-art models like Qwen2.5-Math-PRM-72B struggle to differentiate quality among intermediate steps, often assigning similar rewards to good and bad reasoning parts. This weak discrimination hampers data selection for fine-tuning and leads to models trained on PRM-filtered data underperforming compared to those using human-curated datasets.
Introducing ReasonFlux-PRM: A Trajectory-Aware Reward Model
A team from UIUC, Princeton, Cornell, and ByteDance Seed developed ReasonFlux-PRM, a novel reward model that evaluates both intermediate reasoning steps and final answers. It combines step-level and trajectory-level scoring for a comprehensive assessment of reasoning quality, and it is trained on a 10,000-sample dataset of curated math and science problems designed to reflect real-world trajectory-response scenarios.
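To make the trajectory-response format concrete, here is a minimal sketch of what one such training sample might look like. The class and field names (`TrajectoryResponseSample`, `prompt`, `trajectory`, `response`) and the example contents are assumptions for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TrajectoryResponseSample:
    """One training example pairing a reasoning trajectory with a final response.

    Field names are illustrative, not the paper's actual schema.
    """
    prompt: str                                            # the math/science problem statement
    trajectory: List[str] = field(default_factory=list)    # intermediate "thinking" steps
    response: str = ""                                      # the final, polished answer

# Example instance (contents invented for illustration):
sample = TrajectoryResponseSample(
    prompt="What is the sum of the first 10 positive integers?",
    trajectory=[
        "The sum of the first n positive integers is n(n+1)/2.",
        "Substituting n = 10 gives 10 * 11 / 2 = 55.",
    ],
    response="55",
)
```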
How ReasonFlux-PRM Works
ReasonFlux-PRM assigns a score to each reasoning step based on its contribution to the final answer, using a reference reward function that conditions on the prompt, the preceding steps, and the final output. These step scores are aggregated into an overall trajectory reward. The model supports several uses: offline filtering of high-quality training data, dense rewards during reinforcement learning with GRPO-based policy optimization, and Best-of-N response selection at test time to improve inference quality, as sketched below.
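The scoring and selection logic can be sketched roughly as follows. Here `score_step` is a hypothetical stand-in for the learned step-level reward model, the mean aggregation and the candidate dictionary layout are assumptions for illustration, and none of this reflects the released implementation.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for the learned step-level reward: given the prompt,
# the steps generated so far, and the final response, return a score for the
# latest step. In ReasonFlux-PRM this role is played by a trained reward model.
StepScorer = Callable[[str, Sequence[str], str], float]

def trajectory_reward(prompt: str, steps: Sequence[str], response: str,
                      score_step: StepScorer) -> float:
    """Aggregate per-step scores into a single trajectory-level reward.

    Mean aggregation is an assumption for illustration; the paper's exact
    aggregation may differ.
    """
    if not steps:
        return 0.0
    scores = [score_step(prompt, steps[: i + 1], response) for i in range(len(steps))]
    return sum(scores) / len(scores)

def best_of_n(prompt: str, candidates: List[dict], score_step: StepScorer) -> dict:
    """Best-of-N selection: keep the candidate whose trajectory scores highest.

    Each candidate is assumed to be a dict with "steps" and "response" keys.
    """
    return max(
        candidates,
        key=lambda c: trajectory_reward(prompt, c["steps"], c["response"], score_step),
    )
```

The same trajectory-level signal could, in principle, serve as a dense reward inside GRPO-style policy optimization or as a threshold for offline filtering of training data, which is how the article describes the model being used.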
Superior Performance on Reasoning Benchmarks
On benchmarks such as AIME, MATH500, and GPQA-Diamond, ReasonFlux-PRM-7B outperformed Qwen2.5-Math-PRM-72B and human-curated data selection by clear margins: it improved supervised fine-tuning accuracy by 12.1%, reinforcement learning performance by 4.5%, and test-time scaling by 6.3%. Despite its smaller size, ReasonFlux-PRM enabled Qwen2.5-14B-Instruct, trained on its selected data, to match or exceed the performance obtained with human-curated data, whereas other PRMs caused accuracy drops of up to 26.6%.
Advancing Reasoning Model Training
By integrating trajectory-level supervision, ReasonFlux-PRM addresses critical gaps in evaluating and training reasoning models. This approach enhances training data quality and model reliability, paving the way for improved chain-of-thought reasoning in large language models.
For more details, see the research paper and GitHub repository.