THINKPRM: Revolutionizing Scalable Reasoning Verification with Generative Process Reward Models
THINKPRM introduces a generative process reward model that significantly improves reasoning verification with minimal supervision, outperforming traditional discriminative models across key benchmarks.
Enhancing Reasoning with Process Reward Models (PRMs)
Reasoning in large language models (LLMs) benefits greatly from scaling test-time compute, which in turn relies on high-quality process reward models (PRMs) to identify the most promising reasoning paths for search or ranking. PRMs score problem-solution pairs by judging whether each reasoning step is correct, and have traditionally been implemented as discriminative classifiers. Training such classifiers, however, demands extensive supervision in the form of human step-level annotations, gold step-by-step solutions, or costly rollout-based label estimation.
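As a point of reference, the sketch below shows the discriminative formulation in miniature: a binary classifier head scores each step from a per-step hidden representation. The encoder is a stand-in (random vectors here), and the min-over-steps aggregation is one common choice rather than a detail taken from the article.

```python
import torch
import torch.nn as nn

class DiscriminativePRM(nn.Module):
    """Minimal sketch of a discriminative PRM: a binary classifier head
    that maps one hidden vector per reasoning step to a correctness score."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, step_hidden_states: torch.Tensor) -> torch.Tensor:
        # step_hidden_states: (num_steps, hidden_size), one vector per step,
        # normally produced by a fine-tuned LLM encoder (stand-in here).
        return torch.sigmoid(self.classifier(step_hidden_states)).squeeze(-1)

prm = DiscriminativePRM()
step_scores = prm(torch.randn(4, 768))   # scores for a 4-step solution
solution_score = step_scores.min()       # worst step as the solution-level score
```

Every one of those step labels has to come from somewhere, which is exactly where the annotation and rollout costs enter.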
Challenges with Existing Verification Methods
Approaches that use LLMs as judges offer better data efficiency and interpretability, but they fall short on complex reasoning tasks and often fail to detect flawed reasoning. This leaves an open challenge: combining the interpretability and data efficiency of LLM-as-a-judge methods with the superior accuracy of discriminative PRMs.
Generative PRMs: A Scalable Alternative
Generative PRMs cast verification as a language generation problem: the verifier produces a verification chain-of-thought (CoT) and emits its correctness decision as ordinary natural-language tokens, so the solution score can be read off as the conditional probability of the decision token. This makes the verifiers inherently interpretable and lets them improve with additional inference-time compute. Techniques such as best-of-N selection and tree-based search also boost reasoning performance by spending extra compute at inference time, but their success depends heavily on verifier quality.
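To make the recipe concrete, here is a minimal sketch of scoring with a generative verifier. The checkpoint name, prompt wording, and verdict-token handling are illustrative assumptions rather than the released THINKPRM interface: the model first writes a verification CoT, then the score is taken from the probability it assigns to "correct" versus "incorrect" at an explicit verdict position.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/long-cot-verifier"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def correctness_score(problem: str, solution: str) -> tuple[str, float]:
    """Generate a verification CoT, then read a correctness probability
    from the model's next-token distribution at a verdict prompt."""
    prompt = (
        f"Problem:\n{problem}\n\nProposed solution:\n{solution}\n\n"
        "Check each step carefully, then state whether the solution is correct.\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    gen = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
    cot = tokenizer.decode(gen[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)

    # Append an explicit verdict stem and compare P(" correct") vs P(" incorrect")
    # at the next-token position (first sub-token of each word, as an approximation).
    verdict_ids = tokenizer(prompt + cot + "\nFinal verdict: the solution is",
                            return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(verdict_ids).logits[0, -1]
    probs = torch.softmax(logits.float(), dim=-1)
    yes = tokenizer(" correct", add_special_tokens=False).input_ids[0]
    no = tokenizer(" incorrect", add_special_tokens=False).input_ids[0]
    return cot, (probs[yes] / (probs[yes] + probs[no])).item()
```

The generated CoT is what makes the decision auditable: a reader can inspect the verifier's step-by-step critique rather than a bare scalar.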
Introducing THINKPRM: Efficient and Powerful Verification
A collaboration between researchers at the University of Michigan, Mila, LG AI Research, and the University of Illinois Urbana-Champaign produced THINKPRM, a long CoT verifier fine-tuned on drastically fewer process labels than discriminative PRMs require. By leveraging the reasoning capabilities of long CoT models, THINKPRM surpasses both LLM-as-a-Judge and discriminative verifiers across multiple challenging benchmarks while using only 1% of the process labels in the PRM800K dataset.
Superior Performance Across Multiple Benchmarks
Under equal token budgets, THINKPRM scales verification compute more effectively than LLM-as-a-Judge, outperforming it by 7.2% on a ProcessBench subset. Evaluations compared THINKPRM with DiscPRM (a discriminative PRM fine-tuned on the full PRM800K dataset), plain majority voting, and verifier-weighted majority voting in best-of-N experiments. Tests spanned MATH-500 problems, AIME contests, and out-of-domain tasks such as GPQA-Diamond physics problems and LiveCodeBench v5.
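For readers unfamiliar with these selection rules, the sketch below contrasts them on toy data (the scores are made up): best-of-N keeps the single highest-scoring candidate, while verifier-weighted majority sums the verifier's scores over candidates that share a final answer.

```python
from collections import defaultdict

def best_of_n(candidates: list[tuple[str, float]]) -> str:
    """Best-of-N: return the final answer of the highest-scoring candidate.
    Each candidate is a (final_answer, verifier_score) pair."""
    return max(candidates, key=lambda c: c[1])[0]

def weighted_majority(candidates: list[tuple[str, float]]) -> str:
    """Verifier-weighted majority: sum scores over candidates with the same
    final answer and return the answer with the largest total weight."""
    totals: dict[str, float] = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)

# Toy usage: four sampled solutions with illustrative verifier scores.
samples = [("42", 0.91), ("41", 0.35), ("42", 0.64), ("7", 0.12)]
print(best_of_n(samples))          # "42" (highest single score)
print(weighted_majority(samples))  # "42" (largest summed score)
```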
On MATH-500 best-of-N selection, THINKPRM matched or exceeded DiscPRM's accuracy at every sampling budget. In verifier-guided search, THINKPRM-1.5B outperformed DiscPRM by roughly 5 percentage points and beat LLM-as-a-judge built on the same base model. It also surpassed other strong PRMs such as RLHFFlow-Deepseek-PRM by over 7% at 16 beams. On out-of-domain tasks, THINKPRM outperformed DiscPRM by 8% on GPQA-physics and by 4.5% on LiveCodeBench.
Advantages of Generative PRMs for Scalable Verification
THINKPRM demonstrates that generative PRMs, trained with minimal supervision on synthetic data, enable efficient and scalable verification of step-by-step reasoning. Fine-tuning on as few as 8K process labels improves performance beyond zero-shot LLM-as-a-judge baselines and even exceeds discriminative PRMs trained on far larger labeled datasets. This points to the generative language-modeling objective as the key to combining interpretability, scalability, and data efficiency.
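The article does not spell out the data pipeline, but one plausible reading of "minimal supervision on synthetic data" is the filtering sketch below, in which model-generated verification chains are kept only when their step verdicts agree with the small set of gold PRM800K labels. The data structures and the exact-match rule are assumptions for illustration, not the paper's confirmed recipe.

```python
# Toy pool of model-generated verification chains; "verdicts" holds each
# chain's per-step correct/incorrect judgments (illustrative data only).
synthetic_chains = [
    {"problem_id": "p1", "verdicts": [True, True, False], "text": "...chain A..."},
    {"problem_id": "p1", "verdicts": [True, False, False], "text": "...chain B..."},
]
gold_labels = {"p1": [True, True, False]}  # gold step labels (e.g., from PRM800K)

def keep_chain(verdicts: list[bool], gold: list[bool]) -> bool:
    """Hypothetical filtering rule: keep a synthetic verification chain only
    when its per-step verdicts exactly match the gold process labels."""
    return verdicts == gold

training_set = [c for c in synthetic_chains
                if keep_chain(c["verdicts"], gold_labels[c["problem_id"]])]
print(len(training_set))  # 1 -- only chain A survives and is used for fine-tuning
```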
The research underscores the potential of generative PRMs to effectively scale verification compute at test time, significantly benefiting complex reasoning domains such as mathematics and science.