SynPref-40M and Skywork-Reward-V2: Revolutionizing Human-AI Alignment with Scalable Reward Models
SynPref-40M is a massive new preference dataset that enables the Skywork-Reward-V2 family of models to achieve state-of-the-art results in human-AI alignment across multiple benchmarks.
Limitations of Current Reward Models
Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet many leading open models fail to fully capture the complexity of human preferences. Despite advanced training methods, progress has been hampered by limitations in preference datasets, which tend to be narrow, artificially generated, or inadequately vetted. Rule-based systems work well for specific tasks like math or coding but cannot grasp subtle human judgments. Furthermore, benchmarks such as RewardBench are becoming less reliable as measures of real-world reward model effectiveness, showing poor correlation with downstream task success.
Challenges in Creating Preference Data and Innovative Solutions
Traditionally, preference data creation has depended on human annotators, a process that is expensive, slow, and inconsistent. Recent techniques such as RLAIF use large language models (LLMs) to automate annotation, sometimes surpassing human performance. Hybrid approaches now combine LLM-generated data with human validation to improve quality. Reward models themselves have evolved from simple pairwise scoring frameworks such as the Bradley-Terry model to more complex generative and optimization-based methods. Despite a wealth of models and datasets, accurately modeling nuanced human preferences across tasks and languages remains challenging.
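To make the Bradley-Terry framing concrete, here is a minimal sketch of the pairwise loss commonly used to train such reward models. The function and tensor names are illustrative, not taken from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: maximize P(chosen > rejected) under the
    Bradley-Terry model, i.e. sigmoid(r_chosen - r_rejected)."""
    # -log sigmoid(r_chosen - r_rejected); logsigmoid is the numerically stable form.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Illustrative usage with scalar reward-model outputs for a batch of preference pairs.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, -0.5])
print(bradley_terry_loss(chosen, rejected).item())
```

The loss rewards the model for assigning a higher scalar score to the preferred response in each pair, which is exactly the signal a downstream RLHF loop consumes.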
SynPref-40M: A Massive Human-AI Preference Dataset
Researchers at 2050 Research and Skywork AI introduce SynPref-40M, a dataset of 40 million preference pairs created with a two-stage human-AI pipeline: human annotators ensure quality through rigorous verification, while LLMs scale the curation under human guidance. From this data, the team developed Skywork-Reward-V2, a family of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million pairs. These models achieve state-of-the-art performance on seven leading benchmarks, excelling in alignment, safety, objectivity, and robustness. The authors attribute this success not to data volume alone but to meticulous, iterative curation that blends human expertise with AI scalability.
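A single preference pair in such a dataset is conceptually simple. The schema below is a hypothetical illustration of the kind of record the pipeline curates, not the published SynPref-40M format.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """Hypothetical record layout for one curated preference pair."""
    prompt: str    # user query or instruction
    chosen: str    # response preferred by the human- or LLM-guided annotation
    rejected: str  # dispreferred response
    source: str    # e.g. "human_verified" or "llm_scaled" (illustrative labels)

pair = PreferencePair(
    prompt="Summarize RLHF in one sentence.",
    chosen="RLHF fine-tunes a model with a reward signal learned from human preferences.",
    rejected="RLHF is a type of database index.",
    source="human_verified",
)
```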
Two-Stage Human-AI Data Curation Pipeline
Open reward models often overfit to narrow benchmarks such as RewardBench, limiting their usefulness in real-world applications. To address this, the researchers implemented a two-stage curation pipeline. Stage 1 uses human-verified annotations to train LLMs to label diverse preference attributes, followed by iterative error analysis and model refinement. Stage 2 scales the process: consistency checks between the current best model and a "gold" model trained on human-verified data are used to retain reliable samples without further human input. This approach balances quality with scalability and yields tens of millions of high-quality preference pairs, as sketched below.
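The stage-2 filtering step can be pictured as a simple agreement check. The sketch below is a schematic interpretation under assumed interfaces (`best_model` and `gold_model` as callables returning which response they prefer); it is not the authors' released code.

```python
def consistency_filter(pairs, best_model, gold_model):
    """Keep only preference pairs on which the current best reward model and the
    human-data-trained "gold" reward model agree about which response is preferred.

    best_model and gold_model are assumed callables mapping
    (prompt, response_a, response_b) -> "a" or "b"; this is a schematic sketch.
    """
    kept = []
    for prompt, resp_a, resp_b in pairs:
        if best_model(prompt, resp_a, resp_b) == gold_model(prompt, resp_a, resp_b):
            kept.append((prompt, resp_a, resp_b))
    return kept
```

Only the disagreements would need to be escalated for further review, which is what keeps the human workload bounded while the dataset scales to tens of millions of pairs.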
Benchmarking Skywork-Reward-V2: Compact Models with Outstanding Performance
The Skywork-Reward-V2 series delivers superior results across benchmarks, outperforming much larger models (up to 70B parameters) as well as emerging generative reward models. Built on Qwen3 (0.6B–8B) and Llama 3.1/3.2 (1B–8B) backbones, the models achieve top scores on RewardBench, PPE, RM-Bench, and JudgeBench. The best performer, Skywork-Reward-V2-Llama-3.1-8B-40M, reaches an average score of 88.6, surpassing all competitors. Despite their smaller sizes, these models benefit from the high-quality SynPref-40M data and efficient training, leading to better generalization in real-world RLHF applications. Notably, mid-sized models such as the Qwen3-1.7B variant outperform some 70B models, underscoring that data quality and training methodology matter more than sheer parameter count.
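Since these are discriminative, sequence-classification-style reward models, scoring a conversation with one of the released checkpoints would look roughly like the following. The repository id and the single-logit output are assumptions based on how open reward models are commonly packaged on Hugging Face; check the official model cards for exact usage.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed repository id -- verify against the official Skywork Hugging Face collection.
model_id = "Skywork/Skywork-Reward-V2-Llama-3.1-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

conversation = [
    {"role": "user", "content": "Explain RLHF in one sentence."},
    {"role": "assistant", "content": "RLHF fine-tunes a language model using a reward signal learned from human preference data."},
]

# Reward models of this kind typically emit a single scalar logit per conversation.
input_ids = tokenizer.apply_chat_template(conversation, tokenize=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    score = model(input_ids).logits[0][0].item()
print(f"reward score: {score:.3f}")  # higher means more preferred
```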
Future Directions: Precision Scaling in Reward Models
SynPref-40M and Skywork-Reward-V2 showcase how combining human judgment with AI scalability can produce large, high-quality preference datasets and powerful reward models. These models demonstrate strong generalization, alignment with human values, safety, and robustness against bias. Future work will explore novel training strategies as reward models become central to LLM development and alignment.
For more details, see the paper, the models on Hugging Face, and the GitHub repository.