DanceGRPO: Revolutionizing Visual Generation with Unified Reinforcement Learning Across Modalities
DanceGRPO introduces a unified reinforcement learning framework that enhances visual generation across multiple paradigms and tasks, significantly improving visual quality and alignment with human preferences.
Advancements in Visual Generative Models
Recent developments in generative models, particularly diffusion models and rectified flows, have significantly improved the quality and flexibility of visual content creation. However, aligning these models with human preferences requires integrating human feedback during training, which remains challenging.
Challenges in Current Reinforcement Learning Approaches
Existing reinforcement learning (RL) methods such as ReFL rely on differentiable reward models, which leads to high VRAM consumption, especially for video generation. Direct Preference Optimization (DPO) variants yield only marginal visual improvements. Moreover, RL-based methods are complicated by conflicts between ODE-based sampling and Markov Decision Process formulations, instability when training on large datasets, and a lack of validation on video generation tasks.
Leveraging Reinforcement Learning from Human Feedback (RLHF)
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) by training reward functions on comparison data that capture human preferences. Policy gradient methods are effective but computationally expensive and require careful tuning, while DPO trades some performance for lower cost. Recent advances such as DeepSeek-R1 demonstrate that large-scale RL with specialized rewards can promote self-emergent reasoning in LLMs.
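To make the reward-modeling step concrete, here is a minimal, illustrative sketch of the pairwise (Bradley-Terry style) loss commonly used to fit reward functions from comparison data; the function name and tensor values are stand-ins, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the reward of the preferred sample
    above the reward of the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage with stand-in scalar rewards from a hypothetical reward model
r_chosen = torch.tensor([1.2, 0.7, 0.9])
r_rejected = torch.tensor([0.4, 0.8, 0.1])
loss = pairwise_preference_loss(r_chosen, r_rejected)
```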
Introduction of DanceGRPO: A Unified Framework
Researchers from ByteDance Seed and the University of Hong Kong have developed DanceGRPO, a unified framework adapting Group Relative Policy Optimization (GRPO) for visual generation tasks. DanceGRPO seamlessly integrates with diffusion models and rectified flows, supporting text-to-image, text-to-video, and image-to-video generation.
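As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes the rewards of samples generated from the same prompt by the group's own mean and standard deviation to obtain advantages; the helper name and numbers are illustrative assumptions, not DanceGRPO's actual implementation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Given rewards for a group of samples generated from the same prompt,
    compute advantages by normalizing with the group's own mean and
    standard deviation (the core idea behind GRPO)."""
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

# Example: five samples for one prompt, scored by a reward model
rewards = torch.tensor([0.31, 0.42, 0.28, 0.55, 0.39])
advantages = group_relative_advantages(rewards)
```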
Integration with Foundation and Reward Models
DanceGRPO works with four foundation models: Stable Diffusion, HunyuanVideo, FLUX, and SkyReels-I2V. It incorporates five specialized reward models assessing image aesthetics, video aesthetics, text-image alignment, video motion quality, and thresholded binary rewards.
Specialized Reward Models Explained
- Image Aesthetics: Measures visual appeal based on models fine-tuned with human ratings.
- Text-Image Alignment: Uses CLIP to ensure consistency between text and generated visuals.
- Video Aesthetics Quality: Extends assessment to temporal aspects using Vision Language Models (VLMs).
- Video Motion Quality: Evaluates motion realism through physics-aware VLM analysis.
- Thresholding Binary Reward: Applies a discretization mechanism, converting continuous scores into a binary signal, to assess learning under abrupt reward distributions (see the sketch after this list).
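The thresholding idea can be illustrated with a short, hypothetical sketch: a continuous reward score is discretized into a binary signal using a cutoff. The threshold value and scores below are invented for illustration and are not taken from the paper.

```python
import torch

def binary_threshold_reward(scores: torch.Tensor, threshold: float) -> torch.Tensor:
    """Discretize a continuous reward into a binary signal: samples scoring
    at or above the threshold receive 1.0, all others receive 0.0."""
    return (scores >= threshold).float()

# Example with an illustrative threshold of 0.5 on aesthetics scores
aesthetics = torch.tensor([0.31, 0.62, 0.48, 0.71])
binary_reward = binary_threshold_reward(aesthetics, threshold=0.5)  # -> [0., 1., 0., 1.]
```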
Performance Highlights
DanceGRPO outperforms baseline methods by up to 181% on key benchmarks such as HPS-v2.1, CLIP Score, VideoAlign, and GenEval. For example, on Stable Diffusion v1.4, the HPS score rose from 0.239 to 0.365 and the CLIP Score improved from 0.363 to 0.395. In text-to-video tasks with HunyuanVideo, mean reward scores increased significantly, demonstrating better alignment with human aesthetics.
Broader Impact and Future Directions
DanceGRPO addresses critical limitations in prior RL methods by bridging language and visual modalities and scaling robustly across multiple tasks. The framework enhances visual fidelity, motion quality, and text-image alignment. Future research will explore extending GRPO to multimodal generation, aiming to unify optimization methods across generative AI.
For more details, refer to the original paper and project page.