
DSRL: Steering Robot Policies via Latent-Space Reinforcement Learning for Real-World Adaptation

DSRL introduces a novel method to adapt diffusion-based robotic policies via latent-space reinforcement learning, significantly boosting real-world task performance without modifying base models.

Advancing Robotic Control with Learning-Based Methods

Robotics has transitioned from traditional hand-coded control to data-driven learning approaches. Instead of explicit instructions, robots now learn behaviors by imitating observed actions, primarily through behavioral cloning. This enables functionality in structured environments but faces challenges when applied to dynamic, real-world scenarios where adaptability and refinement are essential.

Challenges of Behavioral Cloning and Policy Adaptation

Robotic policies often rely on human demonstrations collected beforehand, which are used to train initial models through supervised learning. However, these policies usually struggle to generalize to new environments or tasks, requiring costly additional demonstrations and retraining. Reinforcement learning (RL) offers autonomous improvement but is hindered by sample inefficiency and the need for direct access to complex policy models, limiting real-world applicability.

Limitations in Combining Diffusion Models and Reinforcement Learning

Recent methods integrating diffusion-based policies with RL focus on manipulating early diffusion steps or adjusting outputs to optimize expected rewards. While effective in simulations, these approaches demand extensive computation and internal access to policy parameters, making them impractical for black-box or proprietary models. Backpropagation through multi-step diffusion processes also introduces instability.

Introducing DSRL: Diffusion Steering via Reinforcement Learning

The DSRL framework, developed by researchers from UC Berkeley, University of Washington, and Amazon, shifts adaptation from direct policy weight modification to optimizing the latent noise inputs of the diffusion model. Instead of sampling latent noise from a fixed Gaussian distribution, a secondary policy is trained via RL to select latent noise that guides the resulting actions toward better performance. This allows efficient fine-tuning without altering the base diffusion model or requiring internal access.
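
To make the mechanism concrete, here is a minimal sketch of latent-noise steering, assuming a frozen diffusion policy that is only queried through forward passes. The class and method names (`FrozenDiffusionPolicy`, `act_from_noise`, `NoisePolicy`) are illustrative placeholders, not the paper's actual API.

```python
# Minimal sketch: steer a frozen diffusion policy by choosing its initial
# latent noise, rather than sampling it from a standard Gaussian.
import torch


class FrozenDiffusionPolicy:
    """Stand-in for a pretrained diffusion policy; only forward passes are used."""
    def __init__(self, denoiser, num_steps: int):
        self.denoiser = denoiser      # pretrained denoising network (frozen)
        self.num_steps = num_steps

    @torch.no_grad()
    def act_from_noise(self, obs, latent_noise):
        # Run the reverse diffusion chain starting from the supplied latent
        # noise instead of a fresh Gaussian sample; no gradients are needed.
        x = latent_noise
        for t in reversed(range(self.num_steps)):
            x = self.denoiser(x, obs, t)
        return x                      # denoised robot action (or action chunk)


class NoisePolicy(torch.nn.Module):
    """Small MLP mapping observations to a latent-noise vector; this is the
    secondary policy that RL trains, while the base policy stays untouched."""
    def __init__(self, obs_dim: int, noise_dim: int):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(obs_dim, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, noise_dim),
        )

    def forward(self, obs):
        return self.net(obs)


# At deployment, the RL-trained noise policy steers the frozen base policy:
# action = base.act_from_noise(obs, noise_policy(obs))
```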

Decoupling Actions Through Latent-Noise Space

DSRL maps the robot's action space into a latent-noise space, where the RL agent selects noise vectors that the diffusion policy transforms into actions. Treating noise as actions enables a reinforcement learning setup external to the base policy, relying solely on forward passes. This design supports black-box model scenarios common in real-world robotics. The latent-noise policy can be trained using standard actor-critic algorithms, avoiding costly backpropagation through diffusion steps and enabling both online and offline learning.
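
A hedged sketch of this "noise as action" setup follows: a Gym-style wrapper re-parameterizes the action space so that an off-the-shelf actor-critic algorithm (e.g., SAC) operates over latent noise, while the frozen base policy converts that noise into robot actions with a single forward pass. The wrapper name, noise bounds, and environment interface are assumptions for illustration, building on the `FrozenDiffusionPolicy` sketch above.

```python
# Sketch of an environment wrapper whose action space is the latent-noise
# space of a frozen diffusion policy (names and bounds are illustrative).
import gymnasium as gym
import numpy as np
import torch


class LatentNoiseActionWrapper(gym.Wrapper):
    """The RL agent emits a latent-noise vector; the frozen diffusion policy
    turns it into a robot action via a forward pass, so no backpropagation
    through the diffusion chain is ever required."""
    def __init__(self, env, base_policy, noise_dim: int):
        super().__init__(env)
        self.base_policy = base_policy
        self.action_space = gym.spaces.Box(
            low=-3.0, high=3.0, shape=(noise_dim,), dtype=np.float32)
        self._last_obs = None

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_obs = obs
        return obs, info

    def step(self, latent_noise):
        obs_t = torch.as_tensor(self._last_obs, dtype=torch.float32)
        noise_t = torch.as_tensor(latent_noise, dtype=torch.float32)
        robot_action = self.base_policy.act_from_noise(obs_t, noise_t)
        obs, reward, terminated, truncated, info = self.env.step(
            robot_action.numpy())
        self._last_obs = obs
        return obs, reward, terminated, truncated, info
```

Because the wrapper only ever calls the base policy's forward pass, any standard actor-critic implementation can be trained on it unchanged, which is what makes the approach compatible with black-box or API-only deployments.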

Performance Gains and Practical Impact

Empirical results demonstrated dramatic improvements in task success and data efficiency. In one real robotic task, success rates increased from 20% to 90% in fewer than 50 episodes. DSRL also effectively improved the generalist robotic policy π₀ without modifying its diffusion model or requiring parameter access. These results highlight DSRL's practicality in restricted settings such as API-only deployments.

Summary

DSRL provides a powerful, efficient, and stable approach to adapting diffusion-based robotic policies without retraining or internal model access. By leveraging latent-noise steering, it opens new possibilities for deploying adaptable robotic systems in real-world settings.
