DreamGym: Meta's Textual World Model That Cuts Real RL Interactions
Meta's DreamGym synthesizes environment interactions as text using a reasoning experience model and grounded replay memory, cutting real rollouts and boosting RL performance across web benchmarks.
Why real-environment RL for LLM agents struggles
Reinforcement learning for large language model agents looks promising in theory, but in practice it runs into four linked obstacles: high cost for real rollouts, low task diversity, noisy or unstable reward signals, and complex, brittle infrastructure. Web-based environments shift frequently, reward extraction often depends on fragile scrapers, many actions are irreversible, and episode resets are hard to enforce. These factors make long-horizon tasks noisy and sample-inefficient: some RL-ready benchmarks need on the order of 80,000 real transitions to reach strong baselines with PPO or GRPO, and other benchmarks are effectively unusable for online RL because resets and automated reward checks fail.
The DreamGym idea: treat experience as text
DreamGym reframes the bottleneck as a modeling problem. Instead of running RL rollouts directly in complex web environments, DreamGym trains a reasoning-based experience model that simulates the environment entirely in text. The framework defines a synthetic Markov decision process where states, transitions and rewards are represented as compact textual descriptions rather than raw HTML or a browser state.
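To make that abstraction concrete, here is a minimal sketch of what a purely textual transition record might look like. The field names are illustrative, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class TextTransition:
    """One step of the synthetic, text-only MDP (illustrative schema, not DreamGym code)."""
    task: str          # natural-language task instruction
    state: str         # compact textual description of the task-relevant page or scene
    action: str        # agent action, e.g. 'search[red shoes]' or 'click[Buy Now]'
    reasoning: str     # chain-of-thought trace produced by the experience model
    next_state: str    # predicted next textual state
    reward: float      # scalar reward predicted by the experience model
    done: bool         # whether the episode terminates at this step
```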
The reasoning-based experience model (Mexp)
The core of DreamGym is the experience model Mexp. It operates in an abstract textual state space: states are concise descriptions of the task-relevant parts of an environment (for example cleaned page elements instead of raw HTML). At each step the agent submits the current state, the chosen action, the task instruction and the interaction history. Mexp retrieves the top-k similar past transitions from the replay buffer and then uses chain-of-thought style reasoning to produce a reasoning trace, a next state and a reward.
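A minimal sketch of that step, assuming a generic embedding-based retriever and a chat-completion callable. The function names, prompt format, and output parsing are placeholders, not DreamGym's API:

```python
def experience_step(llm, retriever, task, history, state, action, k=4):
    """One synthetic environment step: retrieve, reason, then predict.

    `llm` is any text-completion callable; `retriever` returns the k most
    similar stored transitions. Both are assumed interfaces, not DreamGym code.
    """
    # Ground the prediction in similar past transitions from the replay buffer.
    neighbors = retriever.top_k(query=f"{task}\n{state}\n{action}", k=k)

    prompt = (
        f"Task: {task}\n"
        f"History: {history}\n"
        f"Current state: {state}\n"
        f"Action: {action}\n"
        "Similar past transitions:\n" + "\n".join(neighbors) + "\n"
        "First reason step by step about the effect of the action, then output "
        "the next state and a reward in the form:\n"
        "REASONING: ...\nNEXT_STATE: ...\nREWARD: <float>"
    )
    completion = llm(prompt)

    # Naive parsing of the happy path; a real implementation would validate the format.
    reasoning = completion.split("NEXT_STATE:")[0].removeprefix("REASONING:").strip()
    next_state = completion.split("NEXT_STATE:")[1].split("REWARD:")[0].strip()
    reward = float(completion.split("REWARD:")[1].strip())
    return reasoning, next_state, reward
```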
Mexp is effectively an LLM world model defined over text. It is trained with supervised fine-tuning on offline trajectories using a joint objective that asks the model to generate both the reasoning trace and the next state conditioned on that trace. That training objective encourages the model to learn causal structure rather than just local text statistics.
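In SFT terms, the joint objective can be read as one causal-LM cross-entropy over a sequence whose target spans the reasoning trace followed by the next state, so the next-state tokens condition on the generated reasoning. A schematic loss under that reading, using a Hugging Face-style causal LM (illustrative, not the paper's training code):

```python
import torch

def joint_sft_loss(model, tokenizer, example):
    """Schematic joint SFT objective for the experience model.

    One sequence supervises two segments: the reasoning trace, then the
    next state conditioned on that trace. Prompt tokens are masked out.
    """
    prompt = example["context"]   # task + history + state + action + retrieved transitions
    target = example["reasoning"] + "\n" + example["next_state"]

    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)

    # Standard next-token cross-entropy, scored only on reasoning + next-state tokens.
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100

    return model(input_ids=input_ids, labels=labels).loss
```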
Replay buffer as grounding memory
DreamGym initializes an experience replay buffer with offline data collected from WebShop, ALFWorld and WebArena Lite. As policies are trained inside the synthetic environment, newly generated trajectories are written back into the buffer. During prediction, Mexp encodes the current input and retrieves a small set of similar transitions from this memory to condition reasoning and next-state generation.
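A grounding memory of this kind can be approximated with a simple embedding index. The sketch below is a generic nearest-neighbor buffer showing the write/retrieve cycle, not DreamGym's implementation; `embed_fn` is any text encoder you supply:

```python
import numpy as np

class GroundingReplayBuffer:
    """Minimal embedding-indexed replay memory (illustrative sketch)."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn          # any text -> np.ndarray encoder
        self.texts, self.vectors = [], []

    def add(self, transition_text: str) -> None:
        """Store a textual transition (offline seed data or newly generated rollouts)."""
        self.texts.append(transition_text)
        self.vectors.append(self.embed_fn(transition_text))

    def top_k(self, query: str, k: int = 4) -> list[str]:
        """Return the k stored transitions most similar to the query (cosine similarity)."""
        q = self.embed_fn(query)
        mat = np.stack(self.vectors)
        sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-8)
        return [self.texts[i] for i in np.argsort(-sims)[:k]]
```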
This retrieval mechanism provides grounding: it keeps synthetic transitions close to the empirical data distribution and reduces hallucinations during long rollouts. The research shows that removing history or retrieval lowers consistency, informativeness and factuality of generated states and reduces downstream success on WebShop and WebArena Lite.
Curriculum driven by reward entropy
The curriculum task generator shares the experience model backbone. It selects seed tasks whose outcomes under the current policy show high reward variance, which corresponds to intermediate difficulty tasks that the agent sometimes succeeds at and sometimes fails. For each selected seed, the model generates variations that preserve action types while changing constraints, targets or context.
The selection heuristic is reward entropy computed over batches of rollouts per task. Tasks with non-zero variance and a balance between success and failure are preferred. Ablations show that disabling this adaptive curriculum drops performance on WebShop and WebArena Lite by around six percentage points and causes early plateaus as the replay buffer fills with easy, low-entropy trajectories.
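One way to implement that heuristic: estimate each seed task's success rate from a batch of rollouts, score it with binary-outcome entropy (zero for all-success or all-failure), and keep the highest-entropy tasks for variation. The sketch below assumes binary rewards and is an interpretation of the described heuristic, not the paper's code:

```python
import math

def reward_entropy(rewards: list[float]) -> float:
    """Binary-outcome entropy of a task's rollout results under the current policy."""
    p = sum(r > 0 for r in rewards) / len(rewards)
    if p in (0.0, 1.0):
        return 0.0   # task is trivially easy or currently impossible
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def select_seed_tasks(task_rollouts: dict[str, list[float]], n: int = 16) -> list[str]:
    """Pick the n tasks whose outcomes are most mixed, i.e. intermediate difficulty."""
    ranked = sorted(task_rollouts, key=lambda t: reward_entropy(task_rollouts[t]), reverse=True)
    return ranked[:n]
```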
RL inside DreamGym and theoretical links to the real environment
Policies inside DreamGym use standard RL algorithms such as Proximal Policy Optimization and Group Relative Policy Optimization. Rollouts alternate between the policy selecting actions and the experience model synthesizing next states and rewards, and from the RL code's perspective DreamGym is simply another environment interface.
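Because the synthetic MDP sits behind an ordinary environment interface, the RL code only needs reset/step calls. A hedged gym-style wrapper, built on the `experience_step` sketch above (the termination rule and initial-state text are assumptions for illustration):

```python
class DreamGymEnv:
    """Gym-style wrapper around the experience model (illustrative sketch)."""

    def __init__(self, llm, retriever, task_sampler, max_steps=15):
        self.llm, self.retriever = llm, retriever
        self.task_sampler, self.max_steps = task_sampler, max_steps

    def reset(self) -> str:
        self.task = self.task_sampler()          # drawn from the curriculum generator
        self.history, self.t = [], 0
        self.state = f"Initial state for task: {self.task}"
        return self.state

    def step(self, action: str):
        reasoning, next_state, reward = experience_step(
            self.llm, self.retriever, self.task, self.history, self.state, action
        )
        self.history.append((self.state, action, reasoning))
        self.state, self.t = next_state, self.t + 1
        # Assumption: positive reward marks success; otherwise stop at the step budget.
        done = reward > 0 or self.t >= self.max_steps
        return next_state, reward, done, {"reasoning": reasoning}
```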
The team derives a trust-region style improvement bound that connects policy performance in the synthetic MDP to performance in the real environment. The bound contains error terms for reward prediction error and distributional divergence between synthetic and real transitions. As those errors shrink, improvement in DreamGym implies improvement in the real task.
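Schematically, such a bound has the familiar simulation-lemma shape shown below. This is a paraphrase of the structure described above, not the paper's exact statement or constants: with discount $\gamma$, maximum reward $r_{\max}$, reward prediction error bounded by $\epsilon_r$, and transition divergence bounded by $\epsilon_T$,

```latex
J_{\text{real}}(\pi) \;\ge\; J_{\text{syn}}(\pi)
\;-\; \underbrace{\frac{2\,\epsilon_r}{1-\gamma}}_{\text{reward error}}
\;-\; \underbrace{\frac{2\,\gamma\, r_{\max}\,\epsilon_T}{(1-\gamma)^2}}_{\text{transition divergence}}
```

As $\epsilon_r$ and $\epsilon_T$ shrink, the gap between synthetic and real returns closes, which is the formal sense in which improvement inside DreamGym carries over to the real task.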
Experimental results
DreamGym was evaluated with Llama-based and Qwen-based agents on WebShop, ALFWorld and WebArena Lite, producing three clear regimes:
- In RL-ready but costly environments (WebShop, ALFWorld), agents trained inside DreamGym with PPO or GRPO using only synthetic transitions match baselines that required about 80,000 real interactions. This indicates that reasoning-based experience synthesis can provide sufficient signal for stable policy improvement.
- In environments that are not RL-ready (WebArena Lite), DreamGym enables online RL that would otherwise be impractical, improving success rates by over 30% compared to non-RL baselines like supervised fine-tuning and behavior cloning.
- In sim-to-real transfer (DreamGym-S2R), policies pretrained entirely in the synthetic environment and then fine-tuned with a small number of real rollouts achieve over 40% additional gain compared with training from scratch in the real environment. This used less than 10% of the real data and reduced total training cost to roughly one third to one fifth of standard RL baselines.
Key takeaways
DreamGym replaces costly, brittle real-environment rollouts with a reasoning-based text model that predicts next states and rewards from history, task instructions and retrieved similar transitions. Its three core components — Mexp, a replay buffer seeded with real trajectories, and a reward-entropy-driven curriculum — stabilize and diversify RL training. The framework demonstrates strong gains across multiple web benchmarks and suggests a practical sim-to-real pattern for scaling RL for LLM agents.