
PAN Generates Interactable Long-Horizon Video Worlds from Natural Language Actions

PAN maintains an internal latent world state and decodes action-conditioned video segments to simulate long-horizon futures, showing strong fidelity and stability for planning tasks

A persistent world model for action-conditioned video

Most text-to-video systems produce a single clip and stop. PAN, developed at MBZUAI's Institute of Foundation Models, frames a different problem: maintain a persistent latent world state that evolves as natural language actions arrive, and decode those state updates into short video segments that show the consequence of each action. By repeating this cycle, PAN can simulate long-horizon futures conditioned on sequences of actions.

GLP architecture: separate dynamics from rendering

PAN implements a Generative Latent Prediction (GLP) stack that separates what happens in the world from how it looks. The pipeline has three stages:

  • A vision encoder maps frames into a latent world state.
  • An autoregressive latent dynamics backbone, implemented on a VLM foundation, predicts the next latent state conditioned on history and the current natural language action.
  • A video diffusion decoder reconstructs the corresponding short video segment from that latent state.

Concretely, PAN uses Qwen2.5-VL-7B-Instruct for the vision encoder and language backbone. The vision tower tokenizes frames into patches and produces structured embeddings. The backbone runs over a history of world states and actions plus learned query tokens to output the next latent. Latents live in the shared multimodal space of the VLM, which grounds dynamics in text and vision.
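To make the cycle concrete, here is a minimal sketch of such an encode-predict-decode rollout loop. The module names (vision_encoder, dynamics_backbone, video_decoder) and their signatures are illustrative placeholders, not PAN's actual API.

```python
import torch

def rollout(vision_encoder, dynamics_backbone, video_decoder,
            initial_frames, actions):
    """Conceptual GLP rollout: encode the observation once, then alternate
    latent prediction and video decoding for each natural-language action.
    All module names and signatures are illustrative placeholders."""
    # Encode the initial observation into the latent world state.
    state = vision_encoder(initial_frames)      # e.g. (batch, tokens, dim)
    history = [state]
    segments = []

    for action in actions:
        # Predict the next latent world state from the history and the action text.
        next_state = dynamics_backbone(history, action)
        # Render the consequence of the action as a short video segment.
        segment = video_decoder(next_state, action)
        segments.append(segment)
        history.append(next_state)

    # Concatenate the decoded segments along the temporal axis.
    return torch.cat(segments, dim=1)
```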

The video decoder is adapted from Wan2.1-T2V-14B and trained with a flow-matching objective using one thousand denoising steps and a Rectified Flow formulation. The decoder conditions on both the predicted latent world state and the current action text via separate cross-attention streams for the world state and the action.
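One plausible shape for a decoder block with two separate conditioning streams is sketched below. The layer layout, dimensions, and the omission of timestep conditioning are assumptions made for illustration, not PAN's exact design.

```python
import torch.nn as nn

class DualConditionBlock(nn.Module):
    """Illustrative diffusion-decoder block with two cross-attention streams:
    one over the predicted latent world state, one over the action text.
    Dimensions and layer layout are assumptions, not PAN's exact design."""

    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.state_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.action_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, state_tokens, action_tokens):
        # Self-attention over the noisy video tokens being denoised.
        x = video_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Cross-attend to the predicted latent world state.
        x = x + self.state_attn(self.norm2(x), state_tokens, state_tokens)[0]
        # Cross-attend to the embeddings of the current action text.
        x = x + self.action_attn(self.norm3(x), action_tokens, action_tokens)[0]
        return x + self.mlp(x)
```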

Stabilizing long rollouts with Causal Swin DPM and sliding windows

Chaining single-shot video generators by conditioning only on the last frame produces discontinuities and rapid drift over long sequences. PAN addresses this with Causal Swin DPM, a chunk-wise causal-attention augmentation of a shift-window denoising process. The decoder denoises over a sliding temporal window that contains two chunks of frames at different noise levels. During denoising, one chunk moves from noisy to clean and then exits the window while a new noisy chunk enters. Chunk-wise causal attention prevents later chunks from attending to unseen future actions, smoothing transitions and reducing error accumulation.
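The bookkeeping behind such a window can be sketched as follows, assuming two chunks per window and placeholder callables for the denoiser and the noise sampler; chunk sizes and step counts are illustrative, not PAN's actual settings.

```python
import torch

def chunkwise_causal_mask(num_chunks, chunk_len):
    """Boolean attention mask in the PyTorch convention (True = blocked):
    frames in chunk i may attend to chunks 0..i but never to later chunks."""
    chunk_id = torch.arange(num_chunks * chunk_len) // chunk_len
    return chunk_id.unsqueeze(1) < chunk_id.unsqueeze(0)

def sliding_window_rollout(denoise_half, sample_noise, actions):
    """Conceptual Causal-Swin-DPM-style loop. The window always holds two
    chunks at different noise levels; for each action, the older chunk finishes
    denoising and exits while a fresh fully noisy chunk enters.
    `denoise_half` and `sample_noise` are placeholder callables."""
    window = [sample_noise(), sample_noise()]   # [older chunk, newer chunk]
    clean_chunks = []
    for action in actions:
        mask = chunkwise_causal_mask(2, window[0].shape[0])
        # Half a schedule's worth of steps: finishes the older chunk and
        # brings the newer chunk down to the intermediate noise level.
        window = denoise_half(window, action, mask)
        clean_chunks.append(window[0])          # older chunk is now clean; it exits
        window = [window[1], sample_noise()]    # shift: newer chunk ages, fresh noise enters
    return clean_chunks
```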

PAN also injects controlled noise into the conditioning frame rather than using a perfectly sharp frame. This suppresses incidental pixel details and biases the model to focus on stable structure such as object layouts and dynamics.
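In code, this conditioning-frame augmentation could be as simple as the following; the noise scale is an arbitrary placeholder, not the value PAN uses.

```python
import torch

def noisy_conditioning(frame_latent, noise_std=0.1):
    """Perturb the conditioning frame so the decoder keys on stable structure
    (object layout, dynamics) rather than incidental pixel detail.
    The noise level of 0.1 is an illustrative placeholder."""
    return frame_latent + noise_std * torch.randn_like(frame_latent)
```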

Two-stage training and large-scale compute

Training proceeds in two stages. First, the team adapts Wan2.1-T2V-14B to the Causal Swin DPM architecture and trains the decoder in BFloat16 with AdamW, a cosine learning-rate schedule, gradient clipping, and FlashAttention-3 and FlexAttention kernels, using a hybrid sharded data-parallel scheme across 960 NVIDIA H200 GPUs.
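In PyTorch terms, the single-device core of that recipe might look like the sketch below; the learning rate, weight decay, and clip norm are assumed values, and the sharding, FlashAttention-3, and FlexAttention pieces are omitted.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def configure_training(decoder, total_steps, lr=1e-4):
    """Sketch of the stage-1 optimizer setup: BF16 weights, AdamW, cosine
    learning-rate decay, and gradient clipping. Hyperparameter values here
    are assumptions, not PAN's reported settings."""
    decoder = decoder.to(dtype=torch.bfloat16)
    optimizer = AdamW(decoder.parameters(), lr=lr, weight_decay=0.01)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)

    def step(loss):
        loss.backward()
        torch.nn.utils.clip_grad_norm_(decoder.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad(set_to_none=True)

    return step
```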

In the second stage, the Qwen2.5-VL-7B-Instruct backbone is integrated with the video diffusion decoder under the GLP objective. The VLM remains frozen while the query embeddings and the decoder are trained, so that predicted latents and reconstructed videos stay consistent. Long-context sequences are handled with sequence parallelism and Ulysses-style attention sharding. The team stops after one epoch once validation converges, even though the schedule allows up to five epochs.
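The stage-two parameter split, freezing the VLM while training the query embeddings and the decoder, might be expressed as follows; module names are placeholders.

```python
def freeze_backbone(vlm_backbone, query_embeddings, decoder):
    """Stage-2 parameter split sketch: the VLM stays frozen so its multimodal
    latent space is preserved, while the learned query tokens and the diffusion
    decoder receive gradients. Names here are illustrative placeholders;
    `query_embeddings` is assumed to be an nn.Parameter."""
    for p in vlm_backbone.parameters():
        p.requires_grad = False
    trainable = list(decoder.parameters()) + [query_embeddings]
    return trainable  # pass this list to the optimizer
```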

The data come from public video sources covering everyday activities, object interactions, natural scenes, and multi-agent scenarios. Long videos are segmented with shot-boundary detection; a filtering pipeline removes static or extreme clips, low-quality footage, heavy text overlays, and screen recordings. Clips are re-captioned with dense, temporally grounded descriptions that emphasize motion and causal events.
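A simplified view of such a curation pass is sketched below; the helper functions (detect_shots, motion_score, and so on) and all thresholds are hypothetical.

```python
def filter_clips(video, detect_shots, motion_score, quality_score,
                 text_overlay_ratio, is_screen_recording):
    """Illustrative curation pass: split a long video at shot boundaries and
    drop static/extreme, low-quality, text-heavy, or screen-recorded clips.
    All helper callables and thresholds are hypothetical placeholders."""
    kept = []
    for clip in detect_shots(video):
        motion = motion_score(clip)
        if motion < 0.05 or motion > 0.95:   # static or extreme clips
            continue
        if quality_score(clip) < 0.5:        # low-quality footage
            continue
        if text_overlay_ratio(clip) > 0.2:   # heavy text overlays
            continue
        if is_screen_recording(clip):        # screen recordings
            continue
        kept.append(clip)
    return kept
```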

Benchmarks and performance highlights

The team evaluates PAN on action simulation fidelity, long-horizon forecasting metrics, and simulative reasoning and planning against open-source and commercial baselines including WAN 2.1/2.2, Cosmos 1/2, V-JEPA 2, KLING, MiniMax Hailuo, and Gen-3.

Key reported results include:

  • Action simulation fidelity: 70.3% accuracy for agent simulation, 47% for environment simulation, and 58.6% overall, the highest among open-source models.
  • Long-horizon forecasting: Transition Smoothness of 53.6% and Simulation Consistency of 64.1%, outperforming baselines on these metrics.
  • Simulative planning: used inside an OpenAI o3-based agent loop, PAN reaches 56.1% step-wise simulation accuracy, the best among open-source world models; a conceptual sketch of this planning loop follows the list.
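Conceptually, that planning loop has the agent propose candidate action sequences, preview each with the world model, and commit to the first action of the best-scoring rollout. The sketch below is a hedged illustration with placeholder names, not the evaluation harness's code.

```python
def plan_with_world_model(propose_actions, world_model, score_rollout,
                          observation, horizon=5):
    """Sketch of simulative planning: a language-model planner proposes
    candidate action sequences, the world model rolls each one out as an
    imagined video future, and the best-scoring rollout's first action is
    executed. All names and the scoring step are illustrative assumptions."""
    candidates = propose_actions(observation, horizon)        # e.g. from an LLM planner
    best_action, best_score = None, float("-inf")
    for actions in candidates:
        rollout = world_model.simulate(observation, actions)  # imagined future
        score = score_rollout(rollout)                        # task-specific judge
        if score > best_score:
            best_action, best_score = actions[0], score
    return best_action
```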

Why PAN matters

PAN operationalizes Generative Latent Prediction at production scale by combining a Qwen2.5-VL-7B latent dynamics backbone with a Wan2.1-T2V-14B diffusion decoder and a Causal Swin DPM mechanism. The result is a practical, interactable world model that supports multi-step, action-conditioned simulation, counterfactual rollouts, and use as an internal simulator for planning agents. By documenting training, data curation, and reproducible metrics, the project moves beyond toy demos toward a transparent simulation framework.
