
Alibaba Unveils Lumos-1: A Breakthrough Unified Autoregressive Model for Efficient Video Generation

Alibaba introduces Lumos-1, a unified autoregressive video generation model that leverages MM-RoPE and AR-DF to model spatiotemporal dependencies, achieving competitive benchmark results with modest training resources.

Advancing Autoregressive Video Generation

Autoregressive video generation is an emerging field focused on synthesizing videos frame by frame, learning spatial and temporal patterns as generation proceeds. Unlike conventional techniques that rely on pre-built frames or manual transitions, autoregressive models generate video content token by token, conditioned on preceding tokens, much as large language models predict the next word. This approach aims to unify video, image, and text generation under a common transformer-based framework.
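
To make the analogy concrete, the sketch below shows a generic next-token sampling loop over discrete video tokens. The `model` and the token layout are hypothetical stand-ins, not Lumos-1's actual API; the point is only that generation proceeds one token at a time, conditioned on everything generated so far.

```python
import torch

def generate_video_tokens(model, prompt_tokens, num_new_tokens, temperature=1.0):
    """Autoregressively sample discrete video tokens one at a time,
    the same way a language model samples the next word.
    `model` is assumed to map a (1, seq_len) id tensor to (1, seq_len, vocab) logits."""
    tokens = prompt_tokens.clone()              # (1, seq_len) token ids (text + video)
    for _ in range(num_new_tokens):
        logits = model(tokens)[:, -1, :]        # logits for the next position only
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens                               # later decoded back into frames by a tokenizer
```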

Challenges in Spatiotemporal Modeling

A key difficulty in this domain is accurately capturing the intrinsic spatiotemporal dependencies within videos. Videos inherently contain complex structures across time and space. Effective encoding of these dependencies is crucial for predicting coherent future frames. Poor modeling results in broken continuity or unrealistic outputs. Traditional training methods like random masking fail to balance learning signals across frames, often causing spatial information leakage that makes predictions trivial.

Limitations of Existing Methods

Many existing approaches try to improve autoregressive video generation but often diverge from standard large language model architectures. Some incorporate external pre-trained text encoders, which adds complexity and breaks architectural unity. Others suffer from slow generation speeds due to inefficient decoding. Models like Phenaki and EMU3 support end-to-end generation but face challenges with performance consistency and high training costs. Techniques such as raster-scan ordering or global attention do not scale well for high-dimensional video data.

Introducing Lumos-1: A Unified Solution

The research team from Alibaba Group’s DAMO Academy, Hupan Lab, and Zhejiang University developed Lumos-1, a unified autoregressive video generation model that adheres closely to large language model architecture. Lumos-1 eliminates the need for external encoders and requires minimal changes to the original LLM design.

Innovation with MM-RoPE and AR-DF

Lumos-1 employs MM-RoPE (Multi-Modal Rotary Position Embeddings) to tackle the challenge of representing video’s three-dimensional structure. This method extends traditional RoPE by balancing frequency allocations across temporal, height, and width dimensions, preventing detail loss and ambiguous positional encodings.
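
The sketch below illustrates one common way to allocate rotary frequencies across the three video axes so each gets its own positional band. The function name, the channel split, and the base value are illustrative assumptions, not Lumos-1's exact MM-RoPE scheme.

```python
import torch

def rope_3d_angles(t, h, w, dim, base=10000.0):
    """Sketch of 3D rotary position embeddings: split the channel budget across
    the temporal, height, and width axes so each axis gets its own frequencies.
    t, h, w are 1-D tensors of position indices along each axis."""
    # Hypothetical split of channels between (time, height, width).
    dims = (dim // 2, dim // 4, dim // 4)
    angles = []
    for size, pos in zip(dims, (t, h, w)):
        freqs = 1.0 / (base ** (torch.arange(0, size, 2).float() / size))
        angles.append(torch.outer(pos.float(), freqs))   # (num_positions, size // 2)
    return angles   # rotation angles applied to the q/k channel groups per axis

# Example: an 8-frame, 16x16-token latent grid with 128-dim attention heads.
ang_t, ang_h, ang_w = rope_3d_angles(torch.arange(8), torch.arange(16), torch.arange(16), dim=128)
```

As in standard RoPE, these angles rotate query/key channel pairs, so relative positions along time, height, and width each remain distinguishable to attention.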

Additionally, Lumos-1 introduces AR-DF (Autoregressive Discrete Diffusion Forcing), which applies temporal tube masking during training to prevent over-reliance on unmasked spatial information. This approach ensures balanced learning across frames and maintains high-quality frame generation during inference.
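
The following minimal sketch shows temporal tube masking in the generic sense (one spatial mask repeated across all frames); the helper name and mask ratio are hypothetical and do not reproduce AR-DF's full training recipe.

```python
import torch

def temporal_tube_mask(num_frames, height, width, mask_ratio=0.5):
    """Sample ONE spatial mask and repeat it across every frame, so a masked
    location stays hidden over time and cannot be copied from a neighboring frame."""
    spatial = torch.rand(height, width) < mask_ratio          # (H, W) boolean mask
    return spatial.unsqueeze(0).expand(num_frames, -1, -1)    # (T, H, W) tube mask

# Per-frame random masking, by contrast, often leaves the same spatial location
# visible in an adjacent frame, making the masked token trivial to predict.
mask = temporal_tube_mask(num_frames=8, height=16, width=16, mask_ratio=0.5)
```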

Training and Performance

Trained from scratch on 60 million images and 10 million videos using just 48 GPUs, Lumos-1 demonstrates memory-efficient scalability. It achieves competitive results on multiple benchmarks, matching or rivaling leading models like EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V.

Versatile Multimodal Generation

Lumos-1 supports text-to-video, image-to-video, and text-to-image generation, showcasing strong generalization across modalities. This versatility underscores its potential as a unified framework for multimodal content creation.

Setting a New Standard

By addressing core challenges in spatiotemporal modeling and integrating advanced architectures with innovative training strategies, Lumos-1 sets a new benchmark for efficient and effective autoregressive video generation. It opens avenues for scalable, high-quality video synthesis and future multimodal AI research.

For more details, check out the Paper and GitHub repository. All credit goes to the dedicated researchers behind this project.
