Radial Attention Revolutionizes Video Diffusion: 4.4× Cost Reduction Without Quality Loss
Radial Attention introduces a novel sparse attention mechanism that cuts training costs by 4.4× and inference time by 3.7× in video diffusion models, enabling generation of longer videos without quality loss.
Challenges in Video Diffusion Models
Diffusion models have advanced significantly in generating high-quality, coherent videos by extending techniques successful in image synthesis. However, videos add a temporal dimension that dramatically increases computational demands. The self-attention mechanism, which is essential for capturing dependencies, scales quadratically with sequence length, making it challenging to efficiently train or generate long videos. Existing approaches like Sparse VideoGen attempt to speed up inference via attention head classification but often compromise on accuracy and generalization. Other alternatives replace softmax attention with linear approximations, but these require substantial architectural changes. Insights from physics, where signal energy naturally decays with distance, point toward more efficient ways to model video attention.
Progress in Attention Mechanisms for Video Synthesis
Initial video diffusion models adapted 2D architectures by incorporating temporal elements. More recent models, such as DiT and Latte, improve spatial-temporal modeling through advanced attention methods. Although 3D dense attention achieves state-of-the-art results, its computational cost grows rapidly with video length, making long video generation expensive. Techniques like timestep distillation, quantization, and sparse attention reduce this cost but often ignore the unique structure of video data. Linear and hierarchical attention methods improve efficiency but struggle to maintain fine detail or scale well in practice.
Spatiotemporal Energy Decay and Introduction to Radial Attention
Researchers from MIT, NVIDIA, Princeton, UC Berkeley, Stanford, and First Intelligence identified a phenomenon they call Spatiotemporal Energy Decay, where attention scores between tokens diminish as spatial or temporal distance increases, similar to natural signal fading. Based on this, they developed Radial Attention, a sparse attention mechanism with O(n log n) complexity. It employs a static attention mask that focuses mainly on nearby tokens, with the attention window shrinking as the temporal distance between frames grows. This approach enables pre-trained models to generate videos up to four times longer while cutting training costs by 4.4× and inference time by 3.7×, all without compromising video quality.
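One rough way to see this decay in an existing model is to average attention probabilities by the temporal distance between the query and key tokens. The sketch below is a minimal illustration on a random attention map; the `frame_len` layout and the averaging scheme are assumptions for the example, not the authors' measurement code.

```python
import torch

def attention_mass_by_frame_distance(attn: torch.Tensor, frame_len: int) -> torch.Tensor:
    """Mean attention probability as a function of temporal (frame) distance.

    attn: (seq_len, seq_len) softmax attention map, where seq_len = n_frames * frame_len.
    Returns a tensor of length n_frames: the average attention mass at each frame distance.
    """
    seq_len = attn.shape[0]
    n_frames = seq_len // frame_len
    # Frame index of every token position.
    frame_id = torch.arange(seq_len) // frame_len
    # Temporal distance between every query/key pair.
    dist = (frame_id[:, None] - frame_id[None, :]).abs()
    return torch.stack([attn[dist == d].mean() for d in range(n_frames)])

# Toy demonstration on a random map; a real map would come from a video DiT attention layer,
# where the returned curve is expected to fall off as frame distance grows.
torch.manual_seed(0)
attn = torch.randn(256, 256).softmax(dim=-1)  # 16 frames x 16 tokens per frame
print(attention_mass_by_frame_distance(attn, frame_len=16))
```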
Sparse Attention Leveraging Energy Decay
Radial Attention capitalizes on the observation that attention strength weakens with increased spatial and temporal distance. Rather than attending equally to all tokens, it uses a sparse attention mask that decays exponentially in space and time, preserving only the most relevant interactions. This reduces computational complexity to O(n log n), making it much faster and more efficient than dense attention. Minimal fine-tuning with LoRA adapters allows pre-trained models to efficiently generate much longer videos.
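As a rough illustration of the idea, the sketch below builds a block-wise mask in which a token always attends fully within its own frame, while the window into other frames roughly halves each time the temporal distance doubles, so the number of attended pairs grows on the order of n log n. The window schedule, `frame_len`, and `base_window` are illustrative assumptions, not the exact mask used in the paper.

```python
import torch

def radial_attention_mask(n_frames: int, frame_len: int, base_window: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask: True where attention is computed.

    Tokens attend densely within their own frame; the window into other frames
    shrinks exponentially as the temporal distance between frames increases.
    """
    seq_len = n_frames * frame_len
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    pos = torch.arange(frame_len)
    for qf in range(n_frames):
        for kf in range(n_frames):
            dt = abs(qf - kf)
            # Window roughly halves every time the temporal distance doubles.
            window = frame_len if dt == 0 else max(1, base_window >> dt.bit_length())
            local = (pos[:, None] - pos[None, :]).abs() <= window
            mask[qf * frame_len:(qf + 1) * frame_len,
                 kf * frame_len:(kf + 1) * frame_len] = local
    return mask

mask = radial_attention_mask(n_frames=8, frame_len=32, base_window=16)
print(f"attended fraction: {mask.float().mean():.3f}")  # well below 1.0 for long videos
```

Such a mask would be passed to a sparse attention kernel so that masked-out pairs are skipped entirely rather than computed and zeroed, which is where the speedup comes from.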
Performance Evaluation on Leading Models
Radial Attention was tested on three prominent text-to-video diffusion models: Mochi 1, HunyuanVideo, and Wan2.1. It demonstrated superior speed and quality compared to sparse attention baselines such as SVG (Sparse VideoGen) and PowerAttention. The method achieved up to 3.7× faster inference and reduced training costs by 4.4× for longer videos. It scales to videos four times longer and remains compatible with existing LoRA adapters, including style adapters. Notably, LoRA fine-tuning with Radial Attention sometimes even outperforms full fine-tuning, highlighting its efficiency and effectiveness for long video generation.
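A minimal picture of that fine-tuning recipe: freeze the pre-trained weights and learn low-rank updates on the attention projections. The sketch below hand-rolls a LoRA wrapper around a linear layer for clarity; the rank, scaling, and choice of projections to adapt are illustrative, and in practice one would use an established LoRA library together with the released training code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (alpha / r) * B(A(x))."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pre-trained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B
        nn.init.zeros_(self.up.weight)       # start as a no-op on top of the base layer
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# Example: wrap the query/key/value/output projections of one attention block
# (hypothetical names; actual module names depend on the model implementation).
d_model = 1024
attn_proj = {name: nn.Linear(d_model, d_model) for name in ("to_q", "to_k", "to_v", "to_out")}
lora_proj = {name: LoRALinear(layer) for name, layer in attn_proj.items()}

x = torch.randn(2, 77, d_model)
print(lora_proj["to_q"](x).shape)  # torch.Size([2, 77, 1024])
```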
Scalable and Efficient Long Video Generation
Radial Attention offers a scalable sparse attention mechanism for efficient long video generation in diffusion models. Inspired by Spatiotemporal Energy Decay, it mimics natural signal fading to reduce computation via a static attention pattern with exponentially shrinking windows. This yields up to 1.9× faster generation at the default video length and supports videos four times longer, while lightweight LoRA-based fine-tuning significantly decreases training and inference costs, all without sacrificing video quality across state-of-the-art diffusion models.
For more details, see the paper and GitHub page.