Samsung Unveils ANSE: Smarter Noise Selection Enhances Text-to-Video Diffusion Models
Samsung Research developed ANSE, a novel framework that uses attention-based uncertainty estimation to improve noise seed selection in text-to-video diffusion models, boosting video quality and semantic consistency with minimal computational overhead.
Advancing Text-to-Video Generation with Diffusion Models
Text-to-video (T2V) diffusion models have revolutionized dynamic content creation by converting text prompts into detailed video sequences. These models start from Gaussian noise and progressively refine it into high-quality video frames that are both visually appealing and semantically aligned with the input text. Despite innovations like latent diffusion and motion-aware attention, a persistent challenge remains: video quality and consistency often vary depending on the initial noise seed, leading to unpredictable results and inefficiencies.
The Challenge of Noise Seed Selection
The initial random noise seed significantly affects video quality, temporal coherence, and prompt fidelity. Different seeds can produce drastically different outcomes from the same text prompt. Existing approaches to mitigate this, such as FreeInit, FreqPrior, or PYoCo, rely on handcrafted noise priors or external filtering methods, which are computationally expensive and may not generalize well across models or datasets. Moreover, these methods do not utilize internal signals from the model that could inform better seed selection.
Introducing ANSE: Active Noise Selection for Generation
ANSE (Active Noise SElection for GEneration) leverages the model's internal attention mechanisms to guide noise seed selection efficiently. The core component, BANSA (Bayesian Active Noise Selection via Attention), measures the consistency and confidence of attention maps during early denoising steps using a stochastic approach based on Bernoulli-masked attention sampling. This technique injects randomness directly into the attention computation, so multiple stochastic attention maps can be drawn without multiple complete forward passes, significantly reducing computational overhead.
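The idea of Bernoulli-masked attention can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the Bernoulli mask is applied to the pre-softmax attention logits (masked keys dropped), which is one plausible place to inject the randomness.

```python
import numpy as np

def bernoulli_masked_attention(q, k, keep_prob=0.9, rng=None):
    """Draw one stochastic attention map.

    A Bernoulli(keep_prob) mask is applied to the pre-softmax logits;
    masked query-key pairs are set to -inf, i.e. dropped from the softmax.
    Calling this repeatedly with different masks yields a cheap ensemble
    of attention maps from a single set of query/key activations.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                 # (Nq, Nk) scaled dot-product
    mask = rng.random(logits.shape) < keep_prob   # Bernoulli keep-mask
    logits = np.where(mask, logits, -np.inf)      # drop masked entries
    # Guard: if every key for a query was masked, fall back to uniform.
    logits[~mask.any(axis=1), :] = 0.0
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)   # rows sum to 1
```

Because only the mask changes between samples, the expensive query/key projections are reused across draws, which is what keeps the uncertainty estimate cheap.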
How BANSA Works
BANSA evaluates entropy in attention maps to quantify uncertainty. Computing it at a single informative layer (layer 14 for CogVideoX-2B, layer 19 for CogVideoX-5B) correlates well with full-layer uncertainty estimates, which keeps the computation efficient. The BANSA score compares the average entropy of the individual stochastic attention maps to the entropy of their mean; a lower score indicates that the maps agree with one another, signaling confident attention that predicts better video quality. ANSE selects the noise seed with the lowest BANSA score from a pool of candidates, improving generation outcomes without retraining or external priors.
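The score described above can be written down in a few lines. This is a hedged sketch: it assumes the disagreement is measured BALD-style (entropy of the mean map minus the mean of per-map entropies, which is zero when all maps agree), and `score_fn` stands in for the real pipeline step that runs a few early denoising steps for a seed and returns its BANSA score.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Row-wise Shannon entropy of a stack of probability vectors."""
    return -(p * np.log(p + eps)).sum(axis=-1)

def bansa_score(attn_maps):
    """attn_maps: (M, Nq, Nk) stack of M stochastic attention maps.

    BALD-style disagreement: H(mean map) - mean of per-map entropies.
    Zero when all maps are identical; larger when they disagree,
    so lower = more consistent, more confident attention.
    """
    h_of_mean = entropy(attn_maps.mean(axis=0)).mean()
    mean_of_h = entropy(attn_maps).mean()
    return h_of_mean - mean_of_h

def select_seed(candidate_seeds, score_fn):
    """Return the candidate seed with the lowest BANSA score.

    score_fn(seed) is assumed to run the early denoising steps for that
    seed, collect M Bernoulli-masked attention maps, and return
    bansa_score(...) on them.
    """
    return min(candidate_seeds, key=score_fn)
```

With a candidate pool of 10 seeds, this amounts to 10 short partial denoising runs followed by one full generation from the winning seed, which is where the small inference-time overhead comes from.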
Performance Gains and Efficiency
On the CogVideoX-2B model, ANSE improved the total VBench score from 81.03 to 81.66, with notable gains in quality (+0.48) and semantic alignment (+1.23). For the larger CogVideoX-5B, the score rose from 81.52 to 81.71, with improvements of +0.17 in quality and +0.60 in semantic alignment. These gains were achieved with only an 8.68% and 13.78% increase in inference time for the two models respectively, a stark contrast to prior methods requiring over 200% more computation. Qualitative results demonstrated enhanced visual clarity, semantic consistency, and realistic motion depiction.
Comparative Analysis and Insights
BANSA outperformed random and entropy-based noise selection methods, with performance improvements saturating at 10 stochastic attention samples and a candidate pool size of 10. Deliberately selecting seeds with high BANSA scores degraded video quality, confirming the validity of BANSA as a confidence metric. While ANSE does not alter the diffusion process itself, it offers a practical surrogate for more expensive sampling techniques.
Future Directions
The researchers suggest potential enhancements through integration of information-theoretic methods or active learning strategies to further boost generation quality. ANSE’s design balances efficiency and effectiveness, making it a promising tool for scalable improvements in text-to-video synthesis.
Key Takeaways
- ANSE boosts VBench scores significantly on CogVideoX models.
- Quality and semantic alignment improvements are substantial with modest inference time increases.
- BANSA leverages Bernoulli-masked attention for precise uncertainty estimation.
- Layer-specific entropy calculations reduce computational costs.
- ANSE is more efficient than existing noise selection methods.
- Low BANSA scores reliably indicate better video generation outcomes.
This research introduces a principled, model-aware noise selection framework that enhances video generation quality and consistency by utilizing attention-based uncertainty estimation within diffusion models.