
MDM-Prime: Revolutionizing Masked Diffusion Models with Partial Token Unmasking

MDM-Prime enhances Masked Diffusion Models by allowing partial unmasking of tokens, resulting in more efficient and higher-quality text and image generation.

Masked Diffusion Models and Their Limitations

Masked Diffusion Models (MDMs) are effective at generating discrete data, such as text and symbolic sequences, by gradually unmasking tokens over multiple steps. However, because each token can only be fully masked or fully unmasked at any given step, up to 37% of the reverse-process steps do not alter the sequence at all, so the model repeatedly reprocesses the same state and wastes computation.
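To make the idle-step problem concrete, the toy Python sketch below simulates a standard binary-masking reverse process and counts the steps that leave the sequence unchanged. It is purely illustrative: the schedule, vocabulary, and helper names such as `reverse_process` are assumptions for demonstration, not the paper's implementation.

```python
import random

# Toy illustration (not the paper's code): simulate an MDM reverse process in
# which each masked position is unmasked independently with some probability
# per step, and count the steps that change nothing ("idle" steps).

MASK = "<mask>"
VOCAB = ["a", "b", "c", "d"]

def reverse_process(seq_len=16, num_steps=64, seed=0):
    rng = random.Random(seed)
    seq = [MASK] * seq_len
    idle_steps = 0
    for step in range(num_steps):
        # Probability of unmasking a position at this step (a simple uniform
        # schedule here; the actual schedule is a modeling choice).
        p_unmask = 1.0 / (num_steps - step)
        changed = False
        for i, tok in enumerate(seq):
            if tok == MASK and rng.random() < p_unmask:
                seq[i] = rng.choice(VOCAB)  # stand-in for the model's prediction
                changed = True
        if not changed:
            idle_steps += 1  # the whole step was a wasted forward pass
    return seq, idle_steps

seq, idle = reverse_process()
print(f"idle steps: {idle} / 64")
```

With many more steps than tokens, a large fraction of steps unmask nothing, which is exactly the redundancy MDM-Prime targets.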

Advancements in Discrete Diffusion Modeling

Discrete diffusion models have evolved from early binary data applications to handling complex tasks like text and image generation with various noise strategies. Recent innovations include simplifying training objectives, combining MDMs with autoregressive methods, energy-based sampling guidance, selective remasking, and distillation techniques to reduce sampling steps. Approaches using continuous noise for discrete data, such as Bit Diffusion, face challenges like intractable likelihoods due to quantization.

Introducing MDM-Prime and Partial Masking

MDM-Prime, developed by researchers from the Vector Institute, NVIDIA, and National Taiwan University, introduces a novel Partial Masking (Prime) scheme. Instead of binary masking, Prime allows tokens to adopt intermediate states by masking parts of a token’s encoded representation. This gradual revealing of token information enhances prediction quality and reduces redundant computations.
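As a rough illustration of the idea, the sketch below encodes a token id into sub-tokens and masks only some of them, producing an intermediate, partially revealed state. The helper names (`encode`, `partial_mask`) and the base-16 expansion are hypothetical choices for demonstration, not the authors' API.

```python
# Illustrative sketch of partial masking: a token id is expanded into l
# sub-tokens, and each sub-token can be masked independently, so a token can
# be partially revealed instead of only fully masked or fully unmasked.

MASK = None  # stand-in for the sub-token mask symbol

def encode(token_id: int, base: int, num_subtokens: int) -> list[int]:
    """Invertible base-`base` expansion of a token id into sub-tokens."""
    digits = []
    for _ in range(num_subtokens):
        digits.append(token_id % base)
        token_id //= base
    return digits[::-1]  # most-significant digit first

def partial_mask(subtokens: list[int], mask_flags: list[bool]) -> list:
    """Mask only the flagged sub-tokens, leaving the rest visible."""
    return [MASK if m else s for s, m in zip(subtokens, mask_flags)]

subs = encode(token_id=123, base=16, num_subtokens=2)  # -> [7, 11]
state = partial_mask(subs, mask_flags=[False, True])   # -> [7, None]
print(subs, state)
```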

Architecture and Training Innovations

MDM-Prime decomposes tokens into sequences of sub-tokens via an invertible function, enabling smoother intermediate diffusion states and fewer idle steps. The reverse generation process is trained using a variational bound over sub-tokens. To handle sub-token dependencies and prevent invalid outputs, the model learns a joint probability distribution and discards inconsistent sequences. The architecture features an efficient encoder-decoder optimized for sub-token processing.
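The sketch below shows one way such an invertible decomposition and validity check could look: token ids round-trip exactly through encode/decode, and decoded sub-token sequences that fall outside the vocabulary are discarded. The base, sub-token count, and function names are assumptions for illustration, not the paper's code.

```python
# Minimal sketch (assumed names): an invertible token <-> sub-token mapping
# plus the consistency check implied by discarding invalid sequences.

def encode(token_id: int, base: int, l: int) -> tuple[int, ...]:
    digits = []
    for _ in range(l):
        digits.append(token_id % base)
        token_id //= base
    return tuple(reversed(digits))

def decode(subtokens: tuple[int, ...], base: int) -> int:
    token_id = 0
    for d in subtokens:
        token_id = token_id * base + d
    return token_id

def is_valid(subtokens: tuple[int, ...], base: int, vocab_size: int) -> bool:
    """Sub-token sequences that decode outside the vocabulary are discarded."""
    return decode(subtokens, base) < vocab_size

VOCAB_SIZE, BASE, L = 50257, 16, 4                      # e.g. a GPT-2-sized vocabulary
assert decode(encode(50000, BASE, L), BASE) == 50000    # the round-trip is exact
print(is_valid((15, 15, 15, 15), BASE, VOCAB_SIZE))     # 65535 >= 50257 -> False
```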

Performance on Text and Image Generation

Evaluations on text generation with OpenWebText show that MDM-Prime significantly lowers perplexity (15.36) and reduces the idle-step ratio, especially at sub-token granularity ℓ ≥ 4. It surpasses earlier methods without relying on autoregressive techniques and generalizes well across zero-shot benchmarks. For image generation on CIFAR-10 and ImageNet-32, MDM-Prime (ℓ = 2) achieves superior sample quality and lower FID scores, outperforming baselines while maintaining efficiency. It also excels in conditional image generation by predicting masked sub-tokens from partially observed inputs.

Impact and Future Directions

MDM-Prime marks a shift toward more granular and efficient discrete data generation by enabling partial token unmasking. This approach reduces redundant computations and enhances expressiveness in generative modeling. Its strong performance on both text and images demonstrates its potential as a powerful tool for various discrete data generation tasks.

For more details, refer to the Paper, Project Page, and GitHub.
