Decoupled Diffusion Transformers: Boosting Image Generation Speed and Quality with Semantic and Detail Separation

Decoupled Diffusion Transformers (DDT) separate semantic encoding and detail decoding to speed up training and improve image generation quality, setting new benchmarks on ImageNet datasets.

Diffusion Transformers in Image Generation

Diffusion Transformers have surpassed traditional generative models such as GANs and autoregressive architectures in producing high-quality images. They work by progressively adding noise to images in a forward diffusion process and learning to reverse it through denoising, thereby approximating the original data distribution. Replacing the UNet backbone of earlier diffusion models with a transformer improves scalability and sample quality, but at a steep cost: training is slow and computationally expensive because the same modules must simultaneously encode low-frequency semantic information and decode high-frequency details, creating an optimization conflict.
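
For orientation, the forward noising step is only a few lines. Below is a minimal sketch assuming a standard variance-preserving schedule with a precomputed alpha_bar table; all names are illustrative, not from the paper:

```python
import torch

def forward_noise(x0: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor):
    """Corrupt clean images x0 to noise level t (forward diffusion).

    x0:        (B, C, H, W) clean images
    t:         (B,) integer timesteps
    alpha_bar: (T,) cumulative noise schedule, alpha_bar[t] in (0, 1]
    """
    noise = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1, 1)             # broadcast over C, H, W
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise
    return x_t, noise  # the network is trained to reverse this corruption
```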

Strategies to Improve Efficiency

To overcome these challenges, researchers have explored several directions for making Diffusion Transformers more efficient. Optimized attention mechanisms (linear and sparse attention) reduce computational overhead. Training-time strategies such as log-normal timestep resampling and loss reweighting stabilize optimization across noise levels. Domain-specific inductive biases (REPA, RCG, DoD) and masked modeling improve structured feature learning and reasoning. Models such as DiT, SiT, SD3, Lumina, and PixArt extend these frameworks to text-to-image and text-to-video generation.
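
Timestep resampling is the easiest of these to illustrate. The "lognorm" sampling popularized by SD3-style training is typically implemented as a logit-normal draw, concentrating training on mid-range noise levels; a minimal sketch (the mean/std defaults are illustrative):

```python
import torch

def lognorm_timesteps(batch_size: int, mean: float = 0.0, std: float = 1.0) -> torch.Tensor:
    """Draw timesteps t in (0, 1) from a logit-normal density, which puts
    more training weight on intermediate noise levels than uniform sampling."""
    return torch.sigmoid(torch.randn(batch_size) * std + mean)
```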

Introducing the Decoupled Diffusion Transformer (DDT)

Researchers from Nanjing University and ByteDance Seed Vision proposed the Decoupled Diffusion Transformer (DDT), which separates the architecture into two distinct components: a condition encoder for semantic extraction and a velocity decoder for detailed image generation. This separation resolves the optimization conflict by handling low-frequency semantics and high-frequency details independently. The DDT-XL/2 model achieves state-of-the-art FID scores of 1.31 and 1.28 on ImageNet at 256×256 and 512×512 resolutions, respectively, while training up to four times faster.
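
In code, the decoupling amounts to two transformer stacks with a narrow interface between them. The sketch below is a minimal PyTorch illustration of that structure, not the paper's implementation; the block internals, conditioning embedding, and sizes are placeholders:

```python
import torch
import torch.nn as nn

class DecoupledDiT(nn.Module):
    """Sketch of DDT's split: a condition encoder extracts semantic features
    z_t; a velocity decoder predicts the velocity field from (x_t, z_t)."""

    def __init__(self, dim: int = 512, enc_layers: int = 8, dec_layers: int = 4):
        super().__init__()
        block = lambda: nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.ModuleList(block() for _ in range(enc_layers))
        self.decoder = nn.ModuleList(block() for _ in range(dec_layers))
        self.cond_embed = nn.Linear(2, dim)  # stand-in for timestep/class embeddings
        self.head = nn.Linear(dim, dim)

    def encode(self, x_tokens, t, y):
        """Condition encoder: noisy tokens + timestep + class -> semantic z_t."""
        cond = self.cond_embed(torch.stack([t.float(), y.float()], dim=-1))
        h = x_tokens + cond.unsqueeze(1)
        for blk in self.encoder:
            h = blk(h)
        return h  # z_t, reusable across nearby timesteps at inference

    def decode(self, x_tokens, z_t):
        """Velocity decoder: estimate the velocity field given x_t and z_t."""
        h = x_tokens + z_t
        for blk in self.decoder:
            h = blk(h)
        return self.head(h)
```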

Mechanisms Behind DDT’s Performance

The condition encoder extracts semantic features (z_t) from noisy inputs, timesteps, and class labels. The velocity decoder uses these features to estimate the velocity field for image generation. Techniques like representation alignment and decoder supervision ensure consistent semantic representations across denoising steps. During inference, a shared self-conditioning mechanism reuses encoder outputs (z_t) at selected timesteps to reduce computation. A dynamic programming algorithm optimally determines these timesteps, balancing speed and image quality.
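
The timestep-selection problem has the shape of a classic segmentation DP: given a budget of K encoder evaluations across T denoising steps, choose where to recompute z_t so that the total cost of reusing stale features is minimized. The sketch below shows that generic recurrence under an assumed drift cost; the paper's actual objective and cost model may differ:

```python
import math

def plan_encoder_reuse(num_steps: int, num_encodes: int, drift):
    """Pick the denoising steps at which to recompute the encoder, given a
    budget of `num_encodes` encoder calls. `drift(j, i)` scores the cost of
    reusing z_t computed at step j for all steps up to (but excluding) i.

    dp[k][i] = min cost of covering steps 0..i-1 with k encoder calls.
    """
    INF = math.inf
    dp = [[INF] * (num_steps + 1) for _ in range(num_encodes + 1)]
    parent = [[-1] * (num_steps + 1) for _ in range(num_encodes + 1)]
    dp[0][0] = 0.0
    for k in range(1, num_encodes + 1):
        for i in range(1, num_steps + 1):
            for j in range(k - 1, i):  # last encoder call placed at step j
                cost = dp[k - 1][j] + drift(j, i)
                if cost < dp[k][i]:
                    dp[k][i] = cost
                    parent[k][i] = j
    # Backtrack to recover the steps where the encoder is recomputed.
    steps, i = [], num_steps
    for k in range(num_encodes, 0, -1):
        j = parent[k][i]
        steps.append(j)
        i = j
    return sorted(steps)

# Toy usage: drift grows with how long a single z_t is reused.
plan = plan_encoder_reuse(10, 3, drift=lambda j, i: (i - j) ** 2)
print(plan)  # e.g. [0, 3, 6] -> recompute z_t at these steps, reuse otherwise
```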

Training and Evaluation

DDT models were trained on ImageNet at 256×256 resolution with a batch size of 256, without gradient clipping or warm-up. Performance was evaluated with FID, sFID, IS, Precision, and Recall, using the VAE-ft-EMA autoencoder and Euler sampling. The researchers strengthened their baselines with SwiGLU, RoPE, RMSNorm, and lognorm timestep sampling. DDT consistently outperformed previous models, with the largest gains at bigger model sizes, and converged faster than competitors like REPA. Encoder-sharing strategies and tuning the encoder-to-decoder size ratio further improved results.
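
For concreteness, velocity-field training and Euler sampling look roughly like this under a linear interpolant between data and noise (a common SiT-style choice; the paper's exact interpolant and loss weighting may differ, and `model` here is any network mapping (x_t, t, y) to a velocity):

```python
import torch
import torch.nn.functional as F

def velocity_loss(model, x0, t, y):
    """Flow-matching step: interpolate data and noise, then regress the model
    output onto the interpolant's time derivative (the target velocity)."""
    noise = torch.randn_like(x0)
    tt = t.view(-1, *([1] * (x0.dim() - 1)))  # broadcast t over image dims
    x_t = (1.0 - tt) * x0 + tt * noise        # linear interpolant
    v_target = noise - x0                     # d x_t / dt
    return F.mse_loss(model(x_t, t, y), v_target)

@torch.no_grad()
def euler_sample(model, shape, y, steps=50, device="cpu"):
    """Integrate the learned velocity field from noise (t=1) to data (t=0)."""
    x = torch.randn(shape, device=device)
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        t = ts[i].expand(shape[0])
        x = x + (ts[i + 1] - ts[i]) * model(x, t, y)  # dt is negative
    return x
```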

Impact and Future Prospects

The decoupled design of DDT not only accelerates training but also enables efficient encoder sharing during inference, significantly reducing computational costs with minimal impact on output quality. The dynamic programming method for sharing decisions is a novel contribution that balances performance and speed. This research paves the way for more scalable and efficient diffusion transformer architectures in high-fidelity image generation tasks.
