TiDAR: NVIDIA's Hybrid Diffusion-Autoregressive Design That Multiplies LLM Throughput
NVIDIA's TiDAR combines one-step diffusion drafting with autoregressive verification in a single forward pass, exploiting free GPU token slots to raise decoding throughput by up to roughly 5.9x while preserving benchmark quality.
Why TiDAR matters
NVIDIA proposes TiDAR, a hybrid sequence model that combines diffusion drafting with autoregressive verification to maximize the number of tokens produced per network forward. The core idea is to reuse otherwise idle GPU compute (so-called free token slots) to draft many candidate tokens in parallel with diffusion, then verify and accept a subset autoregressively in the same forward pass. This aims to keep autoregressive-level output quality while dramatically increasing throughput on modern GPUs.
Free token slots and the quality tradeoff
Standard autoregressive decoding produces one token per step and is often memory bound at realistic batch sizes, since latency is dominated by weight and KV cache loads rather than raw FLOPs. Diffusion LLMs exploit this by appending masked positions and denoising multiple tokens in parallel, but they tend to sample tokens independently within the same step. That intra-step independence reduces sequence coherence and factual accuracy, erasing much of the theoretical speed advantage once quality is taken into account.
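To make the free-slot argument concrete, here is a rough roofline estimate (a sketch with illustrative H100 numbers, not figures from the paper): at batch size 1, streaming the weights from HBM dominates the step time, so a forward pass can process many extra token slots before compute becomes the bottleneck.

```python
# Rough roofline sketch of why extra token slots are nearly free at batch
# size 1. All numbers are illustrative approximations, not from the paper.
PARAMS = 8e9          # parameters in an 8B model
BYTES_PER_PARAM = 2   # BF16 weights
HBM_BW = 3.35e12      # approx. H100 SXM HBM bandwidth, bytes/s
PEAK_FLOPS = 9.9e14   # approx. H100 BF16 dense peak, FLOP/s

def step_time_s(tokens_per_forward: int) -> float:
    """Latency of one forward pass under a simple roofline model."""
    mem_s = PARAMS * BYTES_PER_PARAM / HBM_BW              # weight traffic: independent of token count
    flop_s = 2 * PARAMS * tokens_per_forward / PEAK_FLOPS  # ~2 FLOPs per parameter per token
    return max(mem_s, flop_s)

for k in (1, 16, 64, 512):
    print(f"{k:4d} tokens/forward -> {step_time_s(k) * 1e3:6.2f} ms")
# Under these assumptions the step time stays flat up to roughly 300 tokens
# per forward: those extra slots are the "free" compute TiDAR fills with drafts.
```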
TiDAR targets that failure mode by preserving diffusion efficiency while recovering autoregressive sequence coherence through verification, using a single backbone and standard transformer infrastructure.
Dual mode backbone and attention layout
TiDAR partitions each generation step into three regions: a causal prefix of already accepted tokens, a drafting region containing tokens proposed in the previous step, and a mask region that holds candidates for the next drafting step. The attention mask is structured so the causal prefix follows standard causal attention while drafting and mask regions use bidirectional attention within a block. This is a refinement of Block Diffusion where only the decoding block is bidirectional and the rest stays causal.
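For illustration, a minimal PyTorch sketch of such a hybrid mask follows, built from the three-region layout just described. The cross-region visibility choices here (for example, letting mask tokens attend to the draft block) are simplifying assumptions, not the paper's exact specification.

```python
import torch

def tidar_attention_mask(prefix_len: int, block: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for one TiDAR step.

    Layout along the sequence axis:
      [0, prefix_len)            accepted tokens -> causal attention
      [prefix_len, +block)       draft region    -> bidirectional within block
      [prefix_len+block, +block) mask region     -> bidirectional within block
    """
    total = prefix_len + 2 * block
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Causal prefix: standard lower-triangular attention.
    mask[:prefix_len, :prefix_len] = torch.tril(
        torch.ones(prefix_len, prefix_len, dtype=torch.bool))

    d0, d1 = prefix_len, prefix_len + block   # draft region
    m0, m1 = d1, d1 + block                   # mask region

    mask[d0:d1, :prefix_len] = True           # drafts see the whole prefix
    mask[d0:d1, d0:d1] = True                 # bidirectional within draft block

    mask[m0:m1, :prefix_len] = True           # mask tokens see the prefix
    mask[m0:m1, d0:d1] = True                 # ...and the draft block (assumption)
    mask[m0:m1, m0:m1] = True                 # bidirectional within mask block
    return mask
```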
Training doubles the sequence during optimization: the causal section contains the original input with labels shifted for next-token prediction, and a corrupted copy fills a diffusion section. TiDAR employs a full-mask strategy in the diffusion section, replacing all diffusion tokens with a mask token to produce a dense diffusion loss. Equalizing the number of diffusion and autoregressive loss terms and using a single loss weight simplifies optimization; experiments primarily use a weight of 1.
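Conceptually, that gives one next-token cross-entropy term per causal position plus one denoising cross-entropy term per masked position, combined with a single weight. The sketch below assumes a hypothetical model forward that already applies the hybrid attention mask and shared positional indices, and glosses over those details:

```python
import torch
import torch.nn.functional as F

def tidar_loss(model, input_ids, mask_token_id, lam: float = 1.0):
    """Sketch of the doubled-sequence objective (all names hypothetical).

    The sequence is duplicated: the first copy keeps the original tokens for
    next-token prediction, the second copy is fully replaced by the mask
    token, so the diffusion loss is dense and its term count matches the
    autoregressive loss.
    """
    B, T = input_ids.shape
    masked_copy = torch.full_like(input_ids, mask_token_id)  # full-mask strategy
    doubled = torch.cat([input_ids, masked_copy], dim=1)     # [B, 2T]

    logits = model(doubled)                                  # assumed hybrid-mask forward, [B, 2T, V]

    ar_logits = logits[:, : T - 1]                           # predict token t+1 from t
    ar_loss = F.cross_entropy(ar_logits.reshape(-1, ar_logits.size(-1)),
                              input_ids[:, 1:].reshape(-1))

    diff_logits = logits[:, T:]                              # denoise every masked slot
    diff_loss = F.cross_entropy(diff_logits.reshape(-1, diff_logits.size(-1)),
                                input_ids.reshape(-1))

    return ar_loss + lam * diff_loss                         # paper mainly uses lam = 1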
Self-speculative generation in one forward pass
Generation is implemented as a self-speculative flow executed in a single network evaluation per step. At the first step, TiDAR encodes the prompt causally and runs one diffusion step over mask positions to draft a block. Each subsequent forward pass does two things at once: it verifies the drafted tokens against autoregressive logits using a rejection sampling rule, and it pre-drafts the next block conditioned on all possible acceptance outcomes. Accepted tokens are appended to the prefix and kept in the KV cache; rejected tokens and their cache entries are evicted. Because drafting and verification share the backbone and attention mask, diffusion drafting fills the free token slots of the same forward pass.
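A minimal sketch of one such step is below. It substitutes greedy token-match acceptance for the paper's rejection sampling rule, drafts only a single candidate next block rather than conditioning on every acceptance outcome, and assumes hypothetical hybrid_forward and KV cache helpers:

```python
import torch

def tidar_decode_step(model, kv_cache, draft_block: torch.Tensor, block: int):
    """One self-speculative TiDAR step, simplified.

    Greedy token-match acceptance stands in for the paper's rejection
    sampling rule; hybrid_forward and kv_cache.keep are assumed helpers.
    """
    # One forward over [draft block | mask block] against the cached prefix:
    #   ar_logits[i]   - AR distribution for draft slot i given prefix + draft[:i]
    #   next_logits[i] - one-step diffusion distribution for next-block slot i
    ar_logits, next_logits = model.hybrid_forward(draft_block, kv_cache)

    ar_pred = ar_logits.argmax(dim=-1)               # AR choice per draft slot
    matches = (ar_pred == draft_block).long()
    n_accept = int(matches.cumprod(dim=0).sum())     # longest agreeing prefix

    # Keep accepted drafts plus one "free" AR token at the first mismatch
    # (the standard speculative-decoding bonus), so every step emits >= 1 token.
    accepted = torch.cat([draft_block[:n_accept], ar_pred[n_accept:n_accept + 1]])

    kv_cache.keep(accepted.numel())                  # evict KV rows of rejected tokens
    next_draft = next_logits.argmax(dim=-1)[:block]  # greedy one-step diffusion draft
    return accepted, next_draft
```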
The model supports sampling modes that vary how much the final output trusts diffusion versus autoregressive heads. For the reported 8B variant, trusting diffusion predictions helped on some math benchmarks while rejection sampling preserved autoregressive quality.
Implementation details and training
TiDAR models were built by continual pretraining from Qwen2.5 1.5B and Qwen3 4B and 8B bases. The 1.5B model was trained on 50B tokens with block sizes 4, 8, and 16; the 8B model on 150B tokens with block size 16. Both used a maximum sequence length of 4096, a cosine LR schedule, distributed Adam in BF16, and a modified Megatron-LM stack on NVIDIA H100 GPUs. Evaluations use lm_eval_harness across coding, math, reasoning, and knowledge benchmarks.
Quality and throughput
TiDAR 1.5B is competitive with its autoregressive counterpart on coding and math tasks while generating about 7.45 tokens per forward pass. TiDAR 8B shows minimal quality loss versus Qwen3 8B while reaching about 8.25 tokens per forward. On a single H100 GPU at batch size 1, TiDAR 1.5B achieves roughly 4.71x decoding throughput versus Qwen2.5 1.5B, and TiDAR 8B about 5.91x versus Qwen3 8B, while preserving comparable benchmark performance.
Compared with other diffusion LLMs such as Dream and LLaDA, TiDAR delivers both higher efficiency and higher accuracy, even when those baselines decode one token per pass for best quality. Against speculative decoding frameworks and Block Diffusion variants, TiDAR sits on a stronger efficiency-quality frontier: the unified backbone and parallel drafting and verification convert more drafted tokens per forward into accepted tokens per second.
Practical takeaways
TiDAR demonstrates that one-step diffusion drafting and autoregressive verification can coexist in a single transformer, exploiting free GPU token slots to raise tokens per network evaluation without sacrificing sequence-level quality. The design supports exact likelihood computation by switching to a purely causal mask at evaluation time, and it preserves exact KV cache semantics for accepted tokens, making TiDAR practical for production serving on H100 hardware.