
Uni-MoE-2.0-Omni: An Open Omnimodal MoE Unifying Text, Image, Audio and Video

Uni-MoE-2.0-Omni is an open omnimodal MoE on a Qwen2.5-7B core that unifies text, image, audio and video via a shared token interface, dynamic expert routing and staged training for robust multimodal understanding and generation.

Language-centric core and unified token interface

Uni-MoE-2.0-Omni builds a language-centric omnimodal backbone on a Qwen2.5-7B-style transformer. Text, image, audio and video inputs are converted into token sequences that share a single interface with the language model. Pretrained visual encoders turn images and video frames into token sequences, a unified speech encoder maps environmental sounds, speech and music into the same representation space, and all of these tokens enter the transformer, so the same self-attention layers can reason across modalities.

Unified modality encoding and 3D RoPE

To preserve spatio-temporal structure, the system extends rotary positional embeddings to three coordinate axes (time, height and width) shared by the visual and audio streams: visual tokens receive all three coordinates, while speech tokens advance only along the time axis. This Omni Modality 3D RoPE gives the transformer explicit information about when and where tokens occur, which is crucial for video understanding and audio-visual reasoning.
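To make the coordinate scheme concrete, here is a minimal sketch, assuming a patch grid of H×W per frame and one token per audio frame; the actual index layout used by Uni-MoE-2.0-Omni may differ:

```python
import torch

def build_3d_position_ids(num_frames: int, grid_h: int, grid_w: int, num_audio_tokens: int):
    """Illustrative (time, height, width) coordinates: video patch tokens get all
    three axes, speech tokens advance only along the time axis. This is a sketch
    of the idea, not the exact scheme used in Uni-MoE-2.0-Omni."""
    # Video patches: every patch is indexed by (frame, row, column).
    t = torch.arange(num_frames).view(-1, 1, 1).expand(num_frames, grid_h, grid_w)
    h = torch.arange(grid_h).view(1, -1, 1).expand(num_frames, grid_h, grid_w)
    w = torch.arange(grid_w).view(1, 1, -1).expand(num_frames, grid_h, grid_w)
    video_pos = torch.stack([t, h, w], dim=-1).reshape(-1, 3)          # (T*H*W, 3)

    # Speech tokens: only the temporal axis moves; spatial axes stay at zero.
    a_t = torch.arange(num_audio_tokens)
    audio_pos = torch.stack([a_t, torch.zeros_like(a_t), torch.zeros_like(a_t)], dim=-1)

    return torch.cat([video_pos, audio_pos], dim=0)                    # shared 3-axis position table
```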

MoE-driven fusion with dynamic capacity routing

Standard MLP blocks are replaced by Mixture-of-Experts (MoE) layers that contain three expert types:

  • Empty experts: act as null functions to allow computation skipping at inference time.
  • Routed experts: modality-specific experts that store domain knowledge for audio, vision or text.
  • Shared experts: small always-active experts that provide communication paths across modalities.

A routing network decides which experts to activate per token, enabling specialization without the full cost of running all experts densely. The design supports roughly 10 cross-modal input configurations (for example, image+text, video+speech, or tri-modal combinations).
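A minimal sketch of this routing pattern, with illustrative sizes and a simple softmax-then-top-k gate (not the released configuration), might look like this:

```python
import torch
import torch.nn as nn

class DynamicCapacityMoE(nn.Module):
    """Sketch of an MoE block with shared, routed and 'empty' experts.
    Expert counts, sizes and gating details are assumptions for illustration."""

    def __init__(self, d_model=1024, d_ff=2048, n_routed=6, n_empty=2, n_shared=1, top_k=2):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.routed = nn.ModuleList(ffn() for _ in range(n_routed))   # modality/domain experts
        self.shared = nn.ModuleList(ffn() for _ in range(n_shared))   # always-active experts
        self.router = nn.Linear(d_model, n_routed + n_empty)          # empty slots carry no FFN
        self.top_k = top_k

    def forward(self, x):                                             # x: (tokens, d_model)
        probs = self.router(x).softmax(dim=-1)
        weights, idx = probs.topk(self.top_k, dim=-1)                 # per-token expert choices
        out = sum(expert(x) for expert in self.shared)                # shared path across modalities
        for eid, expert in enumerate(self.routed):
            # Gate is nonzero only for tokens that picked this routed expert; tokens whose
            # picks land on an "empty" slot add nothing and are effectively skipped.
            # (Dense computation over all tokens here is for clarity, not efficiency.)
            gate = (weights * (idx == eid)).sum(dim=-1, keepdim=True)
            out = out + gate * expert(x)
        return out
```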

Training recipe: cross-modal pretraining, progressive SFT and GSPO+DPO

The model is trained from scratch, based on a dense Qwen2.5-7B backbone, using about 75B open-source multimodal tokens. Training proceeds in stages:

  • Language-centric cross-modal pretraining on paired image-text, audio-text and video-text corpora to align modalities into a shared semantic space.
  • Progressive supervised fine-tuning that activates modality-specific expert groups and introduces control tokens enabling tasks like text-conditioned speech synthesis and image generation within the same interface.
  • Data-balanced annealing to reweight datasets across modalities, reduce overfitting to a single modality and stabilize the final omnimodal behavior.
  • Iterative policy optimisation using GSPO and DPO to produce the Uni-MoE-2.0-Thinking variant, which strengthens long-form, step-by-step reasoning. GSPO generates preference signals (from the model itself or another LLM) and DPO converts those preferences into stable policy updates; a generic DPO objective is sketched below.
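The exact GSPO+DPO interplay is specific to the paper, but the DPO half has a standard form. As a generic reference, a minimal DPO objective on sequence log-probabilities (the beta value is a placeholder) looks like:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard Direct Preference Optimization loss on per-sequence log-probabilities
    under the policy and a frozen reference model. Shown as a generic reference; it is
    not claimed to match the paper's exact GSPO+DPO training loop or settings."""
    chosen_margin = logp_chosen - ref_logp_chosen          # log pi/pi_ref for the preferred answer
    rejected_margin = logp_rejected - ref_logp_rejected    # log pi/pi_ref for the dispreferred answer
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```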

Controlled generation: MoE TTS and task-aware diffusion

Speech generation is handled by a context-aware MoE TTS module that sits on top of the language model. The LLM emits control tokens describing timbre, style and language alongside textual content; the MoE TTS produces discrete audio tokens which an external codec decodes to waveforms, aligning input and output paths.
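As a rough illustration of this flow, the control-token-to-waveform path could be wired as below; every interface here (`llm.generate_with_control`, `moe_tts`, `codec.decode`) is a hypothetical placeholder, not the project's actual API:

```python
def synthesize_speech(llm, moe_tts, codec, prompt, timbre="narrator", style="calm", lang="en"):
    """Hypothetical pipeline sketch: the LLM emits textual content plus control tokens
    (timbre, style, language), the MoE TTS maps them to discrete audio tokens, and an
    external neural codec decodes those tokens into a waveform. All three interfaces
    are illustrative placeholders."""
    control = {"timbre": timbre, "style": style, "language": lang}
    content_and_control = llm.generate_with_control(prompt, control)   # hypothetical LLM call
    audio_tokens = moe_tts(content_and_control)                        # discrete acoustic units
    return codec.decode(audio_tokens)                                  # waveform from external codec
```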

Image generation and editing are driven by a task-aware diffusion transformer conditioned on task tokens and image tokens. Task tokens indicate modes such as text-to-image generation, editing or enhancement, while lightweight projectors map omnimodal tokens into the diffusion conditioning space. This keeps the main omnimodal model frozen during final visual fine-tuning and makes image synthesis instruction-guided and controllable.
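A minimal sketch of the projector idea, with made-up dimensions, task set and layer shapes (the real conditioning interface of the diffusion transformer is not specified here), could look like:

```python
import torch
import torch.nn as nn

class DiffusionConditionProjector(nn.Module):
    """Hypothetical lightweight projector: map frozen omnimodal hidden states plus a
    task embedding into the conditioning space of a diffusion transformer. Dimensions,
    the task vocabulary and the MLP layout are assumptions for illustration."""

    def __init__(self, d_llm=3584, d_cond=1024, n_tasks=4):
        super().__init__()
        self.task_embed = nn.Embedding(n_tasks, d_cond)     # e.g. text-to-image / edit / enhance modes
        self.proj = nn.Sequential(nn.Linear(d_llm, d_cond), nn.GELU(), nn.Linear(d_cond, d_cond))

    def forward(self, omni_hidden, task_id):
        # omni_hidden: (batch, seq, d_llm) states from the frozen omnimodal model
        # task_id:     (batch,) integer id of the requested generation/editing mode
        cond_tokens = self.proj(omni_hidden)                # only this projector is trained
        task_tok = self.task_embed(task_id).unsqueeze(1)    # prepend one task token
        return torch.cat([task_tok, cond_tokens], dim=1)    # conditioning sequence for the DiT
```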

Benchmarks and open checkpoints

Uni-MoE-2.0-Omni was evaluated on 85 multimodal benchmarks covering image, text, video, audio and cross/tri-modal reasoning. On 76 benchmarks shared with Qwen2.5-Omni, Uni-MoE-2.0-Omni surpasses Qwen2.5-Omni on over 50 tasks. Notable gains include about +7% average on video understanding across 8 tasks, +7% average on omnimodality understanding across several benchmarks (including OmniVideoBench and WorldSense), and around +4% on audio-visual reasoning.

For long-form speech, the model reduced word error rate by up to 4.2% relative on long LibriSpeech splits and improved TinyStories-en text-to-speech WER by roughly 1%. Image generation and editing metrics are competitive with specialized visual models, with modest gains on benchmarks like GEdit Bench compared to Ming Lite Omni and improvements over Qwen Image and PixWizard on several low-level metrics.

What this design achieves

Uni-MoE-2.0-Omni combines a language-centric transformer backbone, a unified token interface, Omni Modality 3D RoPE for spatio-temporal alignment, and a Dynamic Capacity MoE with empty, routed and shared experts. Together these let the model balance compute and capability while supporting diverse cross-modal inputs and both understanding and controllable generation of images, text and speech. The staged training recipe and GSPO+DPO optimisation further push the model toward stronger long-form reasoning in the Uni-MoE-2.0-Thinking variant.

For the paper, model weights and code, see the project page and repository linked by the authors.
