LongCat Flash Omni — Open-Source 560B Omni-Modal Model for Real-Time Audio-Visual Interaction

Meituan released LongCat Flash Omni, an open-source 560B omni-modal model that activates ~27B parameters per token and supports real-time audio-visual interaction and 128K-token long-context understanding.

Overview

Meituan's LongCat team released LongCat Flash Omni, an open-source omni-modal model built on a 560-billion-parameter Mixture-of-Experts (MoE) backbone that activates roughly 27 billion parameters per token. The design extends the LongCat Flash language backbone to handle vision, video, and audio while preserving long-context capabilities (128K tokens), so the same stack can manage lengthy conversations and document-level understanding.

Architecture and modal attachments

Instead of redesigning the LLM, LongCat Flash Omni keeps the language model intact and attaches specialized perception modules. A unified LongCat ViT encoder processes both images and video frames, eliminating the need for a separate video tower. Audio is converted into discrete tokens via an audio encoder coupled with a LongCat Audio Codec; the same LLM stream that consumes these tokens can also produce audio, enabling real-time two-way audio-visual interaction.
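As a rough illustration of this composition, the sketch below shows how perception modules can feed an unchanged language backbone through a single interleaved token stream. The class names (VisionEncoder, AudioTokenizer, OmniWrapper) and all dimensions are hypothetical stand-ins, not the released implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the "LLM stays intact, perception modules attach" design.
# Module names and dimensions are illustrative, not from the LongCat release.

class VisionEncoder(nn.Module):
    """Stand-in for the unified ViT that handles both images and video frames."""
    def __init__(self, dim=1024):
        super().__init__()
        self.proj = nn.Linear(3 * 16 * 16, dim)   # toy patch embedding

    def forward(self, patches):                   # (batch, num_patches, 3*16*16)
        return self.proj(patches)                 # -> (batch, num_patches, dim)

class AudioTokenizer(nn.Module):
    """Stand-in for the audio encoder + codec that emits discrete audio tokens."""
    def __init__(self, codebook_size=4096, dim=1024):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, audio_token_ids):           # (batch, audio_len) integer codes
        return self.codebook(audio_token_ids)

class OmniWrapper(nn.Module):
    """Concatenates projected vision/audio features with text embeddings and
    feeds one interleaved sequence to the unchanged language backbone."""
    def __init__(self, llm, llm_dim=1024):
        super().__init__()
        self.llm = llm                            # pre-trained LM, left untouched
        self.vision = VisionEncoder(llm_dim)
        self.audio = AudioTokenizer(dim=llm_dim)

    def forward(self, text_emb, patches, audio_ids):
        vis = self.vision(patches)
        aud = self.audio(audio_ids)
        seq = torch.cat([vis, aud, text_emb], dim=1)
        return self.llm(seq)

# Toy usage with an identity "LLM" just to check shapes.
model = OmniWrapper(nn.Identity(), llm_dim=1024)
out = model(torch.randn(1, 4, 1024),              # text embeddings
            torch.randn(1, 8, 3 * 16 * 16),       # image patches
            torch.randint(0, 4096, (1, 10)))      # audio token ids
print(out.shape)                                  # torch.Size([1, 22, 1024])
```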

Streaming and feature interleaving

The team implements chunk-wise audio-visual feature interleaving: audio features, video features and timestamps are packed into 1-second segments. Video is sampled by default at 2 frames per second, with the sampling rate adjusted according to video duration — a duration-conditioned sampling strategy rather than tying sampling to speaking phases. This approach reduces latency while preserving spatial context required for GUI understanding, OCR and video question answering.
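A minimal sketch of what 1-second chunk packing might look like, assuming per-second audio feature frames and a 2 fps video default as described above. The Segment dataclass, field names, and the interleave function are guesses for illustration; the released segment format is not published here.

```python
import math
from dataclasses import dataclass
from typing import List

# Hypothetical packing of audio/video features into 1-second segments.
# The 2 fps video default matches the description; everything else is illustrative.

@dataclass
class Segment:
    timestamp: float          # start of the 1-second window
    audio_feats: List[list]   # audio feature frames falling in this window
    video_feats: List[list]   # video frames sampled at ~2 fps

def interleave(audio_frames, audio_rate_hz, video_frames, video_fps=2.0):
    """Group modality features into 1-second chunks for streaming decode."""
    duration = max(len(audio_frames) / audio_rate_hz,
                   len(video_frames) / video_fps)
    segments = []
    for sec in range(math.ceil(duration)):
        a_start, a_end = int(sec * audio_rate_hz), int((sec + 1) * audio_rate_hz)
        v_start, v_end = int(sec * video_fps), int((sec + 1) * video_fps)
        segments.append(Segment(
            timestamp=float(sec),
            audio_feats=audio_frames[a_start:a_end],
            video_feats=video_frames[v_start:v_end],
        ))
    return segments

# Example: 3 seconds of audio at 25 feature frames/s, video at 2 fps.
segs = interleave([[0.0]] * 75, 25, [[1.0]] * 6)
print(len(segs), len(segs[0].audio_feats), len(segs[0].video_feats))  # 3 25 2
```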

Training curriculum

Training follows a staged curriculum. The text backbone is trained first (LongCat Flash text pretraining), which yields per-token activation in the range of 18.6B to 31.3B parameters and averages about 27B. Continued pretraining phases include text-to-speech alignment, multimodal pretraining with image and video, extension of context length to 128K, and final alignment with the audio encoder.
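The staged ordering can be summarized as a simple configuration, sketched below. Stage names and sequence follow the description above; the dictionary keys and the compact labels are illustrative only.

```python
# Hypothetical summary of the staged curriculum described above.
CURRICULUM = [
    {"stage": "text_pretraining",         "adds": "LongCat Flash text backbone"},
    {"stage": "text_to_speech_alignment", "adds": "speech/text alignment"},
    {"stage": "multimodal_pretraining",   "adds": "image and video data"},
    {"stage": "context_extension",        "adds": "context length extended to 128K"},
    {"stage": "audio_encoder_alignment",  "adds": "final alignment with the audio encoder"},
]

for step, s in enumerate(CURRICULUM, 1):
    print(f"stage {step}: {s['stage']} -> {s['adds']}")
```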

Systems design and modality-decoupled parallelism

Because encoders and the LLM exhibit different compute patterns, Meituan uses modality-decoupled parallelism. Vision and audio encoders run with hybrid sharding and activation recomputation, while the LLM uses pipeline, context and expert parallelism. A ModalityBridge aligns embeddings and gradients between modules. According to the release, multimodal supervised fine-tuning (SFT) retains more than 90% of the throughput of text-only training — the main systems achievement highlighted by the team.
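A conceptual, single-process sketch of the boundary between encoders and the LLM follows. The ModalityBridge name comes from the release, but its interface here is guessed: it simply projects encoder features into the LLM embedding space so activations flow forward and gradients flow back, standing in for the real cross-parallelism exchange.

```python
import torch
import torch.nn as nn

# Conceptual sketch only. In the real system the encoders and the LLM run under
# different parallelism schemes (hybrid sharding vs. pipeline/context/expert
# parallelism); here a single-process bridge stands in for that boundary.
# The ModalityBridge name is from the release, but this interface is guessed.

class ModalityBridge(nn.Module):
    """Projects encoder features into the LLM embedding space so that
    activations cross the boundary forward and gradients cross it back."""
    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, llm_dim)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, encoder_feats: torch.Tensor) -> torch.Tensor:
        return self.norm(self.proj(encoder_feats))

# Toy forward/backward pass across the bridge.
bridge = ModalityBridge(encoder_dim=768, llm_dim=1024)
feats = torch.randn(2, 16, 768, requires_grad=True)   # encoder output
llm_input = bridge(feats)                              # handed to the LLM side
llm_input.sum().backward()                             # gradients reach the encoder
print(feats.grad.shape)                                # torch.Size([2, 16, 768])
```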

Benchmarks

LongCat Flash Omni scores 61.4 on OmniBench, outperforming Qwen 3 Omni Instruct (58.5) and Qwen 2.5 Omni (55.0), while trailing Gemini 2.5 Pro (66.8). On VideoMME it achieves 78.2 — comparable to GPT-4o and Gemini 2.5 Flash — and on VoiceBench it records 88.7, slightly above GPT-4o Audio in the same comparison.

What this means

LongCat Flash Omni demonstrates a practical route to omni-modal interaction: a high-capacity MoE language backbone that remains inference-friendly thanks to shortcut-connected experts, a unified vision/video encoder, and a streaming audio path that enables synchronized decoding across modalities. Combined with a duration-conditioned, low-latency sampling strategy and modality-decoupled parallelism, the release aims to make real-time any-to-any audio-visual interaction feasible while preserving long-context reasoning and competitive benchmark performance.

Further resources

The code, model weights, and additional documentation are available in the project's GitHub repository. The release also includes pointers to the paper and notebooks for those who want to reproduce experiments or build on the model.
