Liquid AI Launches LFM2-Audio-1.5B: End-to-End Audio Model with Sub-100 ms Latency
What LFM2-Audio-1.5B brings
Liquid AI has introduced LFM2-Audio-1.5B, a compact audio‑language foundation model that reads and produces both speech and text through a single end-to-end stack. The design targets low-latency, real-time assistants on resource-constrained devices by extending the LFM2 family to handle audio as a first-class modality while keeping a small footprint.
Unified backbone and disentangled audio I/O
LFM2-Audio adapts the 1.2B-parameter LFM2 language backbone to treat audio and text as sequence tokens. A key innovation is the disentangling of audio input and output representations: inputs are continuous embeddings projected directly from raw waveform chunks of roughly 80 ms, while outputs are discrete audio codes. This approach avoids discretization artifacts on the input path and preserves autoregressive training and generation on the output path for both modalities.
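To make the split concrete, here is a minimal sketch of the two paths, assuming 16 kHz mono audio and using placeholder arrays in place of the learned encoder and projection; it illustrates the shapes involved, not the liquid-audio API.

```python
# Conceptual sketch of the disentangled I/O described above (not the liquid-audio API).
# Assumptions: 16 kHz mono input; the real input path uses a FastConformer encoder and a
# learned projection, and the real output path is produced by an RQ-Transformer.
import numpy as np

SAMPLE_RATE = 16_000                           # assumed; ~80 ms chunks -> 1280 samples
CHUNK_SAMPLES = int(0.080 * SAMPLE_RATE)

def chunk_waveform(wav: np.ndarray) -> np.ndarray:
    """Split a mono waveform into ~80 ms frames (input path: continuous, no quantization)."""
    n_chunks = len(wav) // CHUNK_SAMPLES
    return wav[: n_chunks * CHUNK_SAMPLES].reshape(n_chunks, CHUNK_SAMPLES)

# Input path: each ~80 ms chunk is embedded directly as a continuous vector for the backbone.
wav = np.random.randn(SAMPLE_RATE * 2).astype(np.float32)   # 2 s of placeholder audio
chunks = chunk_waveform(wav)                                 # shape: (25, 1280)

# Output path: the model instead predicts discrete codec tokens, 8 codebooks per step,
# each index drawn from a 2,049-entry codebook (per the model card).
N_CODEBOOKS, CODEBOOK_SIZE = 8, 2049
audio_step = np.random.randint(0, CODEBOOK_SIZE, size=N_CODEBOOKS)  # one generation step
print(chunks.shape, audio_step)
```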
Model components and implementation
The released checkpoint and model card describe the stack used in the release:
- Backbone: LFM2 (hybrid convolution + attention), 1.2B parameters for the language model component
- Audio encoder: FastConformer (~115M, canary-180m-flash)
- Audio decoder: RQ-Transformer predicting discrete Mimi codec tokens (8 codebooks)
- Context window: 32,768 tokens
- Vocabulary sizes: 65,536 for text; 2,049 per codebook × 8 codebooks for audio
- Precision: bfloat16
- License: LFM Open License v1.0
- Languages: English
Liquid AI also provides a Python package (liquid-audio) and a Gradio demo for reproducing the generation modes described below.
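As a starting point, the checkpoint can be fetched with standard Hugging Face tooling; the repo id below is an assumption based on the distribution mentioned above, and actual generation should go through the liquid-audio package or the Gradio demo.

```python
# Minimal sketch for pulling the released checkpoint locally. The repo id is an assumed
# example, not confirmed by the article; install the official package with
#   pip install liquid-audio
# to actually run generation.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("LiquidAI/LFM2-Audio-1.5B")  # assumed repo id
print("checkpoint files downloaded to:", local_dir)
```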
Two generation modes for agents
LFM2-Audio supports two generation modes tailored for different real-time needs:
- Interleaved generation: Alternates text and audio tokens for live speech-to-speech chat, minimizing perceived latency by emitting early audio.
- Sequential generation: Switches modalities turn-by-turn for ASR and TTS workflows.
These modes let the model be used for speech recognition, text-to-speech, classification, and conversational agents from the same backbone.
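The scheduling difference can be pictured with a toy emission loop; the sketch below is purely illustrative (the token streams, interleave ratio, and function names are placeholders, not the liquid-audio API).

```python
# Toy illustration of the two schedules: the point is *when* audio tokens start flowing,
# not how they are produced.
from typing import Iterator, Tuple

text_tokens = ["Sure,", " here", " you", " go."]
audio_tokens = [f"<a{i}>" for i in range(8)]   # placeholder discrete audio codes

def interleaved(text, audio) -> Iterator[Tuple[str, str]]:
    """Alternate modalities so the first audio token is emitted almost immediately."""
    audio_iter = iter(audio)
    for txt in text:
        yield ("text", txt)
        for _ in range(2):                     # assumed 1:2 text-to-audio interleave ratio
            nxt = next(audio_iter, None)
            if nxt is not None:
                yield ("audio", nxt)
    for rest in audio_iter:
        yield ("audio", rest)

def sequential(text, audio) -> Iterator[Tuple[str, str]]:
    """Finish one modality before switching, as in ASR or TTS turns."""
    yield from (("text", t) for t in text)
    yield from (("audio", a) for a in audio)

print(list(interleaved(text_tokens, audio_tokens))[:4])  # audio appears by the 2nd step
print(list(sequential(text_tokens, audio_tokens))[:4])   # audio only after all text
```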
Latency and benchmarks
Liquid AI reports end-to-end latency below 100 ms from a 4-second audio query to the first audible response under its test setup. The company presents this figure as a proxy for perceived responsiveness and, in its comparisons, reports lower latency than other models smaller than 1.5B parameters.
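For context, "latency to first audio" is typically measured the way the sketch below does against any streaming speech interface; generate_stream is a hypothetical stand-in, not part of the released tooling.

```python
# Sketch of measuring time-to-first-audio against a streaming interface.
# generate_stream is a hypothetical callable yielding (kind, payload) pairs.
import time

def time_to_first_audio(generate_stream, query_audio) -> float:
    """Return seconds from submitting the query to receiving the first audio chunk."""
    start = time.perf_counter()
    for kind, _payload in generate_stream(query_audio):
        if kind == "audio":
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")
```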
On VoiceBench, a suite of nine audio-assistant evaluations introduced in late 2024, LFM2-Audio-1.5B scores 56.78 overall, with per-task breakdowns shown in the Liquid AI blog. The Hugging Face model card adds a second VoiceBench table and classic ASR word-error-rate (WER) comparisons, where LFM2-Audio matches or improves on Whisper-large-v3-turbo on some datasets (for example, AMI 15.36 vs. 16.13 and LibriSpeech-clean 2.03 vs. 2.10; lower is better).
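For reference, the WER figures quoted above follow the standard definition, word-level edit distance divided by the number of reference words; a minimal sketch:

```python
# Word error rate as reported above: (substitutions + deletions + insertions) / reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 error / 6 words ≈ 0.167
```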
Why this matters for voice AI
Many omni stacks chain ASR → LLM → TTS, which increases latency and introduces brittle interfaces. LFM2-Audio's single-backbone design, with continuous input embeddings and discrete output codes, reduces glue logic and enables interleaved decoding for early audio emission. For developers, that translates into simpler pipelines and faster perceived response times while keeping multi-task support in one model. The release, demos, and Hugging Face distribution make it straightforward to experiment with real-time audio agents on constrained hardware.