NVIDIA Unveils Nemotron ASR for Low-Latency Applications
Explore NVIDIA's new Nemotron Speech ASR model designed for voice agents and live captioning with low-latency performance.
Overview
NVIDIA has just released its new streaming English transcription model, Nemotron Speech ASR, built specifically for low-latency voice agents and live captioning. The checkpoint nvidia/nemotron-speech-streaming-en-0.6b on Hugging Face combines a cache-aware FastConformer encoder with an RNNT decoder, tuned for both streaming and batch workloads on modern NVIDIA GPUs.
Model Design, Architecture, and Input Assumptions
Nemotron Speech ASR (Automatic Speech Recognition) is a 600M-parameter model based on a cache-aware FastConformer encoder with 24 layers and an RNNT decoder. The encoder uses aggressive 8x convolutional downsampling to reduce the number of time steps, lowering compute and memory costs for streaming workloads. The model consumes 16 kHz mono audio and requires at least 80 ms of input audio per chunk.
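The input assumptions above translate into simple chunk arithmetic. The sketch below works through one 80 ms chunk, assuming a 10 ms feature hop (typical for Conformer-style front ends, not stated in the article) ahead of the 8x downsampling:

```python
SAMPLE_RATE = 16_000   # Hz; the model consumes 16 kHz mono audio
CHUNK_MS = 80          # minimum chunk size, in milliseconds
FEATURE_HOP_MS = 10    # assumed mel-feature hop; typical, not confirmed here
DOWNSAMPLE = 8         # encoder's convolutional downsampling factor

# Raw samples the client must buffer per minimum-size chunk.
samples_per_chunk = SAMPLE_RATE * CHUNK_MS // 1000      # 1280 samples

# Feature frames before downsampling, encoder steps after it.
feature_frames = CHUNK_MS // FEATURE_HOP_MS             # 8 frames
encoder_frames = feature_frames // DOWNSAMPLE           # 1 encoder step

print(samples_per_chunk, feature_frames, encoder_frames)  # 1280 8 1
```

Under these assumptions, each 80 ms chunk collapses to a single encoder time step, which is what makes the aggressive downsampling pay off for streaming.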
Runtime latency is controlled through configurable context sizes. The model exposes four standard chunk configurations, corresponding to about 80 ms, 160 ms, 560 ms, and 1.12 s of audio, set via the att_context_size parameter.
Cache Aware Streaming, Not Buffered Sliding Windows
Traditional "streaming ASR" systems often use overlapping windows, reprocessing previous audio to maintain context, which increases both compute and latency. Nemotron Speech ASR instead keeps a cache of encoder states, so each new chunk is processed exactly once. This approach yields:
- Non-overlapping frame processing, scaling work linearly with audio length.
- Predictable memory growth, as cache size grows with sequence length.
- Stable latency under load, which is critical for voice agents.
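The compute difference between the two approaches can be made concrete with a small counting sketch. The function names and window/hop numbers below are illustrative, not from the model itself:

```python
def frames_processed_buffered(total_frames: int, window: int, hop: int) -> int:
    """Overlapping-window streaming: each step re-encodes a full `window`
    of frames while advancing only `hop`, so earlier audio is reprocessed."""
    steps = max(0, (total_frames - window) // hop) + 1
    return steps * window

def frames_processed_cached(total_frames: int) -> int:
    """Cache-aware streaming: every frame is encoded exactly once; context
    comes from cached encoder states, so work is linear in audio length."""
    return total_frames

# Example: 800 frames of audio, a 160-frame window, an 80-frame hop.
print(frames_processed_buffered(800, 160, 80))  # 1440 frames encoded
print(frames_processed_cached(800))             # 800 frames encoded
```

With these illustrative numbers the buffered scheme encodes nearly twice as many frames for the same audio, and the gap widens as the window/hop ratio grows; the cached scheme stays linear regardless.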
Accuracy vs Latency: WER Under Streaming Constraints
Nemotron Speech ASR is evaluated on Hugging Face OpenASR leaderboard datasets, including AMI, Earnings22, Gigaspeech, and LibriSpeech. Accuracy is reported as Word Error Rate (WER).
Performance Metrics
- Approximately 7.84% WER at 0.16 s chunk size.
- Approximately 7.22% WER at 0.56 s chunk size.
- Approximately 7.16% WER at 1.12 s chunk size.
Developers can adjust the chunk size based on application needs, trading latency against accuracy.
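One way to operationalize that tradeoff is to pick the largest chunk that fits a latency budget, since larger chunks carry more context and lower WER. The helper below is a hypothetical sketch using the approximate operating points reported above; treat the numbers as illustrative, not guarantees:

```python
# Approximate operating points from this article: chunk size (s) -> WER (%).
OPERATING_POINTS = {0.16: 7.84, 0.56: 7.22, 1.12: 7.16}

def pick_chunk(latency_budget_s: float) -> float:
    """Return the largest chunk size that fits within the latency budget."""
    fitting = [chunk for chunk in OPERATING_POINTS if chunk <= latency_budget_s]
    if not fitting:
        raise ValueError("budget is below the smallest supported chunk size")
    return max(fitting)

chunk = pick_chunk(0.6)                 # 0.56 s fits a 600 ms budget
print(chunk, OPERATING_POINTS[chunk])   # 0.56 7.22
```

A voice agent with a tight interaction budget might land on the 0.16 s point, while live captioning, which tolerates more delay, can take the accuracy of the 1.12 s configuration.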
Throughput and Concurrency on Modern GPUs
The cache-aware design directly improves concurrency. On an NVIDIA H100 GPU, Nemotron supports about 560 concurrent streams at a 320 ms chunk size, roughly 3x the concurrency of a baseline streaming system. Similar throughput gains are seen on RTX A5000 and DGX B200.
Latency Stability
Latency remains stable even as concurrency increases, with a median end-to-end delay around 182 ms during tests with 127 concurrent WebSocket clients.
Training Data and Ecosystem Integration
Nemotron Speech ASR is primarily trained on NVIDIA’s Granary dataset, totaling about 285K hours of audio, incorporating various public speech corpora.
Key Takeaways
- Nemotron Speech ASR is a 0.6B parameter model for streaming that operates on 16 kHz mono audio with at least 80 ms chunks.
- The model allows trading latency for accuracy with four configurable chunk sizes, keeping WER between roughly 7.2% and 7.8%.
- Cache-aware streaming avoids recomputation, yielding higher concurrency on various NVIDIA GPUs.
- With high concurrency and low latency, Nemotron shows promising results for real-time applications.
- Released under the NVIDIA Permissive Open Model License, it allows teams to self-host and fine-tune for specific applications.