NVIDIA Unveils Nemotron ASR for Low-Latency Applications
Explore NVIDIA's new Nemotron Speech ASR model designed for voice agents and live captioning with low-latency performance.
Overview
NVIDIA has just released its new streaming English transcription model, Nemotron Speech ASR, built specifically for low-latency voice agents and live captioning. The checkpoint nvidia/nemotron-speech-streaming-en-0.6b on Hugging Face combines a cache-aware FastConformer encoder with an RNNT decoder, tuned for both streaming and batch workloads on modern NVIDIA GPUs.
Model Design, Architecture, and Input Assumptions
Nemotron Speech ASR (Automatic Speech Recognition) is a 600M-parameter model based on a cache-aware FastConformer encoder with 24 layers and an RNNT decoder. The encoder uses aggressive 8x convolutional downsampling to reduce the number of time steps, lowering compute and memory costs for streaming workloads. The model consumes 16 kHz mono audio and requires at least 80 ms of input audio per chunk.
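The input assumptions above translate into simple chunk arithmetic. The sketch below works through one 80 ms chunk, assuming a 10 ms feature hop (typical for Conformer-style front ends, not stated in the article) ahead of the 8x downsampling:

```python
SAMPLE_RATE = 16_000   # Hz; the model consumes 16 kHz mono audio
CHUNK_MS = 80          # minimum chunk size, in milliseconds
FEATURE_HOP_MS = 10    # assumed mel-feature hop; typical, not confirmed here
DOWNSAMPLE = 8         # encoder's convolutional downsampling factor

# Raw samples the client must buffer per minimum-size chunk.
samples_per_chunk = SAMPLE_RATE * CHUNK_MS // 1000      # 1280 samples

# Feature frames before downsampling, encoder steps after it.
feature_frames = CHUNK_MS // FEATURE_HOP_MS             # 8 frames
encoder_frames = feature_frames // DOWNSAMPLE           # 1 encoder step

print(samples_per_chunk, feature_frames, encoder_frames)  # 1280 8 1
```

Under these assumptions, each 80 ms chunk collapses to a single encoder time step, which is what makes the aggressive downsampling pay off for streaming.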
Runtime latency is controlled through configurable context sizes. The model exposes four standard chunk configurations, corresponding to about 80 ms, 160 ms, 560 ms, and 1.12 s of audio, set via the att_context_size parameter.
Cache Aware Streaming, Not Buffered Sliding Windows
Traditional "streaming ASR" systems often use overlapping windows, reprocessing previous audio to maintain context, which increases both compute and latency. Nemotron Speech ASR instead keeps a cache of encoder states, so each new chunk is processed exactly once. This approach yields:
- Non-overlapping frame processing, scaling work linearly with audio length.
- Predictable memory growth, as cache size grows with sequence length.
- Stable latency under load, which is critical for voice agents.
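The compute difference between the two approaches can be made concrete with a small counting sketch. The function names and window/hop numbers below are illustrative, not from the model itself:

```python
def frames_processed_buffered(total_frames: int, window: int, hop: int) -> int:
    """Overlapping-window streaming: each step re-encodes a full `window`
    of frames while advancing only `hop`, so earlier audio is reprocessed."""
    steps = max(0, (total_frames - window) // hop) + 1
    return steps * window

def frames_processed_cached(total_frames: int) -> int:
    """Cache-aware streaming: every frame is encoded exactly once; context
    comes from cached encoder states, so work is linear in audio length."""
    return total_frames

# Example: 800 frames of audio, a 160-frame window, an 80-frame hop.
print(frames_processed_buffered(800, 160, 80))  # 1440 frames encoded
print(frames_processed_cached(800))             # 800 frames encoded
```

With these illustrative numbers the buffered scheme encodes nearly twice as many frames for the same audio, and the gap widens as the window/hop ratio grows; the cached scheme stays linear regardless.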
Accuracy vs Latency: WER Under Streaming Constraints
Nemotron Speech ASR is evaluated on Hugging Face OpenASR leaderboard datasets, including AMI, Earnings22, Gigaspeech, and LibriSpeech. Accuracy is reported as Word Error Rate (WER).
Performance Metrics
- Approximately 7.84% WER at 0.16 s chunk size.
- Approximately 7.22% WER at 0.56 s chunk size.
- Approximately 7.16% WER at 1.12 s chunk size.
Developers can adjust the chunk size based on application needs, trading latency against accuracy.
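One way to operationalize that tradeoff is to pick the largest chunk that fits a latency budget, since larger chunks carry more context and lower WER. The helper below is a hypothetical sketch using the approximate operating points reported above; treat the numbers as illustrative, not guarantees:

```python
# Approximate operating points from this article: chunk size (s) -> WER (%).
OPERATING_POINTS = {0.16: 7.84, 0.56: 7.22, 1.12: 7.16}

def pick_chunk(latency_budget_s: float) -> float:
    """Return the largest chunk size that fits within the latency budget."""
    fitting = [chunk for chunk in OPERATING_POINTS if chunk <= latency_budget_s]
    if not fitting:
        raise ValueError("budget is below the smallest supported chunk size")
    return max(fitting)

chunk = pick_chunk(0.6)                 # 0.56 s fits a 600 ms budget
print(chunk, OPERATING_POINTS[chunk])   # 0.56 7.22
```

A voice agent with a tight interaction budget might land on the 0.16 s point, while live captioning, which tolerates more delay, can take the accuracy of the 1.12 s configuration.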
Throughput and Concurrency on Modern GPUs
The cache-aware design directly improves concurrency. On an NVIDIA H100 GPU, Nemotron supports about 560 concurrent streams at a 320 ms chunk size, roughly 3x the concurrency of a baseline streaming system. Similar throughput gains are seen on RTX A5000 and DGX B200.
Latency Stability
Latency remains stable even as concurrency increases, with a median end-to-end delay around 182 ms during tests with 127 concurrent WebSocket clients.
Training Data and Ecosystem Integration
Nemotron Speech ASR is primarily trained on NVIDIA’s Granary dataset, totaling about 285K hours of audio, incorporating various public speech corpora.
Key Takeaways
- Nemotron Speech ASR is a 0.6B parameter model for streaming that operates on 16 kHz mono audio with at least 80 ms chunks.
- The model allows trading latency for accuracy with four configurable chunk sizes, keeping WER between roughly 7.2% and 7.8%.
- Cache-aware streaming avoids recomputation, yielding higher concurrency on various NVIDIA GPUs.
- With high concurrency and low latency, Nemotron shows promising results for real-time applications.
- Released under the NVIDIA Permissive Open Model License, it allows teams to self-host and fine-tune for specific applications.