NVIDIA Launches PersonaPlex-7B-v1: Real-Time Speech Model
Explore NVIDIA's PersonaPlex-7B-v1, a cutting-edge speech-to-speech model for natural conversations.
Overview
NVIDIA researchers have released PersonaPlex-7B-v1, a full-duplex speech-to-speech conversational model designed for natural voice interactions with precise persona control.
Transition from Traditional Models to PersonaPlex
Conventional voice assistants typically operate as a cascade: automatic speech recognition (ASR) converts speech to text, a language model composes a text answer, and text-to-speech (TTS) converts that answer back to audio. This approach accumulates latency at each stage and struggles with overlapping speech and interruptions.
PersonaPlex collapses this pipeline into a single Transformer. It performs streaming speech understanding and speech generation in one architecture, operating on continuous audio encoded as discrete tokens by a neural codec. By jointly predicting text and audio tokens, the model supports natural conversational dynamics such as overlapping speech and quick turn-taking.
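To make the contrast concrete, here is a minimal Python sketch of the two designs. Every name in it (the ASR/LLM/TTS callables, the `codec` and `model` objects, the frame granularity) is a hypothetical placeholder for illustration, not the actual PersonaPlex API.

```python
# Hypothetical sketch only; none of these names are the real PersonaPlex API.

# Cascaded pipeline: stages run in sequence, so latency accumulates and the
# system cannot listen while it is composing or speaking a reply.
def cascaded_turn(user_audio, asr_transcribe, llm_reply, tts_synthesize):
    text = asr_transcribe(user_audio)   # speech -> text
    answer = llm_reply(text)            # text -> text
    return tts_synthesize(answer)       # text -> speech

# Full-duplex streaming: one model consumes and emits codec tokens frame by
# frame, so it can listen and speak at the same time.
def full_duplex_loop(mic_frames, codec, model):
    state = model.initial_state()
    for frame in mic_frames:                                   # short audio frames
        user_tokens = codec.encode(frame)                      # audio -> discrete tokens
        state, agent_tokens = model.step(state, user_tokens)   # joint text+audio prediction
        yield codec.decode(agent_tokens)                       # agent audio for this frame
```

The key difference is that the streaming loop never blocks on a complete utterance: every frame updates the shared state, which is what makes barge-in handling possible.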
Dual Stream Configuration
PersonaPlex runs a dual-stream setup: one stream carries the user's audio, while the second tracks the agent's own speech and text. Both streams share the model's state, letting the agent listen while it speaks and adapt in real time to user interruptions. The architecture draws inspiration from Kyutai's Moshi full-duplex framework.
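A rough sketch of the dual-stream idea follows, with hypothetical names and a simplified token layout (the actual arrangement in PersonaPlex and Moshi is more elaborate):

```python
from dataclasses import dataclass

@dataclass
class Frame:
    """One time step of the conversation (hypothetical layout)."""
    user_audio_tokens: list[int]   # stream 1: codec tokens of incoming user audio
    agent_audio_tokens: list[int]  # stream 2: codec tokens the agent just emitted
    agent_text_tokens: list[int]   # stream 2: the agent's accompanying text tokens

def duplex_step(model, state, frame: Frame):
    # Both streams condition the same shared state, so evidence of a user
    # interruption in stream 1 immediately shapes the agent's next tokens.
    inputs = (frame.user_audio_tokens
              + frame.agent_audio_tokens
              + frame.agent_text_tokens)
    return model.step(state, inputs)  # -> (new_state, next agent tokens)
```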
Defining Conversational Identity
PersonaPlex employs two types of prompts:
- Voice Prompt: Encodes vocal traits, speaking style, and prosody as a sequence of audio tokens.
- Text Prompt: Describes the agent's role, background, and scenario context in natural language.
In addition, a system prompt of up to 200 tokens carries personalization details such as the agent's name and organization.
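As a concrete illustration, the three prompt types could be assembled into a single conditioning prefix along these lines. Only the 200-token system-prompt budget comes from the release; the helper, tokenizer, and codec calls are assumptions:

```python
MAX_SYSTEM_TOKENS = 200  # system-prompt budget stated for PersonaPlex

def build_persona_prefix(codec, tokenizer, voice_sample, role_text, system_text):
    """Assemble a persona prefix (hypothetical helper, not the real API)."""
    voice_tokens = codec.encode(voice_sample)      # voice prompt: traits, style, prosody
    role_tokens = tokenizer.encode(role_text)      # text prompt: role, background, scenario
    system_tokens = tokenizer.encode(system_text)  # name, organization, etc.
    if len(system_tokens) > MAX_SYSTEM_TOKENS:
        raise ValueError("system prompt exceeds the 200-token budget")
    return voice_tokens + role_tokens + system_tokens
```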
Technical Architecture and Training
With 7 billion parameters, PersonaPlex follows the Moshi architecture and is built on a Helium language-model backbone. A Mimi speech encoder, which blends ConvNet and Transformer layers, converts audio waveforms into discrete tokens; the decoder turns predicted tokens back into output audio at a 24 kHz sample rate.
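In code, the codec round trip described above looks roughly like this. The interface is hypothetical (the real Mimi API may differ); only the 24 kHz output rate and the encoder/decoder roles are taken from the article:

```python
import numpy as np

OUTPUT_SAMPLE_RATE = 24_000  # Hz, per the article

def codec_round_trip(mimi, backbone, waveform: np.ndarray) -> np.ndarray:
    # Mimi encoder (ConvNet + Transformer layers): waveform -> discrete tokens
    tokens = mimi.encode(waveform)
    # The 7B Helium-based backbone predicts the agent's next audio tokens
    predicted = backbone.generate(tokens)
    # The decoder reconstructs a 24 kHz waveform from the predicted tokens
    return mimi.decode(predicted)
```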
Training Data
Training combines real and synthetic conversations. The real data comprises 7,303 calls (1,217 hours) from the Fisher English corpus, augmented with prompts generated by GPT-OSS-120B to diversify conversational style. Synthetic data contributes the bulk of the mix: 39,322 generated assistant conversations and 105,410 generated customer-service conversations.
Performance Evaluation
PersonaPlex is benchmarked on FullDuplexBench and, for customer-service interactions, ServiceDuplexBench. It performs strongly on metrics such as smooth turn-taking and interruption handling, achieving takeover rates of 0.908 and 0.950 on the two benchmarks, respectively, while maintaining low latency.
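For intuition, a takeover-rate style metric can be read as the fraction of user interruption attempts after which the agent actually yields the turn. The formulation below is an assumption for illustration; the exact benchmark definitions may differ.

```python
def takeover_rate(interruption_events: list[bool]) -> float:
    """Each entry is True if the agent yielded the turn (stopped speaking and
    responded) after a user interruption attempt. Illustrative only; not the
    official FullDuplexBench/ServiceDuplexBench definition."""
    if not interruption_events:
        return 0.0
    return sum(interruption_events) / len(interruption_events)

# e.g. a rate of 0.908 means the agent yielded on ~91% of interruption attempts
print(takeover_rate([True, True, False, True]))  # 0.75
```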
Key Takeaways
- Architecture: A 7-billion-parameter conversational model designed for full-duplex interaction.
- Streamlined Processing: Predicts text and audio tokens jointly in a single streaming model, rather than chaining ASR, LLM, and TTS stages.
- Persona Control: Uses hybrid voice-plus-text prompting to define character traits and conversational context.
- Diverse Training: Combines real and synthetic dialogue data for robust conversational coverage.
- High Performance: Handles user interruptions smoothly while maintaining dialogue quality and low latency.