
Step-Audio-EditX: Open-Source 3B Audio LLM That Lets You Edit Speech Like Text

Step-Audio-EditX is an open-source 3B audio LLM that edits speech at the token level, enabling iterative, controllable changes to emotion, style, and paralinguistic cues through large-margin synthetic data and token-level PPO.

From waveform editing to token-level control

StepFun AI open-sourced Step-Audio-EditX, a 3-billion-parameter audio LLM that treats speech editing like a text operation rather than low-level waveform manipulation. Instead of operating on continuous signals, the model maps audio to discrete token streams and performs expressive edits at the token level, enabling direct, iterative, and controllable changes to emotion, speaking style, and paralinguistic cues.

Dual codebook tokenizer and a compact audio LLM

The system uses the Step-Audio dual-codebook tokenizer. Speech is encoded into two interleaved token streams: a linguistic stream at 16.7 Hz with a 1024-entry codebook, and a semantic stream at 25 Hz with a 4096-entry codebook. Because both streams preserve prosody and emotion, the representations remain partly entangled rather than cleanly separated into content and style.
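The 16.7 Hz and 25 Hz rates imply a 2:3 token ratio per unit of audio, so the two streams can be zipped in fixed-size groups. The sketch below shows one plausible interleaving under that assumption; the actual merge scheme used by the tokenizer may differ.

```python
# Hypothetical 2:3 interleaving of the dual-codebook streams: every ~120 ms
# of audio yields 2 linguistic tokens (16.7 Hz) and 3 semantic tokens (25 Hz).

def interleave_dual_codebook(linguistic, semantic):
    """Merge two token streams covering the same audio span in 2:3 groups."""
    assert len(linguistic) * 3 == len(semantic) * 2, "streams must cover the same span"
    merged = []
    for i in range(len(linguistic) // 2):
        merged.extend(linguistic[2 * i : 2 * i + 2])  # 2 linguistic tokens
        merged.extend(semantic[3 * i : 3 * i + 3])    # 3 semantic tokens
    return merged

# one 120 ms group: 2 linguistic tokens (codebook size 1024) + 3 semantic (4096)
print(interleave_dual_codebook([10, 11], [2000, 2001, 2002]))
# [10, 11, 2000, 2001, 2002]
```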

On top of that tokenizer sits a 3B parameter audio LLM. The model is initialized from a text LLM and trained on a blended corpus that mixes pure text and dual-codebook audio tokens in chat-style prompts. The model can read text tokens, audio tokens, or both, and always outputs dual-codebook audio tokens.

A separate audio decoder reconstructs the waveform: a diffusion-transformer flow-matching module predicts Mel spectrograms from audio tokens, reference audio, and a speaker embedding, and a BigVGANv2 vocoder produces the final waveform. The flow-matching module was trained on roughly 200,000 hours of high-quality speech to improve pronunciation and timbre fidelity. (See the paper: https://arxiv.org/pdf/2511.03601)
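The decode path can be pictured as a two-stage pipeline. The sketch below stubs both stages to show the data flow only; the function names, the 80 Mel bins, and the 256-sample hop are illustrative assumptions, not details of the released decoder.

```python
# Illustrative two-stage decode: tokens -> Mel frames -> waveform.
# Both stages are stubs; names and shapes are assumptions for the sketch.

def flow_matching_to_mel(audio_tokens, reference_audio, speaker_embedding):
    # The real module is a diffusion-transformer flow-matching model that
    # predicts Mel frames conditioned on tokens, reference audio, and speaker.
    # Stub: one placeholder 80-bin frame per token.
    return [[0.0] * 80 for _ in audio_tokens]

def bigvganv2_vocode(mel_frames):
    # The real vocoder (BigVGANv2) upsamples Mel frames to audio samples.
    # Stub: assume a fixed hop of 256 samples per frame.
    return [0.0] * (len(mel_frames) * 256)

tokens = list(range(50))
mel = flow_matching_to_mel(tokens, reference_audio=None, speaker_embedding=None)
wav = bigvganv2_vocode(mel)
print(len(mel), len(wav))  # 50 12800
```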

Large-margin synthetic data for controllability

Rather than adding complex disentangling encoders, Step-Audio-EditX relies on large-margin synthetic training data. The idea is to keep text fixed while changing one attribute with a clear margin, so the model learns to associate textual instructions with specific token-level edits.

For zero-shot TTS and cloning the team used a large in-house dataset spanning mainly Chinese and English, with some Cantonese and Sichuanese, covering about 60,000 speakers and broad intra- and inter-speaker variation. Emotion and style editing training uses synthetic triplets: human voice actors record 10-second clips for each emotion and style, then a cloning pipeline generates neutral and emotional versions for the same text and speaker. A margin scoring model trained on a small human-labeled set keeps only pairs with a score >= 6.
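The margin-filtering step above reduces to a simple threshold over scored triplets. The sketch below keeps only pairs reaching the paper's cutoff of 6; `score_margin` is a stand-in for the margin scoring model, and the triplet layout is an assumption.

```python
# Hypothetical large-margin filtering of (neutral, emotional, score) triplets:
# keep only pairs whose attribute margin, per the scoring model, is >= 6.

THRESHOLD = 6

def filter_large_margin(pairs, score_margin, threshold=THRESHOLD):
    """pairs: iterable of triplets; score_margin maps a triplet to a 1-10 margin."""
    return [p for p in pairs if score_margin(p) >= threshold]

# toy stand-in: the margin is carried directly in the triplet for illustration
pairs = [("n1", "e1", 7), ("n2", "e2", 4), ("n3", "e3", 6)]
kept = filter_large_margin(pairs, score_margin=lambda p: p[2])
print(kept)  # [('n1', 'e1', 7), ('n3', 'e3', 6)]
```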

Paralinguistic edits such as breathing, laughter, and filled pauses are trained using a semi-synthetic approach on top of NVSpeech, building quadruplets where the target is original audio and the input is a cloned version with tags removed. This supplies time-domain supervision for paralinguistic tags.

Reinforcement learning data combines human pairwise preferences and an automatic comprehension model. Annotators rate 20 candidates per prompt on correctness, prosody, and naturalness, and pairs with margin > 3 are retained. The comprehension model scores emotion and speaking style on a 1 to 10 scale and selects pairs with margin > 8.
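Both selection rules amount to forming chosen/rejected pairs from scored candidates and keeping those whose score gap exceeds a threshold (> 3 for human ratings, > 8 for the comprehension model). A minimal sketch, with data shapes assumed:

```python
# Sketch of assembling RL preference pairs from scored candidates.
# A pair is kept only when the score gap exceeds the given margin.

from itertools import combinations

def preference_pairs(candidates, min_margin):
    """candidates: list of (audio_id, score); returns (chosen, rejected) id pairs."""
    pairs = []
    for a, b in combinations(candidates, 2):
        if abs(a[1] - b[1]) > min_margin:
            chosen, rejected = (a, b) if a[1] > b[1] else (b, a)
            pairs.append((chosen[0], rejected[0]))
    return pairs

# human ratings use margin > 3; the comprehension model would use margin > 8
human_scored = [("c1", 9), ("c2", 5), ("c3", 2)]
print(preference_pairs(human_scored, min_margin=3))
# [('c1', 'c2'), ('c1', 'c3')]
```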

Training pipeline: SFT then PPO

Post-training proceeds in two stages: supervised fine-tuning (SFT) followed by proximal policy optimization (PPO).

During SFT, the team frames zero-shot TTS and editing tasks in a unified chat format. For TTS tasks the prompt encodes a waveform as dual-codebook tokens, inserted into the system prompt as speaker info, and the user message is the target text. For editing tasks the user message contains original audio tokens and a natural language instruction; the model outputs edited audio tokens.
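The unified chat framing can be sketched as two prompt builders. The field names and message layout below are assumptions for illustration, not the released prompt template.

```python
# Illustrative chat-format construction for the two SFT task types.
# Keys like "speaker_audio_tokens" and "instruction" are assumed names.

def build_tts_prompt(speaker_tokens, target_text):
    # Zero-shot TTS: prompt audio tokens go into the system message as
    # speaker info; the user message carries the target text.
    return [
        {"role": "system", "content": {"speaker_audio_tokens": speaker_tokens}},
        {"role": "user", "content": target_text},
    ]

def build_edit_prompt(original_tokens, instruction):
    # Editing: the user message holds the original audio tokens plus a
    # natural-language instruction; the model answers with edited tokens.
    return [
        {"role": "user",
         "content": {"audio_tokens": original_tokens, "instruction": instruction}},
    ]

tts = build_tts_prompt([12, 2048, 13, 2049], "Hello there")
edit = build_edit_prompt([12, 2048], "Make it sound happier")
print(tts[0]["role"], edit[0]["content"]["instruction"])
# system Make it sound happier
```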

A 3B reward model initialized from the SFT checkpoint is trained on large-margin preference pairs with a Bradley-Terry loss. Critically, reward is computed directly on dual-codebook token sequences without decoding to waveform. PPO training uses the token-level reward, a clip threshold, and a KL penalty to balance instruction following against staying close to the SFT policy.
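The Bradley-Terry objective for a preference pair is the negative log-sigmoid of the reward gap between the chosen and rejected sequences. A minimal pure-Python version, operating on scalar reward-model scores:

```python
# Minimal Bradley-Terry loss over reward scores for (chosen, rejected) pairs:
# loss = -log sigmoid(r_chosen - r_rejected), averaged over the batch.

import math

def bradley_terry_loss(reward_chosen, reward_rejected):
    losses = [
        -math.log(1.0 / (1.0 + math.exp(-(rc - rr))))
        for rc, rr in zip(reward_chosen, reward_rejected)
    ]
    return sum(losses) / len(losses)

# a well-separated pair yields a small loss; a reversed pair a large one
print(round(bradley_terry_loss([2.0], [-2.0]), 4))  # 0.0181
print(round(bradley_terry_loss([-2.0], [2.0]), 4))  # 4.0181
```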

Evaluation with Step-Audio-Edit-Test and iterative editing

The team created Step-Audio-Edit-Test, using Gemini 2.5 Pro as an LLM judge to evaluate emotion, speaking style, and paralinguistic accuracy. The benchmark covers 8 speakers from Wenet Speech4TTS, GLOBE V2, and Libri Light, includes multiple languages, and tests multiple emotion, style, and paralinguistic categories with dozens of prompts per label.

Editing is measured iteratively: iteration 0 is the initial zero-shot clone, then three rounds of text instructions are applied. In Chinese, emotion accuracy rose from 57.0 at iteration 0 to 77.7 at iteration 3, and speaking-style accuracy improved from 41.6 to 69.2; English showed comparable gains. An ablation that reuses the same prompt audio across iterations still yields improvements, supporting the large-margin learning hypothesis.
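The iterative protocol is a simple feedback loop: clone once, then repeatedly feed the previous output back with a new instruction. In the sketch below, `clone` and `edit` are stand-ins for the model calls, not a real API.

```python
# Sketch of the iterative editing protocol: iteration 0 is a zero-shot clone;
# each later iteration edits the previous output with a text instruction.

def iterative_edit(clone, edit, prompt_audio, text, instructions):
    audio = clone(prompt_audio, text)   # iteration 0: zero-shot clone
    history = [audio]
    for instruction in instructions:    # iterations 1..N
        audio = edit(audio, instruction)
        history.append(audio)
    return history

# toy stand-ins that just record the call chain
hist = iterative_edit(
    clone=lambda p, t: f"clone({t})",
    edit=lambda a, i: f"edit({a}, {i})",
    prompt_audio="ref.wav",
    text="hello",
    instructions=["sound happier", "add a laugh"],
)
print(hist[-1])  # edit(edit(clone(hello), sound happier), add a laugh)
```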

The model can also post-process outputs from closed-source TTS systems such as GPT-4o mini TTS, ElevenLabs v2, Doubao Seed TTS 2.0, and MiniMax speech-2.6-hd. One editing iteration with Step-Audio-EditX improved emotion and style accuracy for every tested system, and further iterations continued to help. Paralinguistic scores rose from about 1.91 at iteration 0 to 2.89 after a single edit, matching strong commercial baselines in many cases.

Practical impact and open source release

Step-Audio-EditX demonstrates that treating speech as discrete tokens and training with large-margin synthetic data can deliver fine-grained, iterative control over emotion, style, and paralinguistic cues without heavyweight disentangling architectures. The full stack, including code and model weights, is available open source for developers to experiment with editing, TTS, and post-processing for other systems.
