USE-DDP: Dual-Branch Encoder–Decoder Cleans Speech Using Only Unpaired Data
What problem does USE-DDP solve?
Most learning-based speech enhancement systems rely on paired clean–noisy recordings, which are expensive or infeasible to collect at scale in realistic environments. USE-DDP (Unsupervised Speech Enhancement using Data-defined Priors) asks whether a model can learn to split a noisy recording into clean speech and residual noise using only unpaired datasets: a clean-speech corpus and an optional noise corpus. The team from Brno University of Technology and Johns Hopkins University answers with a dual-branch encoder–decoder and an adversarial prior setup that learns solely from these unpaired data.
Model architecture
The generator resembles a neural audio codec. An encoder compresses the input waveform into a latent sequence. That latent is split into two parallel transformer branches (based on RoFormer): one branch targets clean speech, the other targets noise. A single shared decoder reconstructs two waveforms from the two branches: an estimated clean speech signal and an estimated residual noise waveform. The design enforces that the two outputs sum back to the input (with scalar compensation factors α and β to correct amplitude mismatches), turning enhancement into an explicit two-source estimation problem.
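To make the two-source setup concrete, here is a minimal PyTorch sketch of a dual-branch generator of this kind. The plain convolutional encoder/decoder stand-ins, layer sizes, and module names are illustrative assumptions, not the paper's exact architecture (which uses a codec-style encoder/decoder and RoFormer branches).

```python
# Minimal sketch of a dual-branch generator in the spirit of USE-DDP. The plain
# Conv1d encoder/decoder stand in for the codec-style modules; names, sizes, and
# the vanilla Transformer branches (no rotary embeddings) are assumptions.
import torch
import torch.nn as nn

class DualBranchEnhancer(nn.Module):
    def __init__(self, latent_dim=512, n_heads=8, n_layers=4):
        super().__init__()
        self.encoder = nn.Conv1d(1, latent_dim, kernel_size=16, stride=8, padding=4)
        self.decoder = nn.ConvTranspose1d(latent_dim, 1, kernel_size=16, stride=8, padding=4)

        def branch():
            return nn.TransformerEncoder(
                nn.TransformerEncoderLayer(latent_dim, n_heads, batch_first=True),
                num_layers=n_layers)

        self.speech_branch = branch()   # targets clean speech
        self.noise_branch = branch()    # targets residual noise
        # Scalar compensation factors for amplitude mismatches (alpha, beta).
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x):                     # x: (batch, 1, samples)
        z = self.encoder(x).transpose(1, 2)   # (batch, frames, latent_dim)
        z_speech = self.speech_branch(z).transpose(1, 2)
        z_noise = self.noise_branch(z).transpose(1, 2)
        s_hat = self.decoder(z_speech)        # estimated clean speech
        n_hat = self.decoder(z_noise)         # estimated residual noise (shared decoder)
        x_hat = self.alpha * s_hat + self.beta * n_hat  # should reconstruct the input
        return s_hat, n_hat, x_hat
```

The last line of forward is the structural point: enhancement becomes estimation of two sources whose scaled sum must reproduce the observed mixture.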
Training losses and adversarial priors
Reconstruction consistency is central: the sum of the estimated speech and noise must match the observed mixture. Reconstruction losses include multi-scale mel and STFT objectives plus SI-SDR, similar to losses used in neural audio codecs.
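A hedged sketch of two such terms, a negative SI-SDR loss and the mixture-consistency (sum-to-input) loss; the multi-scale mel/STFT objectives and the paper's loss weights are omitted, and the exact formulation is an assumption rather than the authors' code.

```python
# Sketch of two reconstruction terms: negative SI-SDR and mixture consistency.
import torch
import torch.nn.functional as F

def si_sdr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SDR between estimated and reference waveforms."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to get the scale-invariant target.
    target = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps) * ref
    residual = est - target
    si_sdr = 10 * torch.log10((target.pow(2).sum(-1) + eps) / (residual.pow(2).sum(-1) + eps))
    return -si_sdr.mean()

def consistency_loss(x_hat, x):
    """Keep the scaled sum of estimated speech and noise close to the observed mixture."""
    return F.l1_loss(x_hat, x)
```

In the unpaired setting there is no clean reference, so these reconstruction terms compare the reconstructed mixture x_hat against the observed input x.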
Priors are imposed via three discriminator ensembles: clean, noise, and noisy. Each ensemble enforces distributional constraints through LS-GAN and feature-matching losses. The clean branch must produce outputs that resemble samples from the clean-speech corpus; the noise branch must match a noise corpus; and the reconstructed mixture must sound natural. This keeps training purely data-driven and avoids building an external quality-metric predictor into the optimization objective.
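The adversarial terms themselves follow standard LS-GAN and feature-matching formulations, as commonly used for GAN-trained audio codecs and vocoders. A generic sketch (how the three ensembles split and weight these terms is specific to the paper and not reproduced here):

```python
# Generic LS-GAN and feature-matching losses; how USE-DDP's clean/noise/noisy
# ensembles apply and weight them follows the paper, not this sketch.
import torch
import torch.nn.functional as F

def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss: push real scores to 1, fake scores to 0."""
    return ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()

def lsgan_g_loss(d_fake):
    """Least-squares generator loss: push scores on generated audio toward 1."""
    return ((d_fake - 1) ** 2).mean()

def feature_matching_loss(feats_real, feats_fake):
    """L1 distance between intermediate discriminator features on real vs. generated audio."""
    return sum(F.l1_loss(f, r.detach()) for f, r in zip(feats_fake, feats_real))
```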
Initialization and practical choices
Initializing the encoder and decoder from a pretrained Descript Audio Codec substantially improves convergence and final enhancement quality compared with training from scratch. This pragmatic transfer reduces training instability and speeds up learning.
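A hedged sketch of what that warm start could look like with the open-source descript-audio-codec package; the download/load calls follow that package's README, while the way USE-DDP actually wires the pretrained weights into its generator is an assumption here.

```python
# Warm-start sketch using a pretrained Descript Audio Codec. The exact recipe in
# USE-DDP may differ; treat the wiring below as an assumption.
import dac

codec_path = dac.utils.download(model_type="16khz")  # pretrained 16 kHz codec weights
codec = dac.DAC.load(codec_path)

# Reuse the pretrained encoder/decoder as the generator's waveform front end and
# back end; the two branch transformers would still be trained from scratch.
pretrained_encoder = codec.encoder
pretrained_decoder = codec.decoder
```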
Results and comparisons
On the VCTK+DEMAND simulated benchmark, USE-DDP reports parity with strong unsupervised baselines such as unSE and unSE+ (optimal-transport-based approaches). It achieves competitive DNSMOS compared with MetricGAN-U, even though MetricGAN-U directly optimizes DNSMOS. Representative figures from the paper: DNSMOS goes from 2.54 (noisy input) to about 3.03 with USE-DDP, and PESQ improves from around 1.97 to roughly 2.47. CBAK scores lag some baselines because USE-DDP applies stronger attenuation in non-speech segments, a byproduct of the explicit noise prior.
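For readers who want to sanity-check such numbers on their own outputs, wideband PESQ can be computed with the open-source pesq package (an intrusive metric, so it needs the clean reference); DNSMOS is non-intrusive and relies on Microsoft's pretrained models, so it is not shown here. File paths below are hypothetical.

```python
# Minimal sketch of scoring an enhanced file with wideband PESQ against its
# clean reference. File paths are hypothetical examples.
import soundfile as sf
from pesq import pesq

ref, fs = sf.read("clean/p232_001.wav")      # clean reference (must be 16 kHz for 'wb')
deg, _ = sf.read("enhanced/p232_001.wav")    # enhanced output to score

print(f"PESQ (wb): {pesq(fs, ref, deg, 'wb'):.2f}")
```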
The priors matter — a central finding
A key takeaway is that the choice of the clean-speech prior materially affects outcomes. Using an in-domain clean corpus (e.g., VCTK) on VCTK+DEMAND mixtures yields the best numbers (DNSMOS ≈ 3.03), but that configuration partially "peeks" at the target distribution used to synthesize the mixtures and can overstate performance on simulated tests. An out-of-domain clean prior leads to lower scores (PESQ dropping to ~2.04 in some runs) and some noise leakage into the clean branch.
On real-world data (CHiME-3), using a close-talk channel as the in-domain clean prior actually degraded performance because of environmental bleed in that reference; a truly clean out-of-domain corpus instead improved DNSMOS/UTMOS on the dev and test sets, albeit with trade-offs in intelligibility under aggressive suppression. These observations help explain why previously reported unsupervised results vary, and they underscore the need for transparent reporting of prior selection when claiming state of the art on simulated benchmarks.
Why this matters
USE-DDP reframes enhancement as explicit two-source estimation constrained by data-defined priors, instead of directly chasing a particular evaluation metric. The reconstruction constraint plus adversarial priors provides a clear inductive bias, and codec-based initialization is a practical stabilization strategy. The approach is competitive with unsupervised baselines while keeping training purely data-driven, but reported gains depend strongly on which clean-speech corpus is chosen as the prior.
Read the full paper at https://arxiv.org/pdf/2509.22942. The authors also provide code and tutorials on their project pages linked in the paper.