NVIDIA Unveils Audio-SDS: Revolutionizing Audio Synthesis and Source Separation with Diffusion Models

NVIDIA and MIT introduce Audio-SDS, a novel framework applying Score Distillation Sampling to audio diffusion models, enabling diverse audio tasks like synthesis and source separation without specialized datasets.

Bridging Generative Models with Parametric Audio Synthesis

Audio diffusion models have made significant strides in generating high-quality speech, music, and Foley sounds. However, their primary strength lies in sample generation rather than optimizing parameters that control sound characteristics. Tasks such as physically informed impact sound creation or prompt-guided source separation require models capable of adjusting explicit and interpretable parameters within structural constraints. Score Distillation Sampling (SDS), a technique that has advanced text-to-3D and image editing by backpropagating through pretrained diffusion priors, had not been applied to audio until now. By adapting SDS to audio diffusion, researchers can optimize parametric audio representations without the need for large, task-specific datasets, effectively merging modern generative models with classical parameterized synthesis workflows.
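To make the idea concrete, the sketch below shows what a single SDS-style update could look like for a differentiable audio synthesizer in PyTorch. The `render` and `predict_noise` callables and the toy linear noise schedule are illustrative placeholders standing in for a parametric synthesizer and a pretrained text-conditioned audio diffusion model; this is not the exact Audio-SDS implementation.

```python
import torch

def sds_step(params, render, predict_noise, text_emb, optimizer,
             num_train_steps=1000):
    """One SDS-style update: render audio from params, add noise, and nudge
    the parameters so the pretrained diffusion prior scores the render higher."""
    audio = render(params)                          # differentiable synthesis
    t = torch.randint(1, num_train_steps, (1,))     # random diffusion timestep
    alpha = 1.0 - t.float() / num_train_steps       # toy linear noise schedule
    noise = torch.randn_like(audio)
    noisy = alpha.sqrt() * audio + (1.0 - alpha).sqrt() * noise

    with torch.no_grad():                           # no gradients through the prior
        eps_hat = predict_noise(noisy, t, text_emb)

    # SDS treats (eps_hat - noise) as the gradient w.r.t. the rendered audio;
    # the surrogate loss below has exactly that gradient.
    grad = (eps_hat - noise).detach()
    loss = (grad * audio).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the rendered audio requires gradients, the pretrained diffusion model itself stays frozen throughout the optimization.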

Leveraging Classic Audio Techniques and Diffusion Priors

Traditional audio methods like frequency modulation (FM) synthesis—which utilizes operator-modulated oscillators to craft rich timbres—and physically grounded impact sound simulators offer compact and interpretable parameter spaces. Meanwhile, source separation has evolved from matrix factorization to neural and text-guided approaches that isolate elements such as vocals or instruments. Integrating SDS updates with pretrained audio diffusion models enables the optimization of FM parameters, impact-sound simulators, or separation masks directly from high-level prompts. This approach combines the interpretability of signal processing with the flexibility of modern diffusion-based generation.
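As a rough illustration of how compact such a parameter space can be, here is a minimal differentiable two-operator FM synthesizer in PyTorch. The parameter names and default values are assumptions for illustration, not the operator configuration used in the paper.

```python
import torch

def fm_synth(carrier_hz, mod_hz, mod_index, amp,
             duration=1.0, sample_rate=16_000):
    """One modulator operator driving one carrier: y(t) = A sin(2*pi*fc*t + I*sin(2*pi*fm*t))."""
    t = torch.arange(int(duration * sample_rate)) / sample_rate
    modulator = torch.sin(2 * torch.pi * mod_hz * t)
    return amp * torch.sin(2 * torch.pi * carrier_hz * t + mod_index * modulator)

# Because the synthesizer is written in PyTorch, gradients from an SDS-style
# loss can flow back into these four scalar parameters.
params = {k: torch.tensor(v, requires_grad=True)
          for k, v in dict(carrier_hz=440.0, mod_hz=220.0,
                           mod_index=2.0, amp=0.5).items()}
audio = fm_synth(**params)
```

Four interpretable scalars fully determine the timbre here, which is exactly the kind of compact representation a distilled diffusion prior can steer from a text prompt.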

Introducing Audio-SDS: A Unified Framework

Researchers from NVIDIA and MIT have introduced Audio-SDS, an extension of SDS tailored for text-conditioned audio diffusion models. Audio-SDS uses a single pretrained model to tackle multiple audio tasks without relying on specialized datasets. By distilling generative priors into parametric audio representations, it facilitates tasks like impact sound simulation, FM synthesis parameter tuning, and source separation. The framework merges data-driven priors with explicit parameter control, producing perceptually convincing audio outputs. Key enhancements include a stable decoder-based SDS, multistep denoising, and a multiscale spectrogram method that improves high-frequency detail and realism.
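One way to picture the multiscale spectrogram component is a loss that compares waveforms at several STFT resolutions, so that both coarse structure and fine high-frequency detail are penalized. The sketch below is a common multiresolution spectrogram distance; the FFT sizes and equal weighting are assumptions rather than the paper's settings.

```python
import torch

def multiscale_spec_distance(x, y, fft_sizes=(256, 512, 1024, 2048)):
    """L1 distance between magnitude spectrograms at several STFT resolutions."""
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=x.device)
        spec_x = torch.stft(x, n_fft, hop_length=n_fft // 4,
                            window=window, return_complex=True).abs()
        spec_y = torch.stft(y, n_fft, hop_length=n_fft // 4,
                            window=window, return_complex=True).abs()
        loss = loss + (spec_x - spec_y).abs().mean()
    return loss
```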

Technical Innovations and Applications

The study applies SDS, originally introduced in DreamFusion, to audio diffusion models. A differentiable rendering function produces stereo audio from the parameters being optimized, and the SDS gradient is computed on the decoded audio rather than backpropagated through the latent encoder, which improves stability. Three main refinements strengthen the method: bypassing encoder gradients to avoid instability, emphasizing multiscale spectrogram features to recover high-frequency detail, and applying multistep denoising for more stable updates. Audio-SDS demonstrates this versatility across FM synthesis, impact sound generation, and source separation, adapting to each domain without retraining while keeping the synthesized audio closely aligned with the text prompt.
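The multistep denoising idea can be sketched as a short deterministic, DDIM-style trajectory that yields a cleaner target than a single noise prediction. In the sketch below, `predict_noise` and the linear schedule are placeholders and the step count is an assumption, not the Audio-SDS procedure.

```python
import torch

def multistep_denoise(noisy, t_start, predict_noise, text_emb,
                      n_steps=4, num_train_steps=1000):
    """Run a few DDIM-like steps from timestep t_start (an int) down to 0."""
    x = noisy
    ts = torch.linspace(t_start, 0, n_steps + 1).long()
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        a_cur = 1.0 - t_cur.float() / num_train_steps    # toy schedule
        a_next = 1.0 - t_next.float() / num_train_steps
        eps = predict_noise(x, t_cur, text_emb)
        x0 = (x - (1.0 - a_cur).sqrt() * eps) / a_cur.sqrt()   # predicted clean audio
        x = a_next.sqrt() * x0 + (1.0 - a_next).sqrt() * eps   # DDIM-style step
    return x  # multistep estimate of the clean audio
```

The rendered audio can then be pulled toward this multistep estimate, for example in multiscale spectrogram space, rather than relying on a single noisy gradient direction.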

Performance Evaluation

Audio-SDS's performance was evaluated on FM synthesis, impact synthesis, and source separation tasks. Experiments employed subjective listening tests and objective metrics like the CLAP score, distance to ground truth, and Signal-to-Distortion Ratio (SDR). Pretrained models such as the Stable Audio Open checkpoint were utilized. Results indicate significant improvements in both audio synthesis and source separation, with clear alignment to the given text prompts.
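For reference, Signal-to-Distortion Ratio in its plain form can be computed as below; published source-separation evaluations often use scale-invariant or BSSEval variants instead, so this is only an illustrative baseline.

```python
import torch

def sdr_db(estimate, reference, eps=1e-8):
    """SDR = 10 * log10(||reference||^2 / ||reference - estimate||^2), in dB."""
    signal_power = (reference ** 2).sum()
    distortion_power = ((reference - estimate) ** 2).sum() + eps
    return 10.0 * torch.log10(signal_power / distortion_power)
```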

Audio-SDS represents a promising direction in audio generation, unifying data-driven generative priors with explicit, interpretable parameter control. This eliminates the dependency on extensive, domain-specific datasets and opens new avenues for multimodal audio research.
