AI Text-to-Speech Can Now 'Unlearn' Voices to Combat Audio Deepfakes
New AI techniques enable text-to-speech models to 'unlearn' specific voices, drastically reducing the risk of audio deepfakes and voice cloning scams while maintaining overall performance.
The Rise of Audio Deepfakes and Voice Cloning
Recent breakthroughs in AI have enabled text-to-speech systems to replicate voices with striking realism, mimicking natural intonations and speech patterns. This technology allows anyone’s voice to be reproduced from just a few seconds of audio, raising concerns over misuse in scams, disinformation, and harassment.
Introducing Machine Unlearning for Speech
A novel approach called "machine unlearning" aims to teach AI models to forget specific voices, effectively preventing the AI from replicating them. The technique not only removes a particular voice's influence from the model but can also stop the model from mimicking voices it was never trained on.
How Machine Unlearning Works
AI companies have traditionally relied on guardrails to prevent misuse, filtering inputs and outputs for disallowed content. Machine unlearning takes a different approach: it modifies the model itself, removing the influence of specific training data so that the resulting model behaves as if it had never learned that data.
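The distinction can be illustrated with a deliberately simplified sketch. Here the "model" is just a table of speaker embeddings standing in for real weights, and `guardrail_synthesize` and `unlearn` are hypothetical names, not the actual VoiceBox API: a guardrail refuses the request while the capability survives in the weights, whereas unlearning changes the model itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a table of learned speaker embeddings, standing in for real weights.
model = {f"speaker_{i}": rng.normal(size=8) for i in range(4)}

def guardrail_synthesize(model, speaker, blocklist):
    """Guardrail approach: the model still encodes the voice; a filter merely
    refuses the request, so the capability remains in the weights."""
    if speaker in blocklist:
        raise PermissionError(f"{speaker} is blocked")
    return model[speaker]

def unlearn(model, speaker, rng):
    """Unlearning approach: produce a new model whose parameters no longer
    encode the voice. Here the embedding is overwritten with fresh noise, so
    the new model behaves as if it had never learned that speaker."""
    forgotten = dict(model)
    forgotten[speaker] = rng.normal(size=8)
    return forgotten

forgotten_model = unlearn(model, "speaker_0", rng)
```

The practical consequence: a guardrail can be bypassed (for example, by stealing the underlying model), while an unlearned model simply no longer contains the voice.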
Challenges with Zero-Shot Text-to-Speech Models
Modern text-to-speech models operate in "zero-shot" mode, meaning they can mimic voices outside their training data given only a small voice sample. Unlearning must therefore suppress a forgotten voice's influence without degrading the model's ability to mimic other, permitted voices.
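A minimal sketch of the zero-shot interface shows why this is hard. The names `speaker_embedding` and `zero_shot_tts` are hypothetical stand-ins: the point is that generation is conditioned on a reference clip supplied at inference time, so there is no stored "voice record" to delete. Unlearning has to change how the model responds to arbitrary reference clips.

```python
import numpy as np

def speaker_embedding(reference_audio):
    """Toy speaker encoder: compress a waveform into a fixed-size 'voice
    fingerprint'. A real zero-shot model learns this; a mean is a stand-in."""
    return reference_audio.reshape(-1, 4).mean(axis=0)

def zero_shot_tts(text, reference_audio):
    """Generation is conditioned on the reference clip's embedding, which is
    how the model can imitate voices absent from its training set."""
    emb = speaker_embedding(reference_audio)
    return {"voice": emb, "n_tokens": len(text.split())}

clip = np.linspace(-1.0, 1.0, 64)   # stand-in for a few seconds of audio
out = zero_shot_tts("hello there", clip)
```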
Demonstration with VoiceBox Model
Researchers at Sungkyunkwan University applied machine unlearning to a recreation of Meta's VoiceBox model. When prompted to reproduce a "forgotten" voice, the model instead responds with a random voice it generates itself. The method reduces similarity to the forgotten voice by more than 75%, while degrading performance on permitted voices by only about 2.8%.
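A similarity reduction like this is typically measured by comparing speaker embeddings of the generated audio against the target voice. The sketch below uses random vectors and plain cosine similarity as an illustration only; the researchers' actual evaluation pipeline and embedding model are not specified here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Speaker-verification-style similarity between two voice embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
target = rng.normal(size=16)                  # embedding of the forgotten voice
before = target + 0.1 * rng.normal(size=16)   # pre-unlearning output: a close clone
after = rng.normal(size=16)                   # post-unlearning output: a random voice

sim_before = cosine_similarity(before, target)
sim_after = cosine_similarity(after, target)
reduction = 1.0 - sim_after / sim_before      # fraction of similarity removed
```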
Practical Considerations and Limitations
The unlearning process can take several days per voice and requires approximately five minutes of audio per speaker to be forgotten. The technique replaces forgotten voice data with high randomness to prevent reverse engineering. However, there is an inherent trade-off between the model’s forgetfulness and its overall usability.
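The "replace with randomness" idea can be sketched as fine-tuning: nudge the model so that the forgotten speaker's conditioning input maps to a randomly chosen voice. This toy linear model and the `unlearning_step` function are assumptions for illustration, not the researchers' actual procedure, which also has to preserve quality on permitted voices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear voice mapper: weights W turn a speaker code into an output voice vector.
W = rng.normal(size=(8, 8))
forget_code = rng.normal(size=8)     # conditioning code of the voice to forget
random_target = rng.normal(size=8)   # the random voice to produce instead

def unlearning_step(W, x, y, lr=0.05):
    """One gradient step on 0.5 * ||W @ x - y||**2, nudging W(x) toward y.
    A full method would add a 'retain' loss so permitted voices stay intact."""
    err = W @ x - y
    return W - lr * np.outer(err, x)

for _ in range(500):
    W = unlearning_step(W, forget_code, random_target)
# After fine-tuning, the forgotten speaker's code yields the random voice.
```

The trade-off mentioned above shows up directly in such an objective: pushing harder on the forget term disturbs the weights more, which is what erodes quality on everything else.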
Future Prospects
Although still in early stages, machine unlearning shows promise for real-world deployment to combat voice-based fraud and abuse. Researchers are actively seeking faster and more scalable solutions to make voice unlearning practical for widespread use.