AI Text-to-Speech Can Now 'Unlearn' Voices to Combat Audio Deepfakes
New AI techniques enable text-to-speech models to 'unlearn' specific voices, drastically reducing the risk of audio deepfakes and voice cloning scams while maintaining overall performance.
The Rise of Audio Deepfakes and Voice Cloning
Recent breakthroughs in AI have enabled text-to-speech systems to replicate voices with striking realism, mimicking natural intonations and speech patterns. This technology allows anyone’s voice to be reproduced from just a few seconds of audio, raising concerns over misuse in scams, disinformation, and harassment.
Introducing Machine Unlearning for Speech
A novel approach called "machine unlearning" aims to teach AI models to forget specific voices, effectively preventing the AI from replicating them. The technique not only removes a particular voice's influence from the model but can also stop the model from mimicking voices it was never trained on.
How Machine Unlearning Works
AI companies have traditionally relied on guardrails to prevent misuse, filtering inputs and outputs for disallowed content. Machine unlearning takes a different approach: it modifies the model itself, removing the influence of specific training data so that the resulting model behaves as if it had never learned that data.
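The distinction can be illustrated with a deliberately simplified sketch. Here the "model" is just a table of speaker embeddings standing in for real weights, and `guardrail_synthesize` and `unlearn` are hypothetical names, not the actual VoiceBox API: a guardrail refuses the request while the capability survives in the weights, whereas unlearning changes the model itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a table of learned speaker embeddings, standing in for real weights.
model = {f"speaker_{i}": rng.normal(size=8) for i in range(4)}

def guardrail_synthesize(model, speaker, blocklist):
    """Guardrail approach: the model still encodes the voice; a filter merely
    refuses the request, so the capability remains in the weights."""
    if speaker in blocklist:
        raise PermissionError(f"{speaker} is blocked")
    return model[speaker]

def unlearn(model, speaker, rng):
    """Unlearning approach: produce a new model whose parameters no longer
    encode the voice. Here the embedding is overwritten with fresh noise, so
    the new model behaves as if it had never learned that speaker."""
    forgotten = dict(model)
    forgotten[speaker] = rng.normal(size=8)
    return forgotten

forgotten_model = unlearn(model, "speaker_0", rng)
```

The practical consequence: a guardrail can be bypassed (for example, by stealing the underlying model), while an unlearned model simply no longer contains the voice.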
Challenges with Zero-Shot Text-to-Speech Models
Modern text-to-speech models operate in "zero-shot" mode, meaning they can mimic voices outside their training data given only a small voice sample. Unlearning must therefore suppress a forgotten voice's influence without degrading the model's ability to mimic other, permitted voices.
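A minimal sketch of the zero-shot interface shows why this is hard. The names `speaker_embedding` and `zero_shot_tts` are hypothetical stand-ins: the point is that generation is conditioned on a reference clip supplied at inference time, so there is no stored "voice record" to delete. Unlearning has to change how the model responds to arbitrary reference clips.

```python
import numpy as np

def speaker_embedding(reference_audio):
    """Toy speaker encoder: compress a waveform into a fixed-size 'voice
    fingerprint'. A real zero-shot model learns this; a mean is a stand-in."""
    return reference_audio.reshape(-1, 4).mean(axis=0)

def zero_shot_tts(text, reference_audio):
    """Generation is conditioned on the reference clip's embedding, which is
    how the model can imitate voices absent from its training set."""
    emb = speaker_embedding(reference_audio)
    return {"voice": emb, "n_tokens": len(text.split())}

clip = np.linspace(-1.0, 1.0, 64)   # stand-in for a few seconds of audio
out = zero_shot_tts("hello there", clip)
```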
Demonstration with VoiceBox Model
Researchers at Sungkyunkwan University applied machine unlearning to a recreation of Meta's VoiceBox model. When prompted to reproduce a "forgotten" voice, the model instead responds with a random voice it generates itself. The method reduces similarity to the forgotten voice by more than 75%, while degrading performance on permitted voices by only about 2.8%.
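A similarity reduction like this is typically measured by comparing speaker embeddings of the generated audio against the target voice. The sketch below uses random vectors and plain cosine similarity as an illustration only; the researchers' actual evaluation pipeline and embedding model are not specified here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Speaker-verification-style similarity between two voice embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
target = rng.normal(size=16)                  # embedding of the forgotten voice
before = target + 0.1 * rng.normal(size=16)   # pre-unlearning output: a close clone
after = rng.normal(size=16)                   # post-unlearning output: a random voice

sim_before = cosine_similarity(before, target)
sim_after = cosine_similarity(after, target)
reduction = 1.0 - sim_after / sim_before      # fraction of similarity removed
```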
Practical Considerations and Limitations
The unlearning process can take several days per voice and requires approximately five minutes of audio per speaker to be forgotten. The technique replaces forgotten voice data with high randomness to prevent reverse engineering. However, there is an inherent trade-off between the model’s forgetfulness and its overall usability.
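The "replace with randomness" idea can be sketched as fine-tuning: nudge the model so that the forgotten speaker's conditioning input maps to a randomly chosen voice. This toy linear model and the `unlearning_step` function are assumptions for illustration, not the researchers' actual procedure, which also has to preserve quality on permitted voices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear voice mapper: weights W turn a speaker code into an output voice vector.
W = rng.normal(size=(8, 8))
forget_code = rng.normal(size=8)     # conditioning code of the voice to forget
random_target = rng.normal(size=8)   # the random voice to produce instead

def unlearning_step(W, x, y, lr=0.05):
    """One gradient step on 0.5 * ||W @ x - y||**2, nudging W(x) toward y.
    A full method would add a 'retain' loss so permitted voices stay intact."""
    err = W @ x - y
    return W - lr * np.outer(err, x)

for _ in range(500):
    W = unlearning_step(W, forget_code, random_target)
# After fine-tuning, the forgotten speaker's code yields the random voice.
```

The trade-off mentioned above shows up directly in such an objective: pushing harder on the forget term disturbs the weights more, which is what erodes quality on everything else.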
Future Prospects
Although still in early stages, machine unlearning shows promise for real-world deployment to combat voice-based fraud and abuse. Researchers are actively seeking faster and more scalable solutions to make voice unlearning practical for widespread use.