Microsoft researchers have announced a new text-to-speech AI model called VALL-E that can closely imitate a human voice given just a 3-second audio sample. Once it has heard a particular voice, VALL-E can synthesize speech in that person’s voice while preserving the speaker’s emotional tone, Ars Technica reports.

The creators of VALL-E suggest that it can be used for high-quality text-to-speech applications and audio content creation in combination with other generative AI models such as GPT-3.

Microsoft calls VALL-E a “neural codec language model,” and it’s built on the EnCodec technology that Meta introduced in October 2022. Unlike other text-to-speech methods, which typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. Essentially, it analyzes how a person sounds, breaks that information down into discrete components (called “tokens”) using EnCodec, and draws on its training data to match what the AI “knows” about how that voice would sound speaking other phrases.
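The pipeline described above can be sketched in rough pseudocode form. Everything below is a toy stand-in: the encoder, the language model, and all function names are hypothetical illustrations of the idea (discretize a short speaker prompt into tokens, then generate new codec tokens conditioned on text plus prompt), not Microsoft’s actual components.

```python
import numpy as np

def encode_to_tokens(audio, codebook_size=1024, frame=320):
    """EnCodec-style step (crudely faked): split audio into frames and
    map each frame to a discrete token via energy quantization."""
    n = len(audio) // frame
    frames = audio[: n * frame].reshape(n, frame)
    energy = frames.mean(axis=1)
    # Normalize energies into [0, codebook_size) and round to token ids.
    scaled = (energy - energy.min()) / (np.ptp(energy) + 1e-9)
    return (scaled * (codebook_size - 1)).astype(int)

def generate_tokens(text, prompt_tokens, length=50):
    """Stand-in for the neural codec language model: a real model would
    attend over the text and the speaker-prompt tokens; here we merely
    sample from the prompt's token distribution to mimic conditioning."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.choice(prompt_tokens, size=length)

# A 3-second "speaker prompt" at 16 kHz (random noise as placeholder audio).
prompt_audio = np.random.default_rng(0).standard_normal(16_000 * 3)
prompt_tokens = encode_to_tokens(prompt_audio)
out_tokens = generate_tokens("Hello world", prompt_tokens)
print(len(prompt_tokens), len(out_tokens))
```

In a real system the generated tokens would be fed back through the codec’s decoder to produce a waveform; that final step is omitted here.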

On the VALL-E website, Microsoft provides dozens of audio examples of the AI model in action. The “Speaker Prompt” is the three seconds of audio provided to VALL-E, which it is supposed to emulate. The “Ground Truth” is a pre-existing recording of the same speaker saying a given phrase, included for comparison (like the “control” in an experiment). “Baseline” is an example of synthesis produced by a conventional text-to-speech method, and “VALL-E” is the output of the VALL-E model.


In addition to preserving the speaker’s vocal timbre and emotional tone, VALL-E can also simulate the “acoustic environment” of an audio sample. For example, if the sample came from a phone call, it will mimic the acoustic and frequency characteristics of the phone call. And Microsoft’s samples (under “Synthesis of diversity”) demonstrate that VALL-E can generate vocal tone variations by changing the random seed used in the generation process.

The researchers seem to be aware of the potential social harm this technology could cause, stating the following:

“Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models.”