Microsoft's Breakthrough in AI Speech Generation: VALL-E 2

Microsoft has made a major breakthrough in AI speech generation with its latest development, VALL-E 2. This text-to-speech (TTS) generator is so advanced that its creators believe it cannot be released to the public. The research paper, published on the pre-print server arXiv, describes VALL-E 2 as being capable of producing “accurate, natural speech in the exact voice of the original speaker, comparable to human performance.” In other words, the AI-generated voice is convincing enough to be mistaken for a real person.

Reaching “human parity” is a significant milestone in zero-shot text-to-speech synthesis, in which a system imitates a speaker it was never trained on from only a short sample of that speaker's voice. Microsoft researchers incorporated two key features into VALL-E 2 to reach this level of quality. The first, Repetition Aware Sampling, addresses the repeated sounds that can make AI-generated speech sound unnatural. When the model picks the next sound token, it takes into account how often that token has already appeared in the recent output, which stabilizes decoding and keeps the model from getting stuck in a loop. The second, Grouped Code Modeling, improves efficiency by organizing the audio codec codes into groups, which shortens the sequence the model has to process. This allows VALL-E 2 to generate speech more quickly and to handle longer strings of sounds.
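The two ideas can be illustrated with a small Python sketch. This is not Microsoft's implementation, which has not been released. The function names, the parameter values (top_p, window, threshold) and the padding scheme are assumptions chosen for illustration only. The sketch assumes a probability distribution over sound tokens at each decoding step: it first tries nucleus (top-p) sampling, and if the chosen token already dominates the recent history, it redraws from the full distribution instead. The last function shows how grouping shortens a code sequence.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample from the smallest set of most likely tokens whose
    cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in ranked:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, rng,
                            top_p=0.8, window=10, threshold=0.5):
    """Hypothetical sketch: nucleus sampling with a repetition check.

    top_p, window and threshold are illustrative values, not taken
    from the paper.
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        # Share of the recent window already occupied by this token.
        ratio = recent.count(token) / len(recent)
        if ratio > threshold:
            # Too repetitive: redraw from the full distribution so that
            # less likely tokens get a chance to break the loop.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


def group_codes(codes, group_size, pad=0):
    """Split a flat code sequence into consecutive groups, so a model
    handles len(codes) / group_size steps instead of len(codes).
    Padding the last group with `pad` is an assumption of this sketch."""
    codes = list(codes)
    remainder = len(codes) % group_size
    if remainder:
        codes += [pad] * (group_size - remainder)
    return [codes[i:i + group_size] for i in range(0, len(codes), group_size)]
```

With a group size of 2, for instance, a sequence of five codes becomes three groups, so the model takes three steps instead of five.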

To evaluate VALL-E 2, the researchers used audio samples from the LibriSpeech and VCTK speech libraries, along with ELLA-V, a related zero-shot TTS model. The results showed that VALL-E 2 outperformed previous zero-shot TTS systems in speech robustness, naturalness, and speaker similarity. The researchers describe it as the first system of its kind to reach human parity on these benchmarks.

Despite its impressive capabilities, Microsoft has no plans to release VALL-E 2 to the public. The decision is driven by concerns over potential misuse, particularly voice cloning and deepfakes. The company views VALL-E 2 as purely a research project and has no intention of incorporating it into any product or expanding public access. Guarding against risks such as spoofing voice identification or impersonating a specific speaker is a major factor in this decision.

However, the researchers suggest that AI speech technology could have practical applications in the future. They mention education, entertainment, journalism, self-authored content, accessibility features, interactive voice response systems, translation, and chatbots as areas where AI-generated speech could be used. They emphasize the importance of obtaining permission from speakers before using their voices and the need for a detection model to identify synthesized speech.

In conclusion, Microsoft’s VALL-E 2 represents a significant advancement in AI speech generation. Reaching human parity in accurate, natural-sounding speech is a remarkable result. While the technology will not be released to the public because of the risk of misuse, it opens up exciting possibilities for AI-generated speech in a range of fields and applications.


Written By

Jiri Bílek