Meta introduces Voicebox, a first in generative AI for speech

Meta introduces Voicebox, which produces high-quality audio clips in a wide variety of styles.

By: MEDHA JHA
| Updated on: Jun 17 2023, 23:59 IST

Voicebox can generalize to speech-generation tasks (Meta AI)

Meta AI researchers have moved a step forward in the field of generative AI for speech with the development of Voicebox. Unlike previous models, Voicebox can generalize to speech-generation tasks that it was not specifically trained for, demonstrating state-of-the-art performance.

Voicebox is a versatile generative system for speech that can produce high-quality audio clips in a wide variety of styles. It can create outputs from scratch or modify existing samples. The model supports speech synthesis in six languages, as well as noise removal, content editing, style conversion, and diverse sample generation.

Traditionally, generative AI models for speech required specific training for each task using carefully prepared training data. Voicebox instead builds on an approach called Flow Matching, which surpasses diffusion models in performance. On English text-to-speech, it outperforms the existing state-of-the-art model VALL-E, lowering the word error rate from 5.9% to 1.9% and raising audio similarity from 0.580 to 0.681, while also being up to 20 times faster. In cross-lingual style transfer, Voicebox surpasses YourTTS, reducing the word error rate from 10.9% to 5.2% and improving audio similarity from 0.335 to 0.481.
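For readers unfamiliar with the metric, word error rate is the number of word substitutions, deletions and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. For generated speech, the transcript typically comes from running a speech recognizer on the audio. The Python sketch below is only an illustration, not the evaluation code used by Meta; the function names and the sample sentences are made up for this example, and it takes 5.9% as the VALL-E figure and 1.9% as the Voicebox figure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical transcript that drops one of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Relative drop in word error rate when going from 5.9% to 1.9%.
    print(round(relative_reduction(5.9, 1.9), 3))
```

Going from 5.9% to 1.9% is a relative reduction of roughly two thirds, which is why the gap matters more than the raw four-point difference might suggest.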

One of the main limitations of existing speech synthesizers is that they rely on monotonic, clean data, which is difficult to produce and limited in quantity. Voicebox overcomes this limitation by leveraging the Flow Matching model's ability to learn non-deterministic mappings between text and speech. This allows Voicebox to learn from a diverse range of speech data without the need for meticulous labeling. The model was trained on over 50,000 hours of recorded speech and transcripts from public domain audiobooks in multiple languages.
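The article does not describe how Flow Matching is set up inside Voicebox, so the following is only a rough Python sketch of the general idea as it appears in the published Flow Matching method: a noise sample is moved along a straight-line path toward a data sample, and a model is trained to predict the velocity along that path. The function names, the `sigma_min` value and the toy two-number "audio feature" are assumptions chosen for illustration, and no actual neural network is involved.

```python
import random


def ot_path_sample(x0, x1, t, sigma_min=1e-5):
    """Point on the straight-line path from noise x0 toward data x1 at time t,
    together with the velocity a model would be trained to predict there."""
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, target_velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    squared = [(p - q) ** 2 for p, q in zip(predicted_velocity, target_velocity)]
    return sum(squared) / len(squared)


def make_training_example(x1, rng):
    """Draw Gaussian noise and a random time, then build one training pair."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    t = rng.random()
    x_t, target_velocity = ot_path_sample(x0, x1, t)
    return t, x_t, target_velocity


if __name__ == "__main__":
    rng = random.Random(0)
    data_sample = [0.5, 3.0]  # stand-in for a real audio feature vector
    t, x_t, target = make_training_example(data_sample, rng)
    # A model that predicts zero velocity everywhere incurs this loss.
    print(flow_matching_loss([0.0, 0.0], target))
```

Because the noise sample is drawn afresh each time, the same text can lead to many different valid outputs, which is one way to picture the non-deterministic mapping mentioned above.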

Voicebox can perform a variety of tasks, including:

1-In-context text-to-speech synthesis: Voicebox's versatility enables it to excel in various speech generation tasks. It can perform in-context text-to-speech synthesis by matching the audio style of a given input sample and using it for generating speech from text. This capability has potential applications in assisting people who are unable to speak or customizing voices for non-player characters and virtual assistants.

2-Cross-lingual style transfer: Voicebox demonstrates proficiency in cross-lingual style transfer. Given a sample of speech and a text passage in one of the supported languages (English, French, German, Spanish, Polish, or Portuguese), Voicebox can produce a reading of the text in that language. This feature has the potential to facilitate natural and authentic communication between individuals who speak different languages.

3-Speech denoising and editing: Voicebox also excels in speech denoising and editing tasks. Leveraging its in-context learning, the model can generate speech to seamlessly edit segments within audio recordings. It can replace misspoken words or synthesize portions corrupted by short-duration noise, without requiring the re-recording of the entire speech. This capability simplifies the process of cleaning up and editing audio recordings, similar to popular image-editing tools for adjusting photos.

4-Diverse sample generation: Voicebox's ability to learn from diverse, real-world data allows it to generate speech that better represents how people naturally communicate in the six supported languages. This capability can be leveraged to generate synthetic data for training speech assistant models. Models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech, with only a 1% degradation in error rate, compared with the significant degradation observed with synthetic speech from previous text-to-speech models.

While the researchers acknowledge the exciting use cases for generative speech models, they have decided not to make the Voicebox model or code publicly available at this time due to the potential risks of misuse. Responsible development and use of AI are paramount, and striking a balance between openness and responsibility is crucial. Instead, the researchers have shared audio samples and a research paper detailing the approach, results, and the creation of an effective classifier to distinguish between authentic speech and audio generated with Voicebox.
