Meta unveils speech-to-text, text-to-speech AI models for over 1,100 languages; even shares open source data

Meta has unveiled its speech-to-text, text-to-speech AI models for over 1,100 languages.

Updated on: May 23 2023, 20:30 IST
Meta says it will make the models open source, allowing developers to freely make new speech apps. (REUTERS)

Tech majors are locked in a fierce fight to deliver artificial intelligence (AI) boosted products to users. While everyone knows about OpenAI's ChatGPT and Google's Bard, very little had been available from Facebook co-founder Mark Zuckerberg's Meta Platforms. Till today, that is. Now, the company has launched speech-to-text and text-to-speech AI models for over 1,100 languages under its Massively Multilingual Speech (MMS) project, and the best part is that it is not linked to ChatGPT.

The biggest takeaway is that Meta has made the models and code open source, which could lead to a sharp rise in the number of speech apps created across the world.

If the models work as well in the real world, how useful they can be is clear from Meta's statement: "Existing speech recognition models only cover approximately 100 languages — a fraction of the 7,000+ known languages spoken on the planet."

Data Crunching

Now, good machine-learning models require large amounts of labeled data — in this case, many thousands of hours of audio, along with transcriptions. For most languages, this data simply does not exist.

However, Meta has overcome that through its MMS project, which combined wav2vec 2.0, its pioneering work in self-supervised learning, and a new dataset that provides labeled data for over 1,100 languages and unlabeled data for nearly 4,000 languages.

Patting itself on the back, Meta said in a statement, "Our results show that the Massively Multilingual Speech models outperform existing models and cover 10 times as many languages."

It added, "Today, we are publicly sharing our models and code so that others in the research community can build upon our work. Through this work, we hope to make a small contribution to preserve the incredible language diversity of the world."

How Meta did it

The MMS project's first job was to collect audio data for thousands of languages, but the largest existing speech datasets covered at most 100 languages. The challenge was overcome by "turning to religious texts, such as the Bible, that have been translated in many different languages and whose translations have been widely studied for text-based language translation research".

The MMS project even created a dataset of readings of the New Testament in over 1,100 languages.

Having sensed that the idea was good and that it could be milked for much more, the project also considered unlabeled recordings of various other Christian religious readings. This increased the number of languages available to over 4,000.

Bias, what bias?

Even though the data comes from a specific domain, bias does not appear to have entered the system. Although this text is often read by male speakers, Meta's analysis showed that its MMS models perform equally well for male and female voices.

And, importantly, though the content of the audio recordings is religious, MMS analysis shows that this does not overly bias the model to produce more religious language.

Meta credits this to its use of the Connectionist Temporal Classification (CTC) approach, which it describes as far more constrained than large language models (LLMs) or sequence-to-sequence models for speech recognition.
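
For readers curious about what CTC does when it turns audio into text, here is a minimal sketch in Python of the standard greedy CTC decoding rule: merge repeated per-frame labels, then drop the special "blank" label. This is an illustration only, not Meta's code. The function name, the `_` blank symbol and the sample frame sequence are all assumptions made for the example.

```python
from itertools import groupby

# Hypothetical blank symbol; real systems reserve a dedicated token index.
BLANK = "_"


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse a per-frame label sequence into text using the CTC rule."""
    # Step 1: merge runs of identical consecutive labels into one label.
    collapsed = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove blanks, which only separate genuine repeated characters.
    return "".join(label for label in collapsed if label != blank)


# One label per audio frame (assumed output of an acoustic model).
frames = list("hh_e_ll_lloo")
print(ctc_greedy_decode(frames))  # hello
```

Because every output character must be tied to some audio frame, a decoder of this kind has little room to produce words that were never spoken, which is one way to read the "more constrained" description.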

How it was made usable

To make the data usable by machine learning algorithms, Meta preprocessed it by training an alignment model on existing data in over 100 languages.

To reduce the error rate, Meta said, "We applied multiple rounds of this process and performed a final cross-validation filtering step based on model accuracy to remove potentially misaligned data."

Results obtained

Meta trained multilingual speech recognition models on over 1,100 languages. The company explained the effect this way: "As the number of languages increases, performance does decrease, but only very slightly: Moving from 61 to 1,107 languages increases the character error rate by only about 0.4 percent but increases the language coverage by over 18 times."
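
The figures in that statement can be checked with a few lines of Python. The sketch below shows how a character error rate is conventionally computed (edit distance divided by the length of the reference transcript) and works out the coverage ratio from the two language counts Meta quotes. The function names and the sample strings are assumptions for illustration; this is not Meta's evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up transcripts: one wrong character out of eleven.
cer = character_error_rate("hello world", "hallo world")

# Language counts taken from Meta's statement.
coverage_ratio = 1107 / 61
print(round(coverage_ratio, 1))  # 18.1
```

The same edit-distance function gives a word error rate if it is applied to lists of words instead of strings of characters.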

MMS vs OpenAI Whisper

In a like-for-like comparison with OpenAI's Whisper, Meta said that models trained on the Massively Multilingual Speech data achieve half the word error rate while covering 11 times more languages.

First Published Date: 23 May, 20:20 IST