Meta’s updated AI makes text-to-speech generation more seamless and expressive

Meta introduced its multimodal AI translation model, SeamlessM4T, in August. The tool supports nearly 100 languages for text and 36 languages for speech. Now, with an updated "v2" architecture, the company is expanding its capabilities to make conversational translations more spontaneous and expressive. This is a crucial step toward more authentic conversations across languages, as the lack of expressive translation has been a major obstacle so far.

SeamlessM4T is designed to translate and transcribe seamlessly across speech and text. It accepts nearly 100 input languages for speech-to-text and text-to-text translation, and for speech-to-speech and text-to-speech translation it can render output in any of 36 languages, including English.
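To make those modes concrete, here is a minimal sketch that enumerates the four translation tasks as a dispatch table. The task labels follow common shorthand for these modes, but the `translate` stub and its signature are hypothetical illustrations, not Meta's API.

```python
# Hypothetical illustration of SeamlessM4T's four translation modes.
# The translate() stub is illustrative, not Meta's actual interface.

TASKS = {
    "S2TT": ("speech", "text"),    # speech-to-text translation
    "T2TT": ("text", "text"),      # text-to-text translation
    "S2ST": ("speech", "speech"),  # speech-to-speech translation
    "T2ST": ("text", "speech"),    # text-to-speech translation
}

def translate(task: str, src_lang: str, tgt_lang: str, payload):
    """Dispatch a translation request to the model (stub)."""
    in_mode, out_mode = TASKS[task]
    print(f"{task}: {in_mode} in {src_lang} -> {out_mode} in {tgt_lang}")
    # A real implementation would run the model here.

translate("S2ST", src_lang="eng", tgt_lang="spa", payload=b"<audio bytes>")
```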

The first of the two new features is called "SeamlessExpressive." As the name suggests, it carries your expression over into the translation along with your words: pitch, volume, emotional tone (excitement, sadness, or a whisper), speech rate, and pauses. The result is translated speech that sounds less robotic and more natural. The feature supports several languages, including English, Spanish, German, French, Italian, and Chinese.

The second feature is called "SeamlessStreaming." It lets the tool begin translating a speech while the speaker is still talking, so listeners hear the translation sooner. There is still a short latency of just under two seconds, but it eliminates the need to wait until someone finishes a sentence. The challenge is that different languages have different sentence structures, so Meta had to develop an algorithm that studies partial audio input to determine whether there is enough context to start generating translated output or whether it should keep listening.
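Meta's actual streaming policy is learned, but the core read/write loop it implements can be sketched simply: buffer incoming audio, ask whether the buffered context is sufficient, and either emit a partial translation or keep listening. The sketch below is a minimal, hypothetical illustration of that decision loop; the chunk size, the `enough_context` heuristic, and the stub functions are assumptions, not Meta's algorithm.

```python
CHUNK_MS = 320  # hypothetical audio chunk size, in milliseconds

def enough_context(new_chunks: list) -> bool:
    """Stand-in for the learned read/write policy: decide whether the
    buffered audio carries enough context to commit a partial translation."""
    return len(new_chunks) >= 5  # toy heuristic (~1.6 s of new audio)

def translate_partial(buffer: list) -> str:
    """Stub for the incremental decoder that produces a partial translation."""
    return f"<translation covering {len(buffer)} chunks>"

def stream_translate(mic_chunks):
    """Emit partial translations while the speaker is still talking."""
    buffer, committed = [], 0
    for chunk in mic_chunks:
        buffer.append(chunk)                  # READ: take in more audio
        if enough_context(buffer[committed:]):
            print(translate_partial(buffer))  # WRITE: emit partial output
            committed = len(buffer)           # mark context as consumed

# Simulated microphone feed: 15 chunks of silence.
stream_translate(b"\x00" * 64 for _ in range(15))
```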

SeamlessM4T is built on the existing PyTorch-based multitask UnitY model architecture, which already performs translation across different modalities as well as automatic speech recognition. The model uses the w2v-BERT 2.0 speech encoder, which breaks audio inputs down into component tokens for analysis, and a HiFi-GAN unit vocoder to generate spoken responses.
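Put together, this is a two-pass pipeline: the speech encoder turns audio into representations, a text decoder produces the translation, a unit decoder converts that text into discrete acoustic units, and the HiFi-GAN vocoder renders those units as a waveform. The sketch below wires up that flow with stub components; the class names and interfaces are hypothetical, meant only to show the order of the stages, not Meta's code.

```python
# Hypothetical sketch of the UnitY-style two-pass pipeline behind
# SeamlessM4T v2. All classes here are illustrative stubs.

class W2vBertEncoder:
    def encode(self, audio: bytes) -> list:
        """Turn raw audio into learned representations (stub)."""
        return [0.0] * 8

class TextDecoder:
    def decode(self, features: list, tgt_lang: str) -> str:
        """First pass: generate translated text from audio features (stub)."""
        return f"<translated text in {tgt_lang}>"

class UnitDecoder:
    def decode(self, text: str) -> list:
        """Second pass: map translated text to discrete acoustic units (stub)."""
        return [101, 57, 3]

class HiFiGANVocoder:
    def synthesize(self, units: list) -> bytes:
        """Render acoustic units as an audio waveform (stub)."""
        return bytes(len(units))

def speech_to_speech(audio: bytes, tgt_lang: str) -> bytes:
    features = W2vBertEncoder().encode(audio)        # audio -> representations
    text = TextDecoder().decode(features, tgt_lang)  # pass 1: text translation
    units = UnitDecoder().decode(text)               # pass 2: acoustic units
    return HiFiGANVocoder().synthesize(units)        # units -> waveform

wav = speech_to_speech(b"<input audio>", tgt_lang="spa")
```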
