Expressive Speech Synthesis with Tacotron

  • March 28, 2018

At Google, we’re excited about the recent rapid progress of neural network-based text-to-speech (TTS) research. In particular, end-to-end architectures, such as the Tacotron systems we announced last year, can both simplify voice building pipelines and produce natural-sounding speech. This will help us build better human-computer interfaces, like conversational assistants, audiobook narration, news readers, or voice design software.

To deliver a truly human-like voice, however, a TTS system must learn to model prosody, the collection of expressive factors of speech, such as intonation, stress, and rhythm. Most current end-to-end systems, including Tacotron, don’t explicitly model prosody, so they offer no direct control over how the generated speech sounds. This may lead to monotonous-sounding speech, even when models are trained on very expressive datasets like audiobooks, which often contain character voices with significant variation.

Today, we are excited to share two new papers that address these problems.

Source: googleblog.com
