Expressive Speech Synthesis with Tacotron

  • March 28, 2018

At Google, we’re excited about the recent rapid progress of neural network-based text-to-speech (TTS) research. In particular, end-to-end architectures, such as the Tacotron systems we announced last year, can both simplify voice-building pipelines and produce natural-sounding speech. These advances will help us build better human-computer interfaces, such as conversational assistants, audiobook narration, news readers, and voice design software.

To deliver a truly human-like voice, however, a TTS system must learn to model prosody, the collection of expressive factors of speech, such as intonation, stress, and rhythm. Most current end-to-end systems, including Tacotron, don’t explicitly model prosody, so they offer no direct control over how the generated speech sounds. This may lead to monotonous-sounding speech, even when models are trained on very expressive datasets like audiobooks, which often contain character voices with significant variation.

Today, we are excited to share two new papers that address these problems.

Source: googleblog.com

