Pre-trained text embeddings for enhanced text-to-speech synthesis

Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Shubham Toshniwal, Karen Livescu

Research output: Conference article, peer-reviewed

54 citations (Scopus)


We propose an end-to-end text-to-speech (TTS) synthesis model that explicitly uses information from pre-trained embeddings of the text. Recent work in natural language processing has developed self-supervised representations of text that have proven very effective as pre-training for language understanding tasks. We propose using one such pre-trained representation (BERT) to encode input phrases, as an additional input to a Tacotron2-based sequence-to-sequence TTS model. We hypothesize that the text embeddings contain information about the semantics of the phrase and the importance of each word, which should help TTS systems produce more natural prosody and pronunciation. We conduct subjective listening tests of our proposed models using the 24-hour LJSpeech corpus, finding that they improve mean opinion scores modestly but significantly over a baseline TTS model without pre-trained text embedding input.
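
As a rough illustration of the idea in the abstract, the sketch below shows one way a pre-trained text embedding could be supplied as an additional input to a sequence-to-sequence TTS encoder. The abstract does not say how the embedding is combined with the model, so the concatenation used here is an assumption, as are the function name `condition_on_text_embedding` and the toy vectors, which stand in for real Tacotron2 encoder states and a real BERT phrase embedding.

```python
from typing import List

Vector = List[float]


def condition_on_text_embedding(
    encoder_states: List[Vector], text_embedding: Vector
) -> List[Vector]:
    """Append the same phrase-level embedding to every encoder state.

    Hypothetical helper: concatenation is an assumed combination scheme,
    not necessarily the one used in the paper.
    """
    return [state + text_embedding for state in encoder_states]


# Toy stand-ins: three encoder time steps with 2-dimensional states,
# and a 3-dimensional "pre-trained" phrase embedding.
encoder_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
text_embedding = [1.0, -1.0, 0.5]

# Each conditioned state now carries both the encoder features and the
# phrase embedding, so a downstream attention-based decoder could use both.
conditioned = condition_on_text_embedding(encoder_states, text_embedding)
```

In a real system the encoder states and the embedding would be tensors produced by the TTS encoder and the pre-trained text model; plain lists are used here only to keep the sketch self-contained.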

Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - 2019
Event: 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 - Graz, Austria
Duration: 15 Sep 2019 to 19 Sep 2019

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

