TY - JOUR
T1 - Prosody control of utterance sequence for information delivering
AU - Fukuoka, Ishin
AU - Iwata, Kazuhiko
AU - Kobayashi, Tetsunori
N1 - Publisher Copyright: Copyright © 2017 ISCA.
PY - 2017
Y1 - 2017
AB - We propose a conversational speech synthesis system in which the prosodic features of each utterance are controlled across the entire input text. We have developed a "news-telling system" that delivers news articles through spoken language. The speech synthesis system for news-telling should be able to highlight utterances containing noteworthy information in the article with a distinctive way of speaking so as to impress that information on users. To achieve this, we introduced role and position features of the individual utterances in the article as control parameters for prosody generation across the text. We defined three categories for the role feature: a nucleus (assigned to the utterance that contains the noteworthy information), a front satellite (which precedes the nucleus), and a rear satellite (which follows the nucleus). We investigated how the prosodic features differed depending on the role and position features by analyzing news-telling speech data uttered by a voice actress. We designed the speech synthesis system on the basis of a deep neural network with the role and position features added to its input layer. Objective and subjective evaluation results showed that introducing these features was effective for speech synthesis in information delivery.
KW - Conversational speech
KW - Discourse analysis
KW - Neural network
KW - Prosody
KW - Speech synthesis
UR - http://www.scopus.com/inward/record.url?scp=85039148476&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85039148476&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2017-708
DO - 10.21437/Interspeech.2017-708
M3 - Conference article
AN - SCOPUS:85039148476
SN - 2308-457X
VL - 2017-August
SP - 774
EP - 778
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017
Y2 - 20 August 2017 through 24 August 2017
ER -