TY - GEN
T1 - On Prosody Modeling for ASR+TTS Based Voice Conversion
AU - Huang, Wen-Chin
AU - Hayashi, Tomoki
AU - Li, Xinjian
AU - Watanabe, Shinji
AU - Toda, Tomoki
N1 - Funding Information:
This work was partly supported by JSPS KAKENHI Grant Number 21J20920 and JST CREST Grant Number JPMJCR19A3, Japan. We would also like to thank Yu-Huai Peng and Hung-Shin Lee from Academia Sinica, Taiwan, for training the BNF extractor.
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
AB - In voice conversion (VC), an approach showing promising results in the latest Voice Conversion Challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic content, which is then used as input to a text-to-speech (TTS) system to generate the converted speech. Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity. Although some researchers have considered transferring prosodic clues from the source speech, a speaker mismatch arises between training and conversion. To address this issue, in this work we propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP). We evaluate both methods on the VCC2020 benchmark and consider different linguistic representations. The results demonstrate the effectiveness of TTP in both objective and subjective evaluations.
KW - automatic speech recognition
KW - global style token
KW - prosody
KW - text-to-speech
KW - voice conversion
UR - http://www.scopus.com/inward/record.url?scp=85117754614&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85117754614&partnerID=8YFLogxK
U2 - 10.1109/ASRU51503.2021.9688010
DO - 10.1109/ASRU51503.2021.9688010
M3 - Conference contribution
AN - SCOPUS:85117754614
T3 - 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings
SP - 642
EP - 649
BT - 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021
Y2 - 13 December 2021 through 17 December 2021
ER -