TY - JOUR
T1 - Statistical voice conversion based on noisy channel model
AU - Saito, Daisuke
AU - Watanabe, Shinji
AU - Nakamura, Atsushi
AU - Minematsu, Nobuaki
N1 - Funding Information:
Manuscript received June 16, 2011; revised December 03, 2011; accepted January 26, 2012. Date of publication February 22, 2012; date of current version April 04, 2012. Part of this work was conducted while the first author was an intern at NTT Communication Science Laboratories and the second author was with NTT Communication Science Laboratories, NTT Corporation. It was conducted as a joint research project of NTT Corporation and The University of Tokyo. This work was supported in part by KAKENHI Grant-in-Aid for JSPS Fellows (22-8861). The work of D. Saito was supported by the Japan Society for the Promotion of Science. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Chung-Hsien Wu.
PY - 2012
Y1 - 2012
N2 - This paper describes a novel voice conversion framework that effectively uses both a joint density model and a speaker model. In voice conversion studies, approaches based on a Gaussian mixture model (GMM) of the probability density of joint vectors of the source and target speakers are widely used to estimate a transform function between the two speakers. However, to achieve sufficient quality, these approaches require a parallel corpus that contains many utterances with the same linguistic content spoken by both speakers. In addition, joint density GMM methods often suffer from overtraining when the amount of training data is small. To address these problems, we propose a voice conversion framework that integrates the speaker GMM of the target with the joint density model using a noisy channel model. The proposed method independently trains the joint density model with a few parallel utterances and the speaker model with nonparallel data of the target. This can ease the burden on the source speaker. Experiments demonstrate the effectiveness of the proposed method, especially when the parallel corpus is small.
KW - Joint density model
KW - noisy channel model
KW - probabilistic integration
KW - speaker model
KW - voice conversion (VC)
UR - http://www.scopus.com/inward/record.url?scp=84859768504&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84859768504&partnerID=8YFLogxK
U2 - 10.1109/TASL.2012.2188628
DO - 10.1109/TASL.2012.2188628
M3 - Article
AN - SCOPUS:84859768504
SN - 1558-7916
VL - 20
SP - 1784
EP - 1794
JO - IEEE Transactions on Audio, Speech and Language Processing
JF - IEEE Transactions on Audio, Speech and Language Processing
IS - 6
M1 - 6156420
ER -