TY - JOUR
T1 - Speaker adversarial training of DPGMM-based feature extractor for zero-resource languages
AU - Higuchi, Yosuke
AU - Tawara, Naohiro
AU - Kobayashi, Tetsunori
AU - Ogawa, Tetsuji
N1 - Funding Information:
This work was supported by JSPS KAKENHI Grant Number 17K12718.
Publisher Copyright:
Copyright © 2019 ISCA
PY - 2019
Y1 - 2019
N2 - We propose a novel framework for extracting speaker-invariant features for zero-resource languages. A deep neural network (DNN)-based acoustic model is normalized against speakers via adversarial training: a multi-task learning process trains a shared bottleneck feature to be discriminative to phonemes and independent of speakers. However, owing to the absence of phoneme labels, zero-resource languages cannot employ adversarial multi-task (AMT) learning for speaker normalization. In this work, we obtain a posteriorgram from a Dirichlet process Gaussian mixture model (DPGMM) and utilize the posterior vector for supervision of the phoneme estimation in the AMT training. The AMT network is designed so that the DPGMM posteriorgram itself is embedded in a speaker-invariant feature space. The proposed network is expected to resolve the potential problem that the posteriorgram may lack reliability as a phoneme representation if the DPGMM components are intermingled with phoneme and speaker information. Based on the Zero Resource Speech Challenges, we conduct phoneme discriminant experiments on the extracted features. The results of the experiments show that the proposed framework extracts discriminative features, suppressing the variety in speakers.
AB - We propose a novel framework for extracting speaker-invariant features for zero-resource languages. A deep neural network (DNN)-based acoustic model is normalized against speakers via adversarial training: a multi-task learning process trains a shared bottleneck feature to be discriminative to phonemes and independent of speakers. However, owing to the absence of phoneme labels, zero-resource languages cannot employ adversarial multi-task (AMT) learning for speaker normalization. In this work, we obtain a posteriorgram from a Dirichlet process Gaussian mixture model (DPGMM) and utilize the posterior vector for supervision of the phoneme estimation in the AMT training. The AMT network is designed so that the DPGMM posteriorgram itself is embedded in a speaker-invariant feature space. The proposed network is expected to resolve the potential problem that the posteriorgram may lack reliability as a phoneme representation if the DPGMM components are intermingled with phoneme and speaker information. Based on the Zero Resource Speech Challenges, we conduct phoneme discriminant experiments on the extracted features. The results of the experiments show that the proposed framework extracts discriminative features, suppressing the variety in speakers.
KW - Adversarial multi-task learning
KW - Dirichlet process Gaussian mixture model
KW - Embeddings
KW - Speech recognition
KW - Zero-resource language
UR - http://www.scopus.com/inward/record.url?scp=85074692185&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85074692185&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2019-2052
DO - 10.21437/Interspeech.2019-2052
M3 - Conference article
AN - SCOPUS:85074692185
SN - 2308-457X
VL - 2019-September
SP - 266
EP - 270
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019
Y2 - 15 September 2019 through 19 September 2019
ER -
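
Note: the abstract above describes an adversarial multi-task (AMT) network in which DPGMM posteriorgrams stand in for missing phoneme labels and a speaker branch is trained adversarially through a gradient-reversal layer so the shared bottleneck feature becomes speaker-invariant. The following is a minimal illustrative sketch of that general scheme, assuming PyTorch; the class names (AMTNetwork, GradientReversal, amt_loss), the layer sizes, and the loss weighting alpha are all hypothetical choices, not the authors' implementation.

# Hypothetical sketch (not the paper's code): adversarial multi-task learning
# with DPGMM posteriorgram supervision and a gradient-reversal speaker branch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; flips and scales gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing into the shared encoder.
        return -ctx.lambda_ * grad_output, None


class AMTNetwork(nn.Module):
    def __init__(self, input_dim, bottleneck_dim, num_dpgmm_components, num_speakers):
        super().__init__()
        # Shared encoder producing the bottleneck (intended speaker-invariant) feature.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, bottleneck_dim), nn.ReLU(),
        )
        # Head 1: predict the DPGMM posteriorgram (stand-in for phoneme labels).
        self.phoneme_head = nn.Linear(bottleneck_dim, num_dpgmm_components)
        # Head 2: adversarial speaker classifier placed behind gradient reversal.
        self.speaker_head = nn.Linear(bottleneck_dim, num_speakers)

    def forward(self, feats, lambda_=1.0):
        bottleneck = self.encoder(feats)
        phoneme_logits = self.phoneme_head(bottleneck)
        reversed_feat = GradientReversal.apply(bottleneck, lambda_)
        speaker_logits = self.speaker_head(reversed_feat)
        return bottleneck, phoneme_logits, speaker_logits


def amt_loss(phoneme_logits, dpgmm_posteriors, speaker_logits, speaker_ids, alpha=1.0):
    # Soft-target loss against the DPGMM posteriorgram plus speaker cross entropy;
    # the reversal layer already flips the speaker gradient for the encoder.
    phoneme_loss = F.kl_div(
        F.log_softmax(phoneme_logits, dim=-1), dpgmm_posteriors, reduction="batchmean"
    )
    speaker_loss = F.cross_entropy(speaker_logits, speaker_ids)
    return phoneme_loss + alpha * speaker_loss

In a typical use of such a sketch, acoustic frames (e.g. MFCC or filterbank features) would be passed through AMTNetwork, the bottleneck output taken as the extracted feature, and lambda_ possibly annealed during training; these usage details are assumptions for illustration, not taken from the cited paper.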