TY - GEN
T1 - Speaker Invariant Feature Extraction for Zero-Resource Languages with Adversarial Learning
AU - Tsuchiya, Taira
AU - Tawara, Naohiro
AU - Ogawa, Tetsuji
AU - Kobayashi, Tetsunori
N1 - Funding Information:
This work was supported by JSPS KAKENHI Grant Number JP16K12465.
Publisher Copyright:
© 2018 IEEE.
PY - 2018/9/10
Y1 - 2018/9/10
N2 - We introduce a novel type of representation learning to obtain speaker-invariant features for zero-resource languages. Speaker adaptation is an important technique for building a robust acoustic model. For a zero-resource language, however, conventional model-dependent speaker adaptation methods such as constrained maximum likelihood linear regression are insufficient because the acoustic model of the target language is not accessible. We therefore introduce model-independent feature extraction based on a neural network. Specifically, we apply multi-task learning to a bottleneck feature-based approach to make the bottleneck features invariant to changes of speaker. The proposed network simultaneously tackles two tasks: phoneme classification and speaker classification. It trains a feature extractor in an adversarial manner so that the extractor maps input data into a representation that is discriminative for predicting phonemes yet uninformative for predicting speakers. We conducted phone discriminability experiments on the Zero Resource Speech Challenge 2017. Experimental results showed that our multi-task network yielded more discriminative features by eliminating speaker variability.
AB - We introduce a novel type of representation learning to obtain speaker-invariant features for zero-resource languages. Speaker adaptation is an important technique for building a robust acoustic model. For a zero-resource language, however, conventional model-dependent speaker adaptation methods such as constrained maximum likelihood linear regression are insufficient because the acoustic model of the target language is not accessible. We therefore introduce model-independent feature extraction based on a neural network. Specifically, we apply multi-task learning to a bottleneck feature-based approach to make the bottleneck features invariant to changes of speaker. The proposed network simultaneously tackles two tasks: phoneme classification and speaker classification. It trains a feature extractor in an adversarial manner so that the extractor maps input data into a representation that is discriminative for predicting phonemes yet uninformative for predicting speakers. We conducted phone discriminability experiments on the Zero Resource Speech Challenge 2017. Experimental results showed that our multi-task network yielded more discriminative features by eliminating speaker variability.
KW - Adversarial multi-task learning
KW - FMLLR
KW - Representation learning
KW - Speaker invariant feature
KW - Zero resource speech challenge
UR - http://www.scopus.com/inward/record.url?scp=85054272708&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85054272708&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2018.8461648
DO - 10.1109/ICASSP.2018.8461648
M3 - Conference contribution
AN - SCOPUS:85054272708
SN - 9781538646588
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 2381
EP - 2385
BT - 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018
Y2 - 15 April 2018 through 20 April 2018
ER -
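
Editor's note: the adversarial multi-task scheme the abstract describes (a feature extractor trained so that phoneme classification succeeds while speaker classification fails) is commonly realized with a gradient-reversal layer, as in domain-adversarial training. The sketch below is an illustrative reconstruction under that assumption, not the authors' implementation; all names (GradReverse, SpeakerInvariantBNF, lambd), layer sizes, and label counts are hypothetical placeholders.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; reverses and scales the gradient on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class SpeakerInvariantBNF(nn.Module):
    """Bottleneck feature extractor with phoneme and adversarial speaker heads.
    Architecture and dimensions are assumptions for illustration only."""
    def __init__(self, input_dim, bottleneck_dim, num_phones, num_speakers, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.extractor = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, bottleneck_dim),
        )
        self.phone_head = nn.Linear(bottleneck_dim, num_phones)
        self.speaker_head = nn.Linear(bottleneck_dim, num_speakers)

    def forward(self, x):
        bnf = self.extractor(x)
        phone_logits = self.phone_head(bnf)
        # The speaker head learns to classify speakers, but the reversed gradient
        # pushes the extractor to strip speaker information from the bottleneck.
        speaker_logits = self.speaker_head(GradReverse.apply(bnf, self.lambd))
        return bnf, phone_logits, speaker_logits

# Minimal joint training step on dummy data (both tasks use cross-entropy).
model = SpeakerInvariantBNF(input_dim=40, bottleneck_dim=39, num_phones=48, num_speakers=100)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

feats = torch.randn(32, 40)                   # e.g. acoustic feature frames
phone_labels = torch.randint(0, 48, (32,))
speaker_labels = torch.randint(0, 100, (32,))

_, phone_logits, speaker_logits = model(feats)
loss = ce(phone_logits, phone_labels) + ce(speaker_logits, speaker_labels)
opt.zero_grad()
loss.backward()
opt.step()

At test time only the extractor's bottleneck output would be kept as the speaker-invariant feature; the classification heads exist solely to shape it during training.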