TY - GEN
T1 - Differentiable allophone graphs for language-universal speech recognition
AU - Yan, Brian
AU - Dalmia, Siddharth
AU - Mortensen, David R.
AU - Metze, Florian
AU - Watanabe, Shinji
N1 - Funding Information:
We thank Xinjian Li and Awni Hannun for helpful discussions. This work was supported in part by grants from National Science Foundation for Bridges PSC (ACI-1548562, ACI-1445606) and DARPA KAIROS program from the Air Force Research Laboratory (FA8750-19-2-0200). The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.
Publisher Copyright:
Copyright © 2021 ISCA.
PY - 2021
Y1 - 2021
N2 - Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages. While speech annotations at the language-specific phoneme or surface levels are readily available, annotations at a universal phone level are relatively rare and difficult to produce. In this work, we present a general framework to derive phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings with learnable weights represented using weighted finite-state transducers, which we call differentiable allophone graphs. By training multilingually, we build a universal phone-based speech recognition model with interpretable probabilistic phone-to-phoneme mappings for each language. These phone-based systems with learned allophone graphs can be used by linguists to document new languages, build phone-based lexicons that capture rich pronunciation variations, and re-evaluate the allophone mappings of seen languages. We demonstrate the aforementioned benefits of our proposed framework with a system trained on 7 diverse languages.
AB - Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages. While speech annotations at the language-specific phoneme or surface levels are readily available, annotations at a universal phone level are relatively rare and difficult to produce. In this work, we present a general framework to derive phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings with learnable weights represented using weighted finite-state transducers, which we call differentiable allophone graphs. By training multilingually, we build a universal phone-based speech recognition model with interpretable probabilistic phone-to-phoneme mappings for each language. These phone-based systems with learned allophone graphs can be used by linguists to document new languages, build phone-based lexicons that capture rich pronunciation variations, and re-evaluate the allophone mappings of seen languages. We demonstrate the aforementioned benefits of our proposed framework with a system trained on 7 diverse languages.
KW - Allophones
KW - Differentiable WFST
KW - Multilingual ASR
KW - Phonetic pronunciation
KW - Universal phone recognition
UR - http://www.scopus.com/inward/record.url?scp=85116705031&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85116705031&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-1944
DO - 10.21437/Interspeech.2021-1944
M3 - Conference contribution
AN - SCOPUS:85116705031
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 356
EP - 360
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Y2 - 30 August 2021 through 3 September 2021
ER -