TY - GEN
T1 - Highland Puebla Nahuatl–Spanish Speech Translation Corpus for Endangered Language Documentation
AU - Shi, Jiatong
AU - Amith, Jonathan D.
AU - Chang, Xuankai
AU - Dalmia, Siddharth
AU - Yan, Brian
AU - Watanabe, Shinji
N1 - Funding Information:
The authors gratefully acknowledge the following support for documenting and studying Highland Puebla Nahuat: National Science Foundation, Documenting Endangered Languages (DEL): Awards 1401178, 0756536 (Amith, PI on both awards); National Endowment for the Humanities, Preservation and Access: PD-50031-14 (Amith, PI); Endangered Language Documentation Programme: Award MDP0272 (Amith, PI). The native speaker documentation team responsible for transcription and translation included Amelia Domínguez Alcántara, Ceferino Salgado Castañeda, Hermelindo Salazar Osollo, and Eleuterio Gorostiza Salazar. Yoloxóchitl Mixtec documentation was supported by the following grants: NSF-DEL: Awards 1761421, 1500595, 0966462 (Amith, PI on all three awards; the second was a collaborative project with SRI International, Award 1500738, Andreas Kathol, PI); Endangered Language Documentation Programme: Awards MDP0201, PPG0048 (Amith, PI on both awards). Rey Castillo García has been responsible for all transcriptions.
Publisher Copyright:
© 2021 Association for Computational Linguistics
PY - 2021
Y1 - 2021
AB - Documentation of endangered languages (ELs) has become increasingly urgent as thousands of languages are on the verge of disappearing by the end of the 21st century. One challenging aspect of documentation is to develop machine learning tools to automate the processing of EL audio via automatic speech recognition (ASR), machine translation (MT), or speech translation (ST). This paper presents an open-access speech translation corpus of Highland Puebla Nahuatl (glottocode high1278), an EL spoken in central Mexico. It then addresses machine learning contributions to endangered language documentation and argues for the importance of speech translation as a key element in the documentation process. In our experiments, we observed that state-of-the-art end-to-end ST models could outperform a cascaded ST (ASR > MT) pipeline when translating endangered language documentation materials.
UR - http://www.scopus.com/inward/record.url?scp=85116773815&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85116773815&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85116773815
T3 - Proceedings of the 1st Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2021
SP - 53
EP - 63
BT - Proceedings of the 1st Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2021
A2 - Mager, Manuel
A2 - Oncevay, Arturo
A2 - Rios, Annette
A2 - Ruiz, Ivan Vladimir Meza
A2 - Palmer, Alexis
A2 - Neubig, Graham
A2 - Kann, Katharina
PB - Association for Computational Linguistics (ACL)
T2 - 1st Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2021
Y2 - 11 June 2021
ER -