Highland Puebla Nahuatl–Spanish Speech Translation Corpus for Endangered Language Documentation

Jiatong Shi, Jonathan D. Amith, Xuankai Chang, Siddharth Dalmia, Brian Yan, Shinji Watanabe

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

Documentation of endangered languages (ELs) has become increasingly urgent as thousands of languages are on the verge of disappearing by the end of the 21st century. One challenging aspect of documentation is to develop machine learning tools to automate the processing of EL audio via automatic speech recognition (ASR), machine translation (MT), or speech translation (ST). This paper presents an open-access speech translation corpus of Highland Puebla Nahuatl (glottocode high1278), an EL spoken in central Mexico. It then addresses machine learning contributions to endangered language documentation and argues for the importance of speech translation as a key element in the documentation process. In our experiments, we observed that state-of-the-art end-to-end ST models could outperform a cascaded ST (ASR > MT) pipeline when translating endangered language documentation materials.

Original language: English
Title of host publication: Proceedings of the 1st Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2021
Editors: Manuel Mager, Arturo Oncevay, Annette Rios, Ivan Vladimir Meza Ruiz, Alexis Palmer, Graham Neubig, Katharina Kann
Publisher: Association for Computational Linguistics (ACL)
Pages: 53-63
Number of pages: 11
ISBN (Electronic): 9781954085442
Publication status: Published - 2021
Externally published: Yes
Event: 1st Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2021 - Virtual, Online
Duration: 2021 Jun 11 → …

Publication series

Name: Proceedings of the 1st Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2021

Conference

Conference: 1st Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2021
City: Virtual, Online
Period: 21/6/11 → …

ASJC Scopus subject areas

  • Computer Science Applications
  • Computational Theory and Mathematics
  • Information Systems
