Xuankai Chang


2021

Highland Puebla Nahuatl Speech Translation Corpus for Endangered Language Documentation
Jiatong Shi | Jonathan D. Amith | Xuankai Chang | Siddharth Dalmia | Brian Yan | Shinji Watanabe
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

Documentation of endangered languages (ELs) has become increasingly urgent as thousands of languages are on the verge of disappearing by the end of the 21st century. One challenging aspect of documentation is to develop machine learning tools to automate the processing of EL audio via automatic speech recognition (ASR), machine translation (MT), or speech translation (ST). This paper presents an open-access speech translation corpus of Highland Puebla Nahuatl (glottocode high1278), an EL spoken in central Mexico. It then addresses machine learning contributions to endangered language documentation and argues for the importance of speech translation as a key element in the documentation process. In our experiments, we observed that state-of-the-art end-to-end ST models could outperform a cascaded ST (ASR > MT) pipeline when translating endangered language documentation materials.