Wav2pos: Exploring syntactic analysis from audio for Highland Puebla Nahuatl

Robert Pugh, Varun Sreedhar, Francis Tyers


Abstract
We describe an approach to part-of-speech tagging from audio with very little human-annotated data, for Highland Puebla Nahuatl, a low-resource language of Mexico. While automatic morphosyntactic analysis is typically trained on annotated textual data, large amounts of text is rarely available for low-resource, marginalized, and/or minority languages, and morphosyntactically-annotated data is even harder to come by. Much of the data from these languages may exist in the form of recordings, often only partially-transcribed or analyzed by field linguists working on language documentation projects. Given this relatively low-availability of text in the low-resource language scenario, we explore end-to-end automated morphosyntactic analysis directly from audio. The experiments described in this paper focus on one piece of morphosyntax, part-of-speech tagging, and builds on existing work in a high-resource setting. We use weak supervision to increase training volume, and explore a few techniques for generating word-level predictions from the acoustic features. Our experiments show promising results, despite less than 400 sentences of audio-aligned, manually-labeled text.
Anthology ID:
2024.americasnlp-1.13
Volume:
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Manuel Mager, Abteen Ebrahimi, Shruti Rijhwani, Arturo Oncevay, Luis Chiruzzo, Robert Pugh, Katharina von der Wense
Venues:
AmericasNLP | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
121–126
Language:
URL:
https://aclanthology.org/2024.americasnlp-1.13
DOI:
Bibkey:
Cite (ACL):
Robert Pugh, Varun Sreedhar, and Francis Tyers. 2024. Wav2pos: Exploring syntactic analysis from audio for Highland Puebla Nahuatl. In Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024), pages 121–126, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Wav2pos: Exploring syntactic analysis from audio for Highland Puebla Nahuatl (Pugh et al., AmericasNLP-WS 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.americasnlp-1.13.pdf