Abstract
This study takes up the task of low-resource morphological segmentation for Seneca, a critically endangered and morphologically complex Native American language primarily spoken in what is now New York State and Ontario. The labeled data in our experiments comes from two sources: one digitized from a publicly available grammar book and the other collected from informal sources. We treat these two sources as distinct domains and investigate different evaluation designs for model selection. The first design abides by standard practice and evaluates models with the in-domain development set, while the second carries out evaluation using a development domain, that is, the out-of-domain development set. Across a series of monolingual and crosslinguistic training settings, our results demonstrate the utility of a neural encoder-decoder architecture when coupled with multi-task learning.
- Anthology ID:
- 2021.americasnlp-1.10
- Volume:
- Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas
- Month:
- June
- Year:
- 2021
- Address:
- Online
- Editors:
- Manuel Mager, Arturo Oncevay, Annette Rios, Ivan Vladimir Meza Ruiz, Alexis Palmer, Graham Neubig, Katharina Kann
- Venue:
- AmericasNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 90–101
- URL:
- https://aclanthology.org/2021.americasnlp-1.10
- DOI:
- 10.18653/v1/2021.americasnlp-1.10
- Cite (ACL):
- Zoey Liu, Robert Jimerson, and Emily Prud’hommeaux. 2021. Morphological Segmentation for Seneca. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pages 90–101, Online. Association for Computational Linguistics.
- Cite (Informal):
- Morphological Segmentation for Seneca (Liu et al., AmericasNLP 2021)
- PDF:
- https://preview.aclanthology.org/ingest-acl-2023-videos/2021.americasnlp-1.10.pdf
- Code
- zoeyliu18/seneca