Toward More Meaningful Resources for Lower-resourced Languages
Constantine Lignos, Nolan Holley, Chester Palen-Michel, Jonne Sälevä
Abstract
In this position paper, we describe our perspective on how meaningful resources for lower-resourced languages should be developed in connection with the speakers of those languages. Before advancing that position, we first examine two massively multilingual resources used in language technology development, identifying shortcomings that limit their usefulness. We explore the contents of the names stored in Wikidata for a few lower-resourced languages and find that many of them are not in fact in the languages they claim to be, requiring non-trivial effort to correct. We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand-annotated data. We then discuss the importance of creating annotations for lower-resourced languages in a thoughtful and ethical way that includes the language speakers as part of the development process. We conclude with recommended guidelines for resource development.
- Anthology ID:
- 2022.findings-acl.44
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2022
- Month:
- May
- Year:
- 2022
- Address:
- Dublin, Ireland
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 523–532
- URL:
- https://aclanthology.org/2022.findings-acl.44
- DOI:
- 10.18653/v1/2022.findings-acl.44
- Cite (ACL):
- Constantine Lignos, Nolan Holley, Chester Palen-Michel, and Jonne Sälevä. 2022. Toward More Meaningful Resources for Lower-resourced Languages. In Findings of the Association for Computational Linguistics: ACL 2022, pages 523–532, Dublin, Ireland. Association for Computational Linguistics.
- Cite (Informal):
- Toward More Meaningful Resources for Lower-resourced Languages (Lignos et al., Findings 2022)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2022.findings-acl.44.pdf
- Data:
- MasakhaNER, WikiAnn