Toward More Meaningful Resources for Lower-resourced Languages

Constantine Lignos, Nolan Holley, Chester Palen-Michel, Jonne Sälevä


Abstract
In this position paper, we describe our perspective on how meaningful resources for lower-resourced languages should be developed in connection with the speakers of those languages. Before advancing that position, we first examine two massively multilingual resources used in language technology development, identifying shortcomings that limit their usefulness. We explore the contents of the names stored in Wikidata for a few lower-resourced languages and find that many of them are not in fact in the languages they claim to be, requiring non-trivial effort to correct. We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand-annotated data. We then discuss the importance of creating annotations for lower-resourced languages in a thoughtful and ethical way that includes the language speakers as part of the development process. We conclude with recommended guidelines for resource development.
Anthology ID:
2022.findings-acl.44
Volume:
Findings of the Association for Computational Linguistics: ACL 2022
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
523–532
URL:
https://aclanthology.org/2022.findings-acl.44
DOI:
10.18653/v1/2022.findings-acl.44
Cite (ACL):
Constantine Lignos, Nolan Holley, Chester Palen-Michel, and Jonne Sälevä. 2022. Toward More Meaningful Resources for Lower-resourced Languages. In Findings of the Association for Computational Linguistics: ACL 2022, pages 523–532, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Toward More Meaningful Resources for Lower-resourced Languages (Lignos et al., Findings 2022)
PDF:
https://preview.aclanthology.org/naacl24-info/2022.findings-acl.44.pdf
Video:
https://preview.aclanthology.org/naacl24-info/2022.findings-acl.44.mp4
Data:
MasakhaNER, WikiANN