Developing Infrastructure for Low-Resource Language Corpus Building
Hedwig G. Sekeres, Wilbert Heeringa, Wietse de Vries, Oscar Yde Zwagers, Martijn Wieling, Goffe Th. Jensma
Abstract
For many of the world’s small languages, few resources are available. In this project, a written online accessible corpus was created for the minority language variant Gronings, which serves both researchers interested in language change and variation and a general audience of (new) speakers interested in finding real-life examples of language use. The corpus was created using a combination of volunteer work and automation, which together formed an efficient pipeline for converting printed text to Key Words in Context (KWICs), annotated with lemmas and part-of-speech tags. In the creation of the corpus, we have taken into account several of the challenges that can occur when creating resources for minority languages, such as a lack of standardisation and limited (financial) resources. As the solutions we offer are applicable to other small languages as well, each step of the corpus creation process is discussed and resources will be made available benefiting future projects on other low-resource languages.
- Anthology ID:
- 2024.sigul-1.10
- Volume:
- Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
- Month:
- May
- Year:
- 2024
- Address:
- Torino, Italia
- Editors:
- Maite Melero, Sakriani Sakti, Claudia Soria
- Venues:
- SIGUL | WS
- Publisher:
- ELRA and ICCL
- Pages:
- 72–78
- URL:
- https://aclanthology.org/2024.sigul-1.10
- Cite (ACL):
- Hedwig G. Sekeres, Wilbert Heeringa, Wietse de Vries, Oscar Yde Zwagers, Martijn Wieling, and Goffe Th. Jensma. 2024. Developing Infrastructure for Low-Resource Language Corpus Building. In Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, pages 72–78, Torino, Italia. ELRA and ICCL.
- Cite (Informal):
- Developing Infrastructure for Low-Resource Language Corpus Building (Sekeres et al., SIGUL-WS 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2024.sigul-1.10.pdf