Creating Corpus of Low Resource Indian Languages for Natural Language Processing: Challenges and Opportunities

Pratibha Dongare


Abstract
Natural Language Processing tasks require sufficient, high-quality data. Working with low-resource languages poses a significant challenge because established methodologies, frameworks, and collaborative efforts are lacking. This paper briefly outlines the challenges of standardization in data creation, focusing on Indian languages, which are often categorized as low-resource languages. It proposes potential solutions and discusses the importance of standardized procedures for low-resource language data. It also highlights the critical role of standardized protocols in corpus creation and their impact on research. Lastly, the paper concludes by defining what constitutes a corpus.
Anthology ID:
2024.wildre-1.8
Volume:
Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Girish Nath Jha, Sobha L., Kalika Bali, Atul Kr. Ojha
Venues:
WILDRE | WS
Publisher:
ELRA and ICCL
Pages:
54–58
URL:
https://aclanthology.org/2024.wildre-1.8
Cite (ACL):
Pratibha Dongare. 2024. Creating Corpus of Low Resource Indian Languages for Natural Language Processing: Challenges and Opportunities. In Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation, pages 54–58, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Creating Corpus of Low Resource Indian Languages for Natural Language Processing: Challenges and Opportunities (Dongare, WILDRE-WS 2024)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2024.wildre-1.8.pdf