Abstract
Addressing tasks in Natural Language Processing requires access to sufficient, high-quality data. However, working with languages that have limited resources poses a significant challenge due to the absence of established methodologies, frameworks, and collaborative efforts. This paper briefly outlines the challenges associated with standardization in data creation, focusing on Indian languages, which are often categorized as low-resource languages. It also proposes potential solutions and argues for the importance of standardized procedures for low-resource language data. Furthermore, it highlights the critical role of standardized protocols in corpus creation and their impact on research. Lastly, the paper concludes by defining what constitutes a corpus.
- Anthology ID: 2024.wildre-1.8
- Volume: Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation
- Month: May
- Year: 2024
- Address: Torino, Italia
- Editors: Girish Nath Jha, Sobha L., Kalika Bali, Atul Kr. Ojha
- Venues: WILDRE | WS
- Publisher: ELRA and ICCL
- Pages: 54–58
- URL: https://aclanthology.org/2024.wildre-1.8
- Cite (ACL): Pratibha Dongare. 2024. Creating Corpus of Low Resource Indian Languages for Natural Language Processing: Challenges and Opportunities. In Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation, pages 54–58, Torino, Italia. ELRA and ICCL.
- Cite (Informal): Creating Corpus of Low Resource Indian Languages for Natural Language Processing: Challenges and Opportunities (Dongare, WILDRE-WS 2024)
- PDF: https://preview.aclanthology.org/nschneid-patch-5/2024.wildre-1.8.pdf