Gian Carlo Orcotoma Mormontoy
2025
Towards the Creation of a Collao Quechua–Spanish Parallel Corpus Using Optical Character Recognition
Gian Carlo Orcotoma Mormontoy | Lida Leon Nuñez | Hugo Espetia Huamanga
Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models
The Quechua language is a fundamental element of Peru’s social and cultural identity and carries both linguistic and cultural significance. However, it faces substantial challenges in terms of digital representation. One major limitation is the scarcity of resources such as parallel corpora, which hinders the development of technological tools for its analysis and practical application. This study addresses this gap through a methodology for building a parallel corpus using Optical Character Recognition (OCR). We digitized a collection of texts from a common origin to create a corpus that enables reliable access. The resulting corpus serves as a valuable asset for linguistic and Natural Language Processing (NLP) research, as well as for Quechua speakers. The source material derives from works produced by graduate students from the Academia Mayor de la Lengua Quechua and validated by academic staff, which ensures grammatical, syntactic, and semantic integrity.