Towards the Creation of a Collao Quechua–Spanish Parallel Corpus Using Optical Character Recognition

Gian Carlo Orcotoma Mormontoy, Lida Leon Nuñez, Hugo Espetia Huamanga


Abstract
The Quechua language is a fundamental element of Peru’s social and cultural identity and carries considerable linguistic and cultural significance. However, it faces substantial challenges in terms of digital representation. One major limitation is the scarcity of resources such as parallel corpora, which hinders the development of technological tools for its analysis and practical application. This study addresses this gap with a methodology for building a parallel corpus using Optical Character Recognition (OCR). We digitized a collection of texts from a common origin to create a corpus that enables reliable access. The resulting corpus serves as a valuable asset for linguistic and Natural Language Processing (NLP) research, as well as for Quechua speakers. The source material derives from works produced by graduate students of the Academia Mayor de la Lengua Quechua and validated by academic staff, ensuring grammatical, syntactic, and semantic integrity.
Anthology ID:
2025.globalnlp-1.1
Volume:
Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models
Month:
September
Year:
2025
Address:
Varna, Bulgaria
Editors:
Sudhansu Bala Das, Pruthwik Mishra, Alok Singh, Shamsuddeen Hassan Muhammad, Asif Ekbal, Uday Kumar Das
Venues:
GlobalNLP | WS
Publisher:
INCOMA Ltd., Shoumen, BULGARIA
Pages:
1–6
URL:
https://preview.aclanthology.org/corrections-2026-01/2025.globalnlp-1.1/
Cite (ACL):
Gian Carlo Orcotoma Mormontoy, Lida Leon Nuñez, and Hugo Espetia Huamanga. 2025. Towards the Creation of a Collao Quechua–Spanish Parallel Corpus Using Optical Character Recognition. In Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models, pages 1–6, Varna, Bulgaria. INCOMA Ltd., Shoumen, BULGARIA.
Cite (Informal):
Towards the Creation of a Collao Quechua–Spanish Parallel Corpus Using Optical Character Recognition (Orcotoma Mormontoy et al., GlobalNLP 2025)
PDF:
https://preview.aclanthology.org/corrections-2026-01/2025.globalnlp-1.1.pdf