Unlocking Knowledge with OCR-Driven Document Digitization for Peruvian Indigenous Languages

Shadya Sanchez Carrera, Roberto Zariquiey, Arturo Oncevay


Abstract
The current focus on resource-rich languages poses a challenge to linguistic diversity, affecting minority languages with limited digital presence and relatively old published and unpublished resources. To address this issue, this study targets the digitization of old scanned textbooks written in four Peruvian indigenous languages (Asháninka, Shipibo-Konibo, Yanesha, and Yine) using Optical Character Recognition (OCR) technology, complemented with text correction methods to minimize extraction errors. Contributions include an annotated dataset of 454 scanned page images for rigorous evaluation and a module to correct OCR-generated transcription alignments.
Anthology ID:
2024.americasnlp-1.11
Volume:
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Manuel Mager, Abteen Ebrahimi, Shruti Rijhwani, Arturo Oncevay, Luis Chiruzzo, Robert Pugh, Katharina von der Wense
Venues:
AmericasNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
103–111
URL:
https://aclanthology.org/2024.americasnlp-1.11
Cite (ACL):
Shadya Sanchez Carrera, Roberto Zariquiey, and Arturo Oncevay. 2024. Unlocking Knowledge with OCR-Driven Document Digitization for Peruvian Indigenous Languages. In Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024), pages 103–111, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Unlocking Knowledge with OCR-Driven Document Digitization for Peruvian Indigenous Languages (Sanchez Carrera et al., AmericasNLP-WS 2024)
PDF:
https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.americasnlp-1.11.pdf
Supplementary material:
2024.americasnlp-1.11.SupplementaryMaterial.zip