AILLA-OCR: A First Textual and Structural Post-OCR Dataset for 8 Indigenous Languages of Latin America

Milind Agarwal; Antonios Anastasopoulos

Abstract

It is by now common knowledge in the NLP community that low-resource languages need large-scale data creation efforts and novel contributions in the form of robust algorithms that work in data-scarce settings. Amongst these languages, however, many have a large amount of data, ripe for NLP applications, except that this data exists in image-based formats. This includes scanned copies of extremely valuable dictionaries, linguistic field notes, children’s stories, plays, and other textual material. To extract the text data from these non-machine-readable images, Optical Character Recognition (OCR) is the most popular technique, but it has proven to be challenging for low-resource languages because of their unique properties (uncommon diacritics, rare words, etc.) and due to a general lack of preserved page structure in the OCR output. So, to contribute to the reduction of these two big bottlenecks (lack of text data and layout quality), we release the first textual and structural OCR dataset for 8 indigenous languages of Latin America. We hope that our dataset will encourage researchers within the NLP and Computational Linguistics communities to work with these languages.

Anthology ID: 2025.computel-main.13
Volume: Proceedings of the Eight Workshop on the Use of Computational Methods in the Study of Endangered Languages
Month: March
Year: 2025
Address: Honolulu, Hawaii, USA
Editors: Jordan Lachler, Godfred Agyapong, Antti Arppe, Sarah Moeller, Aditi Chaudhary, Shruti Rijhwani, Daisy Rosenblum
Venues: ComputEL | WS
Publisher: Association for Computational Linguistics
Pages: 120–127
URL: https://preview.aclanthology.org/ingest-emnlp/2025.computel-main.13/
Cite (ACL): Milind Agarwal and Antonios Anastasopoulos. 2025. AILLA-OCR: A First Textual and Structural Post-OCR Dataset for 8 Indigenous Languages of Latin America. In Proceedings of the Eight Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 120–127, Honolulu, Hawaii, USA. Association for Computational Linguistics.
Cite (Informal): AILLA-OCR: A First Textual and Structural Post-OCR Dataset for 8 Indigenous Languages of Latin America (Agarwal & Anastasopoulos, ComputEL 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.computel-main.13.pdf
