CoRSAL-OCR: Evaluating Zero-Shot OCR for Language Archive Materials

Luke Gessler; Andrew Haynes

Abstract

Language archives contain valuable linguistic materials that are undigitized and therefore difficult to access. Modern optical character recognition (OCR) systems have great potential to make these collections more accessible, but few evaluations assess the quality of OCR systems specifically on language archive materials. We present CoRSAL-OCR, an OCR evaluation dataset of over 200 document pages with gold-standard transcriptions from two South Asian languages: Bodo (written in Devanagari) and Garo (written in Latin script). Using this dataset together with the 8-language AILLA-OCR benchmark, we evaluate four OCR systems: Tesseract, Google Cloud Vision, Gemini 3 Flash, and Qwen3.5-27B (an open-weight model). We find that vision language models (VLMs), when given appropriate prompts, achieve the lowest error rates on these datasets. However, prompt design has a large effect on VLM performance: a detailed generic prompt reduces character error rate (CER) by up to a factor of six compared to a minimal prompt. We release our dataset at https://github.com/larc-iu/corsal-ocr to support further research on OCR for language archives.
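
For readers unfamiliar with the metric, character error rate is conventionally the character-level edit distance between a system transcription and the gold transcription, divided by the length of the gold transcription. The sketch below illustrates that conventional definition only. The function names `levenshtein` and `cer` are hypothetical, and the code counts Unicode code points without any normalization, which is an assumption: the abstract does not state how the authors normalize text or segment Devanagari, so their scores may be computed differently.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance over Unicode code points (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(ref, hyp) / len(ref)
```

Under this definition a hypothesis much longer than the reference can score above 1.0, and a six-fold reduction means, for example, going from a CER of 0.30 to 0.05.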

Anthology ID: 2026.computel-1.14
Volume: Proceedings of the Ninth Workshop on the Use of Computational Methods in the Study of Endangered Languages (ComputEL-9)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Godfred Agyapong, Sarah Moeller, Antti Arppe, Ali Marashian, Daisy Rosenblum
Venues: ComputEL | WS
Publisher: Association for Computational Linguistics
Pages: 125–135
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.computel-1.14/
Cite (ACL): Luke Gessler and Andrew Haynes. 2026. CoRSAL-OCR: Evaluating Zero-Shot OCR for Language Archive Materials. In Proceedings of the Ninth Workshop on the Use of Computational Methods in the Study of Endangered Languages (ComputEL-9), pages 125–135, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): CoRSAL-OCR: Evaluating Zero-Shot OCR for Language Archive Materials (Gessler & Haynes, ComputEL 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.computel-1.14.pdf
Supplementary material: 2026.computel-1.14.SupplementaryMaterial.txt
