SubmissionNumber#=%=#23
FinalPaperTitle#=%=#CoRSAL-OCR: Evaluating Zero-Shot OCR for Language Archive Materials
ShortPaperTitle#=%=#
NumberOfPages#=%=#11
CopyrightSigned#=%=#Luke Gessler
JobTitle#==#
Organization#==#
Abstract#==#Language archives contain valuable linguistic materials that are undigitized and therefore difficult to access. Modern optical character recognition (OCR) systems have great potential to make these collections more accessible, but few evaluations assess OCR quality specifically on language archive materials. We present CoRSAL-OCR, an OCR evaluation dataset of over 200 document pages with gold-standard transcriptions in two South Asian languages: Bodo (written in Devanagari) and Garo (written in Latin script). Using this dataset together with the 8-language AILLA-OCR benchmark, we evaluate four OCR systems: Tesseract, Google Cloud Vision, Gemini 3 Flash, and Qwen3.5-27B (an open-weight model). We find that vision language models (VLMs), when given appropriate prompts, achieve the lowest error rates on these datasets. However, prompt design has a large effect on VLM performance: a detailed generic prompt reduces character error rate (CER) by up to a factor of six relative to a minimal prompt. We release our dataset at https://github.com/larc-iu/corsal-ocr to support further research on OCR for language archives.
Author{1}{Firstname}#=%=#Luke
Author{1}{Lastname}#=%=#Gessler
Author{1}{Username}#=%=#lgessler
Author{1}{Orcid}#=%=#https://orcid.org/0000-0002-4996-9045
Author{1}{Email}#=%=#lgessler@iu.edu
Author{1}{Affiliation}#=%=#Indiana University Bloomington
Author{2}{Firstname}#=%=#Andrew
Author{2}{Lastname}#=%=#Haynes
Author{2}{Orcid}#=%=#
Author{2}{Email}#=%=#drew.naoki@gmail.com
Author{2}{Affiliation}#=%=#The Woodlands College Park High School

==========
èéáğö