SubmissionNumber#=%=#23
FinalPaperTitle#=%=#CoRSAL-OCR: Evaluating Zero-Shot OCR for Language Archive Materials
ShortPaperTitle#=%=#
NumberOfPages#=%=#11
CopyrightSigned#=%=#Luke Gessler
JobTitle#==#
Organization#==#
Abstract#==#Language archives contain valuable linguistic materials that are undigitized and therefore difficult to access. Modern optical character recognition (OCR) systems have great potential to make these collections more accessible, but few evaluations assess the quality of an OCR system specifically for language archive materials. We present CoRSAL-OCR, an OCR evaluation dataset of over 200 document pages with gold-standard transcriptions from two South Asian languages: Bodo (written in Devanagari) and Garo (written in Latin script). Using this dataset together with the 8-language AILLA-OCR benchmark, we evaluate four OCR systems: Tesseract, Google Cloud Vision, Gemini 3 Flash, and Qwen3.5-27B (an open-weight model). We find that vision language models (VLMs), when given appropriate prompts, achieve the lowest error rates on these datasets. However, prompt design has a large effect on VLM performance, with a detailed generic prompt reducing character error rate (CER) by up to a factor of six compared to a minimal prompt. We release our dataset at https://github.com/larc-iu/corsal-ocr to support further research on OCR for language archives.
Author{1}{Firstname}#=%=#Luke
Author{1}{Lastname}#=%=#Gessler
Author{1}{Username}#=%=#lgessler
Author{1}{Orcid}#=%=#https://orcid.org/0000-0002-4996-9045
Author{1}{Email}#=%=#lgessler@iu.edu
Author{1}{Affiliation}#=%=#Indiana University Bloomington
Author{2}{Firstname}#=%=#Andrew
Author{2}{Lastname}#=%=#Haynes
Author{2}{Orcid}#=%=#
Author{2}{Email}#=%=#drew.naoki@gmail.com
Author{2}{Affiliation}#=%=#The Woodlands College Park High School
==========
èéáğö