The olmOCR Project: Building Fully Open OCR using VLMs

Jake Poznanski, Kyle Lo, Luca Soldaini


Abstract
We present olmOCR, a fully open OCR system developed through iterative public releases and community feedback. The system is built on a 7B vision-language model trained in two stages: supervised finetuning on 260K diverse PDF pages, followed by reinforcement learning with visual unit tests over synthetic documents. Visual unit tests are binary checks of structural fidelity, covering elements such as tables and equations, and serve both as an interpretable evaluation framework and as direct optimization targets. We also introduce olmOCR-Bench, a benchmark of 1.4K challenging PDFs evaluated via visual unit tests, on which olmOCR achieves state-of-the-art performance among open systems and proprietary APIs at a fraction of the cost. We have deployed olmOCR at scale to 100M+ PDFs to curate pretraining data for Olmo 3. We share lessons from our open development process and release all models, data, and code across two major releases.
Anthology ID:
2026.acl-demo.62
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Greg Durrett, Ping Jian
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
626–635
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-demo.62/
Cite (ACL):
Jake Poznanski, Kyle Lo, and Luca Soldaini. 2026. The olmOCR Project: Building Fully Open OCR using VLMs. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 626–635, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
The olmOCR Project: Building Fully Open OCR using VLMs (Poznanski et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-demo.62.pdf