Jake Poznanski


2026

We present olmOCR, a fully open OCR system developed through iterative public releases and community feedback. The system is built on a 7B vision-language model trained in two stages: supervised finetuning on 260K diverse PDF pages, followed by reinforcement learning with visual unit tests over synthetic documents. Visual unit tests are binary checks of structural fidelity, including tables and equations, and serve both as an interpretable evaluation framework and as direct optimization targets. We also introduce olmOCR-Bench, a benchmark of 1.4K challenging PDFs evaluated via visual unit tests, on which olmOCR achieves state-of-the-art performance among open systems and proprietary APIs at a fraction of the cost. We have deployed olmOCR at scale to 100M+ PDFs to curate pretraining data for Olmo 3. We share lessons from our open development process and release all models, data, and code across two major releases.