TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction

Chengye Wang, Lin Fu, Zexi Kuang, Yilun Zhao


Abstract
Existing document OCR largely targets plain text or Markdown, discarding the structural and executable properties that make LaTeX essential for scientific publishing. We study page-level reconstruction of scientific PDFs into compilable LaTeX and introduce TexOCR-Bench, a benchmark, and TexOCR-Train, a large-scale training corpus, for this task. TexOCR-Bench features a multi-dimensional evaluation suite that jointly assesses transcription fidelity, structural faithfulness, and end-to-end compilability. Leveraging TexOCR-Train, we train a 2B-parameter model, TexOCR, using supervised fine-tuning (SFT) and reinforcement learning (RL) with verifiable rewards derived from LaTeX unit tests that directly enforce compilability and referential integrity. Experiments across 21 frontier models on TexOCR-Bench show that existing systems frequently violate key document invariants, including consistent section structure, correct float placement, and valid label–reference links, which undermines compilation reliability and downstream usability. Our analysis further reveals that RL with verifiable rewards yields consistent improvements over SFT alone, particularly on structural and compilation metrics.
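The abstract mentions verifiable rewards derived from LaTeX unit tests that enforce compilability and referential integrity, but does not spell out the tests themselves. The following is only a minimal illustrative sketch of what static checks of this kind might look like, not the authors' implementation. All function names (`dangling_refs`, `duplicate_labels`, `environments_balanced`, `structural_reward`) and the equal-weight scoring are assumptions made for illustration. The sketch does not invoke a LaTeX engine, so it does not test actual compilation, and it ignores comments and verbatim blocks.

```python
import re
from collections import Counter

# Hypothetical static checks; the paper's actual unit tests are not described here.
LABEL_RE = re.compile(r"\\label\{([^}]*)\}")
REF_RE = re.compile(r"\\(?:ref|eqref|autoref|cref|Cref|pageref)\{([^}]*)\}")
ENV_RE = re.compile(r"\\(begin|end)\{([^}]*)\}")


def dangling_refs(tex: str) -> list[str]:
    """Return referenced keys that have no matching \\label in the source."""
    labels = set(LABEL_RE.findall(tex))
    refs = []
    for group in REF_RE.findall(tex):
        # \cref{a,b} style commands may carry several comma-separated keys.
        refs.extend(key.strip() for key in group.split(","))
    return sorted({r for r in refs if r not in labels})


def duplicate_labels(tex: str) -> list[str]:
    """Return label keys that are defined more than once."""
    counts = Counter(LABEL_RE.findall(tex))
    return sorted(key for key, n in counts.items() if n > 1)


def environments_balanced(tex: str) -> bool:
    """Check that every \\begin{env} is closed by a matching \\end{env}."""
    stack = []
    for kind, name in ENV_RE.findall(tex):
        if kind == "begin":
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack


def structural_reward(tex: str) -> float:
    """Fraction of checks passed (equal weighting is an assumption)."""
    checks = [
        not dangling_refs(tex),
        not duplicate_labels(tex),
        environments_balanced(tex),
    ]
    return sum(checks) / len(checks)


good = r"\begin{figure}\label{fig:a}\end{figure} See \ref{fig:a}."
bad = r"\begin{figure}\label{fig:a} See \ref{fig:b}."
print(structural_reward(good), dangling_refs(bad))
```

In this toy example, `good` passes every check, while `bad` has an unclosed `figure` environment and a reference to the undefined key `fig:b`. A full compilability test would additionally need to run a LaTeX engine on the output, which this sketch deliberately leaves out.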
Anthology ID:
2026.acl-long.1658
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
35821–35845
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1658/
Cite (ACL):
Chengye Wang, Lin Fu, Zexi Kuang, and Yilun Zhao. 2026. TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 35821–35845, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction (Wang et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1658.pdf
Checklist:
 2026.acl-long.1658.checklist.pdf