HisDoc-OCR: Restoring Visual Grounding in MLLMs for Chinese Historical Document OCR

Jiahuan Cao, Yongxin Shi, Zeyu Shan, Zhengyang Lu, Lianwen Jin


Abstract
Chinese historical documents encode millennia of cultural heritage, yet remain largely inaccessible to computational analysis. While multimodal large language models (MLLMs) have achieved strong performance on modern document OCR, their application to historical Chinese texts suffers from severe hallucinations, character fabrication, uncontrolled repetition, and semantic drift. We identify the root cause as visual-textual misalignment: models prioritize linguistic priors over visual evidence, which is particularly problematic when archaic orthography and degraded image quality destabilize cross-modal correspondences. To address this, we propose HisDoc-OCR, which restores visual grounding through three synergistic strategies: (1) Layout Injection, which encodes two-dimensional layout structures into textual outputs using layout-aware delimiters; (2) First-Occurrence Boost, which emphasizes vision-dependent characters during training by reweighting first-occurrence characters; and (3) Self-Distilled Attention Focusing, which guides the model’s attention by distilling patterns from the most focused layer to the remaining layers. Extensive experiments demonstrate that HisDoc-OCR consistently outperforms general-purpose and OCR-specific MLLMs. The code will be publicly available.
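As an illustration only, the idea behind First-Occurrence Boost can be sketched as a per-character loss reweighting. The abstract does not give the weighting formula, so everything specific here is an assumption rather than the authors' implementation: the function names `first_occurrence_weights` and `weighted_mean_loss`, the `boost` and `base` values, and the choice to normalize by the sum of weights are all hypothetical. The sketch treats a character as "first-occurrence" when it has not yet appeared earlier in the same target sequence.

```python
from typing import List, Sequence


def first_occurrence_weights(chars: Sequence[str], boost: float = 2.0,
                             base: float = 1.0) -> List[float]:
    """Return one loss weight per target character.

    A character seen for the first time in the sequence gets `boost`;
    later repeats get `base`. Both values are illustrative assumptions,
    not figures taken from the paper.
    """
    seen = set()
    weights = []
    for ch in chars:
        if ch in seen:
            weights.append(base)
        else:
            seen.add(ch)
            weights.append(boost)
    return weights


def weighted_mean_loss(token_losses: Sequence[float],
                       weights: Sequence[float]) -> float:
    """Weight-normalized average of per-character losses (hypothetical reduction)."""
    if len(token_losses) != len(weights):
        raise ValueError("token_losses and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(loss * w for loss, w in zip(token_losses, weights)) / total


if __name__ == "__main__":
    target = "天地天"
    w = first_occurrence_weights(target)
    # Per-character losses would normally come from the model; these are made up.
    print(w, weighted_mean_loss([1.0, 1.0, 4.0], w))
```

In a real training loop the per-character losses would come from an unreduced cross-entropy over the model's output tokens, and the weights would be computed over token IDs rather than raw characters; how the paper handles either detail is not stated in the abstract.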
Anthology ID:
2026.findings-acl.301
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6053–6066
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.301/
Cite (ACL):
Jiahuan Cao, Yongxin Shi, Zeyu Shan, Zhengyang Lu, and Lianwen Jin. 2026. HisDoc-OCR: Restoring Visual Grounding in MLLMs for Chinese Historical Document OCR. In Findings of the Association for Computational Linguistics: ACL 2026, pages 6053–6066, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
HisDoc-OCR: Restoring Visual Grounding in MLLMs for Chinese Historical Document OCR (Cao et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.301.pdf
Checklist:
 2026.findings-acl.301.checklist.pdf