Zhengyang Lu

2026

HisDoc-OCR: Restoring Visual Grounding in MLLMs for Chinese Historical Document OCR
Jiahuan Cao | Yongxin Shi | Zeyu Shan | Zhengyang Lu | Lianwen Jin
Findings of the Association for Computational Linguistics: ACL 2026

Chinese historical documents encode millennia of cultural heritage, yet remain largely inaccessible to computational analysis. While multimodal large language models (MLLMs) have achieved strong performance on modern document OCR, their application to historical Chinese texts suffers from severe hallucinations, character fabrication, uncontrolled repetition, and semantic drift. We identify the root cause as visual-textual misalignment: models prioritize linguistic priors over visual evidence, particularly problematic when archaic orthography and degraded image quality destabilize cross-modal correspondences. To address this, we propose HisDoc-OCR, which restores visual grounding through three synergistic strategies: (1) Layout Injection, which encodes two-dimensional layout structures into textual outputs using layout-aware delimiters; (2) First-Occurrence Boost, which emphasizes vision-dependent characters during training by reweighting first-occurrence characters; (3) Self-Distilled Attention Focusing, which guides the model’s attention by distilling patterns from the most focused layer to the remaining layers. Extensive experiments demonstrate that HisDoc-OCR consistently outperforms general-purpose and OCR-specific MLLMs. The code will be publicly available.
