Zhengyang Lu


2026

Chinese historical documents encode millennia of cultural heritage, yet remain largely inaccessible to computational analysis. While multimodal large language models (MLLMs) have achieved strong performance on modern document OCR, their application to historical Chinese texts suffers from severe hallucinations, character fabrication, uncontrolled repetition, and semantic drift. We identify the root cause as visual-textual misalignment: models prioritize linguistic priors over visual evidence, a tendency that is particularly problematic when archaic orthography and degraded image quality destabilize cross-modal correspondences. To address this, we propose HisDoc-OCR, which restores visual grounding through three synergistic strategies: (1) Layout Injection, which encodes two-dimensional layout structures into textual outputs using layout-aware delimiters; (2) First-Occurrence Boost, which emphasizes vision-dependent characters during training by reweighting first-occurrence characters; and (3) Self-Distilled Attention Focusing, which guides the model’s attention by distilling patterns from the most focused layer to the remaining layers. Extensive experiments demonstrate that HisDoc-OCR consistently outperforms both general-purpose and OCR-specific MLLMs. The code will be made publicly available.
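To make the First-Occurrence Boost idea concrete, the following is a minimal sketch of one way per-character loss weights could be built so that the first appearance of each character in a transcription counts more than its repeats. The abstract does not specify the weighting scheme, so the function names, the `boost` value of 2.0, and the weighted-mean normalization are all assumptions made for illustration, not the authors' implementation.

```python
def first_occurrence_weights(tokens, boost=2.0, base=1.0):
    """Return one loss weight per token.

    A token seen for the first time in the sequence gets `boost`;
    every later repeat gets `base`. Both values are hypothetical.
    """
    seen = set()
    weights = []
    for tok in tokens:
        if tok in seen:
            weights.append(base)
        else:
            seen.add(tok)
            weights.append(boost)
    return weights


def weighted_mean_loss(per_token_losses, weights):
    """Combine per-token losses into a single weighted average."""
    if len(per_token_losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(l * w for l, w in zip(per_token_losses, weights))
    return weighted_sum / total_weight


# Example: the third character repeats the first, so it keeps the base weight.
example_weights = first_occurrence_weights(list("天地天人"))
```

In a real training loop the per-token losses would come from the model's cross-entropy over the target transcription; here they are plain floats so that the reweighting step can be read in isolation.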