Beyond Atomic Characters: Glyph-Aware Sub-character Alignment for Low-Resource Multilingual OCR

Mengxiao Zhu, Haixu Chen, Jiu Sha, Jie Liu, Ge Shi


Abstract
Low-resource multilingual OCR faces a dual challenge: complex script structures and severe data scarcity. In such settings, existing OCR models often struggle, as coarse visual representations combined with weak linguistic priors lead to frequent errors among visually similar characters. To address this, we present BASA (Beyond Atomic Sub-character Alignment), an OCR framework built upon high-resolution visual and language backbones with a novel glyph-aware interface. The core technical contribution is the Glyph-Aware Fine-grained Adapter (GAFA). Unlike standard linear projectors, GAFA employs learnable glyph prototypes to actively align sub-character structural primitives (e.g., strokes and radicals) with visual features, explicitly resolving topological ambiguities during vision–language alignment. To complement this, we introduce a two-stage curriculum learning strategy supported by a Glyph-Aware Reverse Synthesis pipeline, which generates large-scale multilingual training corpora with automatic, zero-cost component labels. Furthermore, we construct BASA-Bench, a representative benchmark spanning 11 languages with diverse script structures and 23 authentic scenarios. Experiments demonstrate that BASA achieves consistent improvements over strong OCR baselines, particularly on scripts with complex compositions. Our model and benchmark will be available at https://github.com/NcutLLM/BASA.
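The abstract does not say how GAFA computes its alignment, so the following is only an illustration of the general idea of matching prototypes against visual features, not the authors' method. It is a minimal sketch that assumes scaled dot-product attention with prototypes as queries over patch features. The function name `prototype_pool`, the toy vectors, and the fixed (rather than learned) prototypes are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def prototype_pool(prototypes, patches):
    """Attend from each prototype over all patch features.

    Hypothetical illustration: each prototype acts as a query, each patch
    feature as both key and value. Returns one pooled vector per prototype
    plus the attention weights that produced it.
    """
    dim = len(patches[0])
    scale = math.sqrt(dim)
    pooled, weights = [], []
    for proto in prototypes:
        w = softmax([dot(proto, patch) / scale for patch in patches])
        weights.append(w)
        pooled.append(
            [sum(wi * patch[d] for wi, patch in zip(w, patches)) for d in range(dim)]
        )
    return pooled, weights


# Toy data (assumed): three 2-d "patch" features and two 2-d "prototypes".
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prototypes = [[4.0, 0.0], [0.0, 4.0]]

pooled, weights = prototype_pool(prototypes, patches)
for i, w in enumerate(weights):
    print(f"prototype {i} attention:", [round(x, 3) for x in w])
```

In a trained model the prototypes would be learnable parameters updated by gradient descent; here they are fixed constants so that the attention pattern is easy to inspect by hand.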
Anthology ID:
2026.acl-long.1392
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
30169–30185
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1392/
Cite (ACL):
Mengxiao Zhu, Haixu Chen, Jiu Sha, Jie Liu, and Ge Shi. 2026. Beyond Atomic Characters: Glyph-Aware Sub-character Alignment for Low-Resource Multilingual OCR. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30169–30185, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Beyond Atomic Characters: Glyph-Aware Sub-character Alignment for Low-Resource Multilingual OCR (Zhu et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1392.pdf
Checklist:
 2026.acl-long.1392.checklist.pdf