Nayana OCR: A Scalable Framework for Document OCR in Low-Resource Languages

Adithya Kolavi; Samarth P; Vyoman Jain

Abstract

We introduce Nayana, a scalable and efficient framework for adapting Vision-Language Models (VLMs) to low-resource languages. Despite significant advances, modern VLMs remain constrained by the scarcity of training data in non-English languages, limiting their global applicability. Our framework addresses this fundamental challenge through a novel layout-aware synthetic data generation pipeline combined with parameter-efficient adaptation techniques. Instead of requiring extensive manually annotated datasets, Nayana enables existing models to learn new languages effectively using purely synthetic data. Using Low-Rank Adaptation (LoRA), we demonstrate this capability across ten Indic languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, and Telugu. Through extensive experiments in OCR tasks, we show that models can achieve strong performance in new languages without the traditional requirements of large-scale annotated datasets or extensive model modifications. Nayana’s success in adapting VLMs to new languages with synthetic data establishes a practical pathway for extending AI capabilities to underserved languages, particularly in scenarios where annotated data is scarce or unavailable.
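The abstract names Low-Rank Adaptation (LoRA) without spelling out the mechanics. As a rough illustration of the general technique, and not the authors' implementation, the sketch below shows how a frozen weight matrix W can be augmented with a trainable low-rank update, giving an effective weight W + (alpha / r) · B · A. The matrix sizes, the values, and the helper names (`matmul`, `matvec`, `lora_weight`) are hypothetical and chosen only to keep the example small.

```python
# Minimal LoRA sketch in pure Python (illustrative only; not the Nayana code).
# A frozen weight W (d_out x d_in) is augmented with a trainable low-rank
# update B @ A scaled by alpha / r, where A is (r x d_in) and B is (d_out x r).

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    """Multiply matrix M by vector v."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def lora_weight(W, A, B, alpha):
    """Return the effective weight W + (alpha / r) * B @ A."""
    r = len(A)                      # rank = number of rows of A
    scale = alpha / r
    delta = matmul(B, A)            # d_out x d_in low-rank update
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Hypothetical toy sizes: d_out = 2, d_in = 3, rank r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen base weight (never updated)
A = [[1.0, 1.0, 1.0]]                    # trainable, r x d_in
B_zero = [[0.0], [0.0]]                  # common LoRA init: B = 0, so the update is zero
B_trained = [[0.5], [-0.5]]              # stand-in for B after some training

x = [1.0, 2.0, 3.0]
y_base = matvec(W, x)
y_init = matvec(lora_weight(W, A, B_zero, alpha=2.0), x)
y_adapted = matvec(lora_weight(W, A, B_trained, alpha=2.0), x)
```

With B initialised to zero the adapted model reproduces the base model exactly, and only A and B (here 5 values, against 6 in W) would be trained. At realistic layer sizes that gap is what makes the adaptation parameter-efficient.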

Anthology ID: 2025.lm4uc-1.11
Volume: Proceedings of the 1st Workshop on Language Models for Underserved Communities (LM4UC 2025)
Month: May
Year: 2025
Address: Albuquerque, New Mexico
Editor: Duc Nguyen
Venues: LM4UC | WS
Publisher: Association for Computational Linguistics
Pages: 86–103
URL: https://preview.aclanthology.org/fix-sig-urls/2025.lm4uc-1.11/
Cite (ACL): Adithya Kolavi, Samarth P, and Vyoman Jain. 2025. Nayana OCR: A Scalable Framework for Document OCR in Low-Resource Languages. In Proceedings of the 1st Workshop on Language Models for Underserved Communities (LM4UC 2025), pages 86–103, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): Nayana OCR: A Scalable Framework for Document OCR in Low-Resource Languages (Kolavi et al., LM4UC 2025)
PDF: https://preview.aclanthology.org/fix-sig-urls/2025.lm4uc-1.11.pdf
