Samarth P


2025

Nayana OCR: A Scalable Framework for Document OCR in Low-Resource Languages
Adithya Kolavi | Samarth P | Vyoman Jain
Proceedings of the 1st Workshop on Language Models for Underserved Communities (LM4UC 2025)

We introduce Nayana, a scalable and efficient framework for adapting Vision-Language Models (VLMs) to low-resource languages. Despite significant advances, modern VLMs remain constrained by the scarcity of training data in non-English languages, limiting their global applicability. Our framework addresses this fundamental challenge through a novel layout-aware synthetic data generation pipeline combined with parameter-efficient adaptation techniques. Instead of requiring extensive manually annotated datasets, Nayana enables existing models to learn new languages effectively using purely synthetic data. Using Low-Rank Adaptation (LoRA), we demonstrate this capability across ten Indic languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, and Telugu. Through extensive experiments on OCR tasks, we show that models can achieve strong performance in new languages without the traditional requirements of large-scale annotated datasets or extensive model modifications. Nayana’s success in adapting VLMs to new languages with synthetic data establishes a practical pathway for extending AI capabilities to underserved languages, particularly in scenarios where annotated data is scarce or unavailable.
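To make the parameter-efficient adaptation concrete, here is a minimal sketch of attaching LoRA adapters to a pretrained VLM with Hugging Face PEFT. The checkpoint id, rank, scaling factor, and target modules below are illustrative assumptions, not Nayana's reported configuration.

```python
# Minimal LoRA-adaptation sketch (illustrative; not the paper's exact setup).
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint id; the abstract does not name the base VLM.
base = AutoModelForVision2Seq.from_pretrained("org/base-vlm-checkpoint")

lora_cfg = LoraConfig(
    r=16,                                 # low-rank dimension (assumed)
    lora_alpha=32,                        # LoRA scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small rank-r matrices are trained
```

Because only the low-rank adapter matrices receive gradients, one plausible way to scale this setup to the ten Indic languages is to train a lightweight adapter per language on synthetic page renderings while the base model stays frozen and shared.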

The Gemma Sutras: Fine-Tuning Gemma 3 for Sanskrit Sandhi Splitting
Samarth P | Sanjay Balaji Mahalingam
Proceedings of the 9th Widening NLP Workshop

Sandhi, the phonological merging of morphemes, is a central feature of Sanskrit grammar. While Sandhi formation is well-defined by Pāṇini’s Aṣṭādhyāyī, the reverse task—Sandhi splitting—is substantially more complex due to inherent ambiguity and context-sensitive transformations. Accurate splitting is a critical precursor to tokenization in Sanskrit, which lacks explicit word boundaries and presents densely fused compounds. In this work, we present a data-driven approach, fine-tuning the Gemma-3 4B large language model on a dataset of over 49,000 training and 2,000 test examples of compound words and their morpheme-level decompositions. Leveraging the Unsloth framework with low-rank adaptation (LoRA) and 4-bit quantization, we train the model to predict these splits. Our work yields a scalable, Sandhi-aware system designed to enhance modern NLP pipelines for classical Sanskrit, demonstrating an effective application of LLMs to this linguistic challenge.
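As a rough illustration of the training setup the abstract describes (Unsloth, LoRA, 4-bit quantization), the sketch below loads a 4-bit Gemma-3 4B checkpoint and attaches LoRA adapters. The checkpoint id, sequence length, LoRA rank, and the sandhi example are assumptions for illustration, not the paper's reported configuration.

```python
# Sketch of Unsloth + LoRA + 4-bit fine-tuning (hyperparameters assumed).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it",  # assumed checkpoint id
    max_seq_length=512,
    load_in_4bit=True,                   # 4-bit quantization, as in the paper
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                # LoRA rank (assumed)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# One supervised pair: fused compound -> morpheme-level split (illustrative).
example = {"prompt": "Split: vidyālayaḥ", "completion": "vidyā + ālayaḥ"}
```

Training on the roughly 49,000 decomposition pairs would then proceed with a standard supervised fine-tuning loop (e.g. TRL's SFTTrainer), with the morpheme-level split string as the target sequence.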