Dhruvkumar Babubhai Kakadiya


2026

The widespread use of Romanized text for Indian languages, particularly on social media platforms, poses significant challenges for natural language processing due to the lack of standardized orthography and the presence of contextual ambiguities. In this study, we explore sentence-level back-transliteration for 13 Indian languages, focusing on addressing the limitations of word-level models that fail to capture contextual dependencies. We evaluate state-of-the-art models, including fine-tuned LLaMA, mT5, and Multilingual Transformer models, comparing their performance against the baseline IndicXlit model. In addition, we conduct a comprehensive error analysis to gain deeper insights into model performance. Our results demonstrate that fine-tuned LLaMA and the proposed IndiXform model, specifically designed to leverage sentence-level context, significantly outperform zero-shot LLaMA and the IndicXlit baseline. These findings provide valuable insights into handling contextual ambiguities and enhancing the accuracy of back-transliteration systems for Indian languages.
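The contextual ambiguity the abstract describes can be illustrated with a toy example (not from the paper): in Romanized Hindi the token "mein" is commonly used for both में ("in") and मैं ("I"), so a word-level mapper that commits to a single mapping is guaranteed to be wrong in one of the two contexts. The tiny dictionary and sentence below are illustrative assumptions.

```python
# Toy illustration (not the paper's model): why word-level back-transliteration
# fails on ambiguous Romanized Hindi tokens. "mein" can stand for में ("in")
# or मैं ("I") depending on sentence context.

# A word-level model must commit to exactly one output per token.
WORD_LEVEL = {
    "mein": "में",
    "ghar": "घर",
    "hoon": "हूँ",
}

def word_level_transliterate(sentence: str) -> str:
    """Back-transliterate token by token, ignoring sentence context."""
    return " ".join(WORD_LEVEL.get(tok, tok) for tok in sentence.split())

# "mein ghar mein hoon" ≈ "I am in the house": the first "mein" should be
# मैं (I) and the second में (in), but the context-free mapper emits में twice.
print(word_level_transliterate("mein ghar mein hoon"))
```

Resolving the first token correctly requires conditioning on the whole sentence, which is exactly what the sentence-level models evaluated in the study are designed to do.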

2025

The increasing use of Romanized typing for Indo-Aryan languages on social media poses challenges due to its lack of standardization and loss of linguistic richness. To address this, we propose a sentence-level back-transliteration approach using the LLaMA 3.1 model for Hindi. By fine-tuning on the Dakshina dataset, our approach effectively resolves ambiguities in Romanized Hindi text, offering a robust solution for converting it into the native Devanagari script.
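The fine-tuning setup described above can be sketched as turning Dakshina-style (Romanized, Devanagari) sentence pairs into instruction-style training records for a LLaMA-family model. The prompt template, field names, and example pair below are illustrative assumptions, not the paper's actual format.

```python
# Hypothetical sketch: packing Dakshina-style sentence pairs into
# prompt/completion records for instruction fine-tuning. The template and
# record fields are assumptions for illustration only.

PROMPT_TEMPLATE = (
    "Convert the following Romanized Hindi sentence to Devanagari script.\n"
    "Romanized: {roman}\n"
    "Devanagari: "
)

def make_training_example(roman: str, devanagari: str) -> dict:
    """Pack one (Romanized, Devanagari) sentence pair into a training record."""
    return {
        "prompt": PROMPT_TEMPLATE.format(roman=roman),
        "completion": devanagari,
    }

# Example pair (illustrative, not drawn from the Dakshina dataset itself).
example = make_training_example("main ghar ja raha hoon", "मैं घर जा रहा हूँ")
print(example["prompt"] + example["completion"])
```

At inference time the same prompt would be fed to the fine-tuned model, and the generated continuation taken as the Devanagari output; the completion is supervised during fine-tuning while the prompt is held fixed.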