Dhruvkumar Babubhai Kakadiya
2026
Sentence-Level Back-Transliteration of Romanized Indian Languages: Performance Analysis and Challenges
Saurabh Kumar | Dhruvkumar Babubhai Kakadiya | Sanasam Ranbir Singh | Sukumar Nandi
Proceedings of the Fifteenth Language Resources and Evaluation Conference
The widespread use of Romanized text for Indian languages, particularly on social media platforms, poses significant challenges for natural language processing due to the lack of standardized orthography and the presence of contextual ambiguities. In this study, we explore sentence-level back-transliteration for 13 Indian languages, focusing on addressing the limitations of word-level models that fail to capture contextual dependencies. We evaluate state-of-the-art models, including fine-tuned LLaMA, mT5, and Multilingual Transformer models, comparing their performance against the baseline IndicXlit model. In addition, we conduct a comprehensive error analysis to gain deeper insights into model performance. Our results demonstrate that fine-tuned LLaMA and the proposed IndiXform model, specifically designed to leverage sentence-level context, significantly outperform zero-shot LLaMA and the IndicXlit baseline. These findings provide valuable insights into handling contextual ambiguities and enhancing the accuracy of back-transliteration systems for Indian languages.
2025
Team IndiDataMiner at IndoNLP 2025: Hindi Back Transliteration - Roman to Devanagari using LLaMa
Saurabh Kumar | Dhruvkumar Babubhai Kakadiya | Sanasam Ranbir Singh
Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages
The increasing use of Romanized typing for Indo-Aryan languages on social media poses challenges due to its lack of standardization and loss of linguistic richness. To address this, we propose a sentence-level back-transliteration approach for Hindi using the LLaMa 3.1 model. By fine-tuning on the Dakshina dataset, our approach effectively resolves ambiguities in Romanized Hindi text, offering a robust solution for converting it into the native Devanagari script.