Telem Joyson Singh


2026

Bilingual Dictionary Induction (BDI) presents significant challenges for distant language pairs, owing to the non-isomorphic nature and complexity of their linguistic structures. This paper systematically evaluates unsupervised, supervised fine-tuning, and few-shot prompting approaches to BDI with Large Language Models (LLMs) on a diverse set of distant language pairs. The unsupervised approach probes the inherent multilingual capabilities of LLMs without fine-tuning; the supervised approach uses extensive labeled datasets to train models explicitly for BDI; and few-shot prompting supplies a small number of translation examples in the prompt to elicit accurate responses through in-context learning. Our experimental results show that 5-shot prompting outperforms the unsupervised and zero-shot settings in all cases and surpasses the supervised setting in 82.86% of cases. Few-shot prompting is robust to overfitting, exploiting the in-context learning and multilingual capabilities of LLMs, and is particularly effective for target-to-source translation, even for morphologically complex language pairs. However, few-shot prompting with LLMs such as Llama remains ineffective for morphologically rich pairs like En-Mn and En-Ta in source-to-target BDI. These findings suggest that few-shot prompting is a cost-effective and powerful alternative for BDI, with future work needed to improve performance on morphologically rich pairs.
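A minimal sketch of how such a 5-shot BDI prompt might be assembled. The seed word pairs, language names, and prompt template below are illustrative assumptions, not the prompts used in the paper, and the target-language glosses are placeholders that may be inaccurate.

```python
# Hedged sketch: build a k-shot prompt for bilingual dictionary
# induction (BDI). Everything below (template, seed pairs) is
# illustrative, not the paper's actual setup.

def build_bdi_prompt(seed_pairs, query_word, src_lang, tgt_lang):
    """Format k seed translation pairs followed by one query word."""
    lines = [f"Translate {src_lang} words into {tgt_lang}."]
    for src, tgt in seed_pairs:
        lines.append(f"{src_lang}: {src} -> {tgt_lang}: {tgt}")
    # Leave the target slot empty for the LLM to complete.
    lines.append(f"{src_lang}: {query_word} -> {tgt_lang}:")
    return "\n".join(lines)

# Placeholder seed dictionary (glosses are illustrative only).
seed = [("water", "ising"), ("fire", "mei"), ("house", "yum"),
        ("dog", "hui"), ("sun", "numit")]
prompt = build_bdi_prompt(seed, "moon", "English", "Manipuri")
print(prompt)
```

The completed prompt would then be sent to an LLM, whose single-word completion is taken as the induced dictionary entry.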

In India, the official language for writing judgments in the higher courts is English, which creates a language barrier for citizens who are not proficient in English. Machine Translation (MT) offers a scalable solution, but progress for low-resource languages like Assamese is severely limited by the lack of legal-domain data. To address this gap, we introduce a first-of-its-kind English-Assamese parallel corpus for translating Indian court judgments. The dataset consists of over 55,000 manually translated and validated sentence pairs drawn from more than 500 judgments of the Gauhati High Court and the Supreme Court of India. Using this dataset, we comprehensively evaluate state-of-the-art multilingual models, including NLLB-200 and Sarvam-Translate, in both zero-shot and fine-tuned settings, and compare their performance against commercial systems. Our experiments show that fine-tuning on our legal-domain dataset significantly improves translation quality. A thorough error analysis highlights key challenges in legal translation: precisely translating legal terminology, correctly transliterating named entities, expanding abbreviations, and restructuring sentences, such as converting passive voice to active voice, when translating from English to Assamese. By releasing a publicly available dataset and analyzing these specific challenges, this work offers a reproducible foundation and a clear path toward more accurate and reliable legal machine translation systems, helping improve access to justice for Assamese speakers.

2025

Large language models (LLMs) have transformed machine translation, yet they suffer from high subword fertility for low-resource languages, which slows inference and increases cost. While vocabulary expansion via continual pre-training is a common remedy, it often degrades translation quality and requires large target-language corpora that are unavailable for truly low-resource languages. To address this, we investigate tokenization efficiency through an information-theoretic lens, building on the established hypothesis that word length correlates with information content. From this perspective, we characterize tokenization inefficiency as high fertility on low-information (highly predictable) words. Guided by this principle, we introduce a novel fine-tuning strategy that systematically identifies informationally redundant words, those with high fertility but low information content, for targeted vocabulary expansion and model fine-tuning. Experiments fine-tuning BLOOM and LLaMA-3 on English-Manipuri and two other language pairs show that our method significantly reduces fertility by 50% and accelerates inference more than twofold, while matching, and often exceeding, the translation quality of standard LLM baselines, providing a theoretically grounded solution for efficient LLM-based MT.
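The selection criterion, high fertility combined with low information content, can be sketched as follows. The toy character-chunk tokenizer, the unigram-surprisal estimate of information content, and the thresholds are all illustrative assumptions standing in for a real LLM tokenizer (e.g. SentencePiece) and the paper's actual scoring.

```python
import math
from collections import Counter

def tokenize(word):
    # Toy subword splitter standing in for a real tokenizer:
    # fixed chunks of up to 3 characters.
    return [word[i:i + 3] for i in range(0, len(word), 3)]

def redundant_words(corpus_words, fert_min=2, surprisal_max=8.0):
    """Flag words with high fertility (many subwords) but low
    information content (low unigram surprisal, i.e. frequent,
    predictable words) as vocabulary-expansion candidates."""
    counts = Counter(corpus_words)
    total = sum(counts.values())
    flagged = []
    for word, count in counts.items():
        fertility = len(tokenize(word))          # subwords per word
        surprisal = -math.log2(count / total)    # -log2 p(word), bits
        if fertility >= fert_min and surprisal <= surprisal_max:
            flagged.append((word, fertility, surprisal))
    return flagged

# A very frequent long word is flagged; a short one is not.
words = ["anastomosis"] * 50 + ["a"] * 50
candidates = redundant_words(words)
```

Each flagged word would then be added to the vocabulary as a single token before fine-tuning, which is what reduces fertility at inference time.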

2023

Machine translation of Indian languages is challenging due to several factors, including linguistic diversity, limited parallel data, language divergence, and complex morphology. Recently, large pre-trained multilingual models have shown promise in improving translation quality. In this paper, we conduct a large-scale study of applying large pre-trained models to English-Indic machine translation through transfer learning across languages and domains. The study systematically evaluates the practical gains these models can provide and analyzes their capabilities for Indian-language translation via transfer learning. Specifically, we experiment with several models, including Meta's mBART, mBART-many-to-many, NLLB-200, and M2M-100, and Google's mT5. These models are fine-tuned on small, high-quality English-Indic parallel data across languages and domains. Our findings show that adapting large pre-trained models to particular languages by fine-tuning improves translation quality across the Indic languages, even for languages unseen during pre-training, and that domain adaptation through continued fine-tuning yields further gains. Our study provides insights into utilizing large pre-trained models to address the distinct challenges of Indian-language MT.