2025
An Information-Theoretic Approach to Reducing Fertility in LLMs for Manipuri Machine Translation
Telem Joyson Singh | Ranbir Singh Sanasam | Priyankoo Sarmah
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Large language models (LLMs) have transformed machine translation, yet they suffer from high subword fertility for low-resource languages, which leads to slow inference and increased costs. While vocabulary expansion via continual pre-training is a common solution, it often degrades translation quality and requires large target-language corpora, which are unavailable for truly low-resource languages. To address this, we investigate tokenization efficiency through an information-theoretic lens, building on the established hypothesis that word length correlates with information content. From this perspective, we characterize tokenization inefficiency as high fertility for low-information (highly predictable) words. Guided by this principle, we introduce a novel fine-tuning strategy that systematically identifies informationally redundant words—those with high fertility but low information content—for targeted vocabulary expansion and model fine-tuning. Experiments fine-tuning BLOOM and LLaMA-3 on English-Manipuri and two other language pairs show that our proposed method reduces fertility by 50% and accelerates inference by more than 2x, without compromising, and often exceeding, the translation quality of standard LLM baselines, providing a theoretically grounded solution for efficient LLM-based MT.
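A minimal sketch of the selection step described above, assuming subword count as the fertility measure and unigram surprisal as a stand-in for information content; the scoring rule, function name, threshold, and the BLOOM tokenizer in the usage comment are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch: rank high-fertility, low-information words as
# candidates for vocabulary expansion before fine-tuning.
import math
from collections import Counter


def redundancy_candidates(corpus_words, tokenizer, top_k=100):
    """Return words that fragment into many subwords (high fertility)
    yet are highly predictable (low unigram surprisal)."""
    counts = Counter(corpus_words)
    total = sum(counts.values())
    scored = []
    for word, freq in counts.items():
        fertility = len(tokenizer.tokenize(word))      # subwords per word
        surprisal = -math.log2(freq / total)            # unigram information content (bits)
        if fertility > 1:                                # only words that actually fragment
            # high fertility combined with low surprisal -> informationally redundant
            scored.append((fertility / max(surprisal, 1e-9), word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_k]]


# Illustrative usage (assuming a word-segmented Manipuri corpus):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
# new_vocab = redundancy_candidates(manipuri_words, tok)
# tok.add_tokens(new_vocab)
# model.resize_token_embeddings(len(tok))
```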
2023
Can Big Models Help Diverse Languages? Investigating Large Pretrained Multilingual Models for Machine Translation of Indian Languages
Telem Joyson Singh | Sanasam Ranbir Singh | Priyankoo Sarmah
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
Machine translation of Indian languages is challenging due to several factors, including linguistic diversity, limited parallel data, language divergence, and complex morphology. Recently, large pre-trained multilingual models have shown promise in improving translation quality. In this paper, we conduct a large-scale study on applying large pre-trained models to English-Indic machine translation through transfer learning across languages and domains. This study systematically evaluates the practical gains these models provide and analyzes their capabilities for translating Indian languages via transfer learning. Specifically, we experiment with several models, including Meta’s mBART, mBART-many-to-many, NLLB-200, M2M-100, and Google’s mT5. These models are fine-tuned on small, high-quality English-Indic parallel data across languages and domains. Our findings show that adapting large pre-trained models to particular languages by fine-tuning improves translation quality across the Indic languages, even for languages unseen during pre-training. Domain adaptation through continued fine-tuning further improves results. Our study provides insights into utilizing large pre-trained models to address the distinct challenges of machine translation for Indian languages.
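A minimal sketch of the fine-tuning setup the study describes, using Hugging Face Transformers; the choice of NLLB-200, the English-Manipuri language codes, the toy data, and the hyperparameters are illustrative assumptions rather than the paper's configuration.

```python
# Hypothetical sketch: fine-tune a pre-trained multilingual MT model
# on a small English-Indic parallel set.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          src_lang="eng_Latn", tgt_lang="mni_Beng")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Placeholder parallel data; in practice, a small high-quality corpus.
pairs = [{"en": "Hello, how are you?", "mni": "..."}]
ds = Dataset.from_list(pairs)


def preprocess(example):
    # Tokenize source and target sides together for seq2seq training.
    return tokenizer(example["en"], text_target=example["mni"],
                     truncation=True, max_length=128)


ds = ds.map(preprocess, remove_columns=ds.column_names)

args = Seq2SeqTrainingArguments(output_dir="nllb-en-mni",
                                per_device_train_batch_size=8,
                                learning_rate=3e-5,
                                num_train_epochs=3)
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=ds,
                         data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))
trainer.train()
```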
Subwords to Word Back Composition for Morphologically Rich Languages in Neural Machine Translation
Telem Joyson Singh | Sanasam Ranbir Singh | Priyankoo Sarmah
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation