2025
Data and Model Centric Approaches for Expansion of Large Language Models to New Languages
Anoop Kunchukuttan | Raj Dabre | Rudra Murthy | Mohammed Safi Ur Rahman Khan | Thanmay Jayakumar
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts
Despite the increasing pace of Large Language Model (LLM) research, the vast majority of existing LLMs mainly support English alongside a handful of high-resource languages, leaving a major gap for most low-resource languages. In this tutorial, we focus on approaches to expand the language coverage of LLMs, which provides an efficient and viable path to bring LLM technologies to low-resource languages instead of training from scratch. We look at approaches at various stages of the LLM training pipeline, such as tokenizer training, pre-training, instruction tuning, alignment, and evaluation, where adaptations are made to support new languages. We cover data-oriented as well as model-oriented approaches. We hope that our tutorial enables researchers and practitioners to incorporate additional languages and tasks into existing LLMs, enhancing inclusivity and coverage.
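To make one of the pipeline stages above concrete, here is a minimal sketch of tokenizer/vocabulary expansion using the Hugging Face transformers API; the base model and the new subword tokens are illustrative assumptions, not material from the tutorial itself.

```python
# Minimal sketch: expanding an LLM's vocabulary with tokens for a new
# language before continued pre-training. Model name and token list are
# illustrative placeholders, not from the tutorial.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical subword units mined from new-language corpora.
new_tokens = ["▁नमस", "▁भाषा", "▁मॉडल"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new ids get (randomly initialised)
# vectors; these are then learned during continued pre-training.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")
```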
RomanLens: The Role Of Latent Romanization In Multilinguality In LLMs
Alan Saji | Jaavid Aktar Husain | Thanmay Jayakumar | Raj Dabre | Anoop Kunchukuttan | Ratish Puduppully
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) exhibit strong multilingual performance despite being predominantly trained on English-centric corpora. This raises a fundamental question: How do LLMs achieve such multilingual capabilities? Focusing on languages written in non-Roman scripts, we investigate the role of Romanization—the representation of non-Roman scripts using Roman characters—as a potential bridge in multilingual processing. Using mechanistic interpretability techniques, we analyze next-token generation and find that intermediate layers frequently represent target words in Romanized form before transitioning to native script, a phenomenon we term Latent Romanization. Further, through activation patching experiments, we demonstrate that LLMs encode semantic concepts similarly across native and Romanized scripts, suggesting a shared underlying representation. Additionally, for translation into non-Roman script languages, our findings reveal that when the target language is in Romanized form, its representations emerge earlier in the model’s layers compared to native script. These insights contribute to a deeper understanding of multilingual representation in LLMs and highlight the implicit role of Romanization in facilitating language transfer.
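A rough sketch of the kind of layer-wise inspection described above, in the style of a logit-lens probe: decode each intermediate layer's hidden state through the output head and see which token it favours. This is not the paper's actual code, and the model choice and prompt are assumptions.

```python
# Logit-lens-style probe: project each layer's hidden state at the last
# position through the final norm and output head, then decode the top
# token. Sketch only; not the paper's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # assumed English-centric LLM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
model.eval()

prompt = 'Translate to Hindi: "water" ->'
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.hidden_states: (num_layers + 1) tensors of shape [1, seq, d_model]
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.model.norm(h[:, -1]))  # last position
    top = tok.decode(logits.argmax(-1))
    # Per the finding above, Romanized forms (e.g. "pani") may surface in
    # intermediate layers before the native-script form.
    print(f"layer {layer:2d}: {top!r}")
```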
CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation
Emilio Villa-Cueva | Sholpan Bolatzhanova | Diana Turmakhan | Kareem Elzeky | Henok Biadglign Ademtew | Alham Fikri Aji | Vladimir Araujo | Israel Abebe Azime | Jinheon Baek | Frederico Belcavello | Fermin Cristobal | Jan Christian Blaise Cruz | Mary Dabre | Raj Dabre | Toqeer Ehsan | Naome A Etori | Fauzan Farooqui | Jiahui Geng | Guido Ivetta | Thanmay Jayakumar | Soyeong Jeong | Zheng Wei Lim | Aishik Mandal | Sofía Martinelli | Mihail Minkov Mihaylov | Daniil Orel | Aniket Pramanick | Sukannya Purkayastha | Israfel Salazar | Haiyue Song | Tiago Timponi Torrent | Debela Desalegn Yadeta | Injy Hamed | Atnafu Lambebo Tonja | Thamar Solorio
Findings of the Association for Computational Linguistics: EMNLP 2025
Translating cultural content poses challenges for machine translation systems due to the differences in conceptualizations between cultures, where language alone may fail to convey sufficient context to capture region-specific meanings. In this work, we investigate whether images can act as cultural context in multimodal translation. We introduce CaMMT, a human-curated benchmark of over 5,800 triples of images along with parallel captions in English and regional languages. Using this dataset, we evaluate five Vision Language Models (VLMs) in text-only and text+image settings. Through automatic and human evaluations, we find that visual context generally improves translation quality, especially in handling Culturally-Specific Items (CSIs), disambiguation, and correct gender marking. By releasing CaMMT, our objective is to support broader efforts to build and evaluate multimodal translation systems that are better aligned with cultural nuance and regional variations.
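As a sketch of how the text-only vs. text+image comparison above might be scored automatically, here is a minimal example using sacrebleu's chrF; the sample caption and the two hypothetical system outputs are invented placeholders, not CaMMT data.

```python
# Sketch: score text-only vs. text+image translations of the same
# caption with chrF via sacrebleu. All strings are invented
# placeholders standing in for VLM outputs and CaMMT references.
import sacrebleu

references = ["A clay pot used to cool drinking water in summer."]

hyp_text_only = ["A clay jug for storing water."]
hyp_text_image = ["A clay pot used to cool drinking water in summer."]

for label, hyps in [("text-only", hyp_text_only),
                    ("text+image", hyp_text_image)]:
    chrf = sacrebleu.corpus_chrf(hyps, [references])
    print(f"{label:>10}: chrF = {chrf.score:.1f}")
```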
2024
RomanSetu: Efficiently unlocking multilingual capabilities of Large Language Models via Romanization
Jaavid J | Raj Dabre | Aswanth M | Jay Gala | Thanmay Jayakumar | Ratish Puduppully | Anoop Kunchukuttan
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This study addresses the challenge of extending Large Language Models (LLMs) to non-English languages, specifically those using non-Roman scripts. We propose an approach that utilizes the romanized form of text as an interface for LLMs, hypothesizing that its frequent informal use and shared tokens with English enhance cross-lingual alignment. Our approach involves continual pretraining of an English LLM such as Llama 2 on romanized text of non-English, non-Roman script languages, followed by instruction tuning on romanized data. The results indicate that romanized text not only reduces token fertility by 2x-4x but also matches, if not outperforms, native script representation across various NLU, NLG and MT tasks. Moreover, the embeddings computed on romanized text exhibit closer alignment with their English translations than those from the native script. Our approach presents a promising direction for leveraging the power of English LLMs in languages traditionally underrepresented in NLP research.
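To illustrate the token-fertility comparison mentioned above, here is a minimal sketch that romanizes a Hindi sentence and counts subword tokens per word; the tokenizer, the transliteration scheme (IAST via the indic-transliteration package), and the sample sentence are assumptions, and the paper's own romanizer and corpora may differ.

```python
# Sketch: token fertility (tokens per word) for native-script vs.
# romanized text. Tokenizer and romanization scheme are assumptions.
from transformers import AutoTokenizer
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

native = "भारत एक विशाल देश है"  # Hindi, Devanagari script
roman = transliterate(native, sanscript.DEVANAGARI, sanscript.IAST)

def fertility(text: str) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    return len(tok.tokenize(text)) / len(text.split())

print(f"native   : {fertility(native):.2f} tokens/word")
print(f"romanized: {fertility(roman):.2f} tokens/word")
```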
Leveraging Linguistically Enhanced Embeddings for Open Information Extraction
Fauzan Nayeem Farooqui | Thanmay Jayakumar | Pulkit Mathur | Mansi A. Radke
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Open Information Extraction (OIE) is a structure prediction (SP) task in Natural Language Processing (NLP) that aims to extract structured n-ary tuples - usually subject-relation-object triples - from free text. The word embeddings in the input text can be enhanced with linguistic features, usually Part-of-Speech (PoS) and Syntactic Dependency Parse (SynDP) labels. However, past enhancement techniques cannot leverage the power of pre-trained language models (PLMs), which themselves have hardly been used for OIE. To bridge this gap, we are the first to leverage linguistic features with a Seq2Seq PLM for OIE. We do so by introducing two methods: Weighted Addition and Linearized Concatenation. Our work gives any neural OIE architecture the key performance boost from both PLMs and linguistic features in one go. In our settings, this shows wide improvements of up to 24.9%, 27.3% and 14.9% on Precision, Recall and F1 scores respectively over the baseline. Beyond this, we address other important challenges in the field: to reduce the compute overheads of the features, we are the first to exploit Semantic Dependency Parse (SemDP) tags; to address flaws in current datasets, we create a clean synthetic dataset; finally, we contribute the first known study of OIE behaviour in SP models.
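A minimal sketch of what a Weighted Addition style fusion of PLM token embeddings with linguistic-feature embeddings could look like; the learned-scalar formulation, dimensions, and names are assumptions for illustration, not the paper's exact method.

```python
# Sketch: "Weighted Addition"-style fusion of PLM token embeddings with
# PoS feature embeddings. The learned mixing weight and dimensions are
# illustrative assumptions; the paper's formulation may differ.
import torch
import torch.nn as nn

class WeightedAddition(nn.Module):
    def __init__(self, d_model: int, num_pos_tags: int):
        super().__init__()
        self.pos_embed = nn.Embedding(num_pos_tags, d_model)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learned mixing weight

    def forward(self, token_embs: torch.Tensor, pos_ids: torch.Tensor):
        # token_embs: [batch, seq, d_model] from the Seq2Seq PLM encoder
        # pos_ids:    [batch, seq] integer PoS tag ids
        return token_embs + self.alpha * self.pos_embed(pos_ids)

fuse = WeightedAddition(d_model=768, num_pos_tags=18)  # 17 UPOS tags + pad
embs = torch.randn(2, 10, 768)
pos = torch.randint(0, 18, (2, 10))
print(fuse(embs, pos).shape)  # torch.Size([2, 10, 768])
```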
2023
Large Language Models are legal but they are not: Making the case for a powerful LegalLLM
Thanmay Jayakumar | Fauzan Farooqui | Luqman Farooqui
Proceedings of the Natural Legal Language Processing Workshop 2023
Bringing the recent advances in Natural Language Processing (NLP) to the legal sector poses challenging problems such as extremely long sequence lengths, specialized vocabulary that is usually only understood by legal professionals, and high amounts of data imbalance. The recent surge of Large Language Models (LLMs) has begun to provide new opportunities to apply NLP in the legal domain due to their ability to handle lengthy, complex sequences. Moreover, the emergence of domain-specific LLMs has displayed extremely promising results on various tasks. In this study, we aim to quantify how general LLMs perform in comparison to legal-domain models (be it an LLM or otherwise). Specifically, we compare the zero-shot performance of three general-purpose LLMs (ChatGPT-3.5, LLaMA-70b and Falcon-180b) on the LEDGAR subset of the LexGLUE benchmark for contract provision classification. Although the LLMs were not explicitly trained on legal data, we observe that they are still able to classify the theme correctly in most cases. However, we find that their micro-F1/macro-F1 scores are up to 19.2/26.8% lower than those of smaller models fine-tuned on the legal domain, underscoring the need for more powerful legal-domain LLMs.
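For reference, here is a minimal sketch of the micro-/macro-F1 comparison used above, via scikit-learn; the label strings are invented placeholders standing in for LEDGAR contract-provision classes, not the paper's data.

```python
# Sketch: micro- vs. macro-F1 over classification outputs. Labels are
# invented placeholders, not LEDGAR data.
from sklearn.metrics import f1_score

y_true = ["governing_law", "termination", "indemnification",
          "governing_law", "confidentiality"]
y_pred = ["governing_law", "termination", "governing_law",
          "governing_law", "confidentiality"]

# Micro-F1 aggregates over all decisions; macro-F1 averages per-class
# F1, so rare classes weigh as much as frequent ones.
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
```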