2025
Leveraging the Cross-Domain & Cross-Linguistic Corpus for Low Resource NMT: A Case Study On Bhili-Hindi-English Parallel Corpus
Pooja Singh | Shashwat Bhardwaj | Vaibhav Sharma | Sandeep Kumar
Findings of the Association for Computational Linguistics: EMNLP 2025
The linguistic diversity of India poses significant machine translation challenges, especially for underrepresented tribal languages like Bhili, which lack high-quality linguistic resources. This paper addresses the gap by introducing Bhili-Hindi-English Parallel Corpus (BHEPC), the first and largest parallel corpus worldwide comprising 110,000 meticulously curated sentences across Bhili, Hindi, and English. The corpus was created with the assistance of expert human translators. BHEPC spans critical domains such as education, administration, and news, establishing a valuable benchmark for research in low resource machine translation. To establish a comprehensive Bhili Machine Translation benchmark, we evaluated a wide range of proprietary and open-source Multilingual Large Language Models (MLLMs) on bidirectional translation tasks between English/Hindi and Bhili. Comprehensive evaluation demonstrates that the fine-tuned NLLB-200 distilled 600M variant model outperforms others, highlighting the potential of multilingual models in low resource scenarios. Furthermore, we investigated the generative translation capabilities of multilingual LLMs on BHEPC using in-context learning, assessing performance under cross-domain generalization and quantifying distributional divergence. This work bridges a critical resource gap and promotes inclusive natural language processing technologies for low-resource and marginalized languages globally.
GARuD: Guided Alignment of Representations using Distillation for Ultra-Low-Resource Languages
Debarchan Basu | Shashwat Bhardwaj | Vaibhav Sharma | Pooja Singh | Sandeep Kumar
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
The vast majority of the world’s languages, particularly low-resource and indigenous ones like Bhili, remain critically underserved by modern language technologies. The primary bottleneck is the lack of large-scale corpora required for standard pre-training. To address this gap, we introduce cross-lingual contrastive distillation, a novel, data-efficient, and compute-efficient paradigm for creating powerful language models without a massive monolingual corpus. Our method adapts a pre-existing multilingual model (MuRIL) by using a fixed, expert teacher model (HindBERT) to distill semantic knowledge from a related high-resource language (Hindi) via a contrastive objective on a modest parallel corpus. Through comprehensive experiments, we show that our resulting model, GARuD-Bhili, significantly outperforms strong zero-shot and MLM-only baselines on a suite of evaluations, including intrinsic language modeling, downstream sentiment analysis, and cross-lingual benchmarks (Tatoeba, XNLI). Our work presents a generalizable and scalable blueprint for linguistic empowerment, offering a practical pathway to develop robust language technologies for other underserved languages globally.
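The abstract above does not give the exact form of the contrastive objective. As a rough illustration only, the sketch below assumes an InfoNCE-style loss with in-batch negatives: each student (Bhili) sentence embedding is pulled toward the frozen teacher (Hindi) embedding of its parallel sentence and pushed away from the other teacher embeddings in the batch. The function names, the cosine similarity, and the temperature value are assumptions made for this example, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_distillation_loss(student, teacher, temperature=0.05):
    """Assumed InfoNCE-style loss over a batch of parallel sentence pairs.

    student[i] is the student embedding of the i-th Bhili sentence and
    teacher[i] is the frozen teacher embedding of its Hindi translation.
    Every other teacher[j] in the batch serves as a negative for student[i].
    """
    n = len(student)
    total = 0.0
    for i in range(n):
        logits = [cosine(student[i], teacher[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # Cross-entropy with the parallel sentence as the correct class.
        total += log_denominator - logits[i]
    return total / n


# Toy 2-dimensional embeddings, invented purely for illustration.
student_batch = [[1.0, 0.0], [0.0, 1.0]]
teacher_aligned = [[1.0, 0.0], [0.0, 1.0]]
teacher_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = contrastive_distillation_loss(student_batch, teacher_aligned)
loss_swapped = contrastive_distillation_loss(student_batch, teacher_swapped)
```

In this toy batch the loss is close to zero when each student vector matches its own teacher vector and large when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to reward. In a real setup only the student would receive gradients, since the teacher is described as fixed.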
Findings of the MMLoSo 2025 Shared Task on Machine Translation into Tribal Languages
Pooja Singh | Sandeep Chatterjee | Gullal S. Cheema | Amrit Singh Bedi | Tanmoy Chakraborty | Sandeep Kumar | Ankita Shukla
Proceedings of the 1st Workshop on Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo 2025)
This paper presents the findings of the MMLoSo Shared Task on Machine Translation. The competition features four tribal languages from India: Bhili, Mundari, Gondi, and Santali, each with 20,000 high-quality parallel sentence pairs and a 16,000-sentence evaluation set. A total of 18 teams submitted across all language pairs. The shared task addresses the challenges of translating India’s severely low-resource tribal languages, which, despite having millions of speakers, remain digitally marginalized due to limited textual resources, diverse scripts, rich morphology, and minimal publicly available parallel corpora. Systems were ranked using a weighted composite score combining BLEU (60%) and chrF (40%) to balance structural accuracy and character-level fluency. The best-performing system leveraged IndicTrans2 with directional LoRA adapters and reverse-model reranking. This work establishes the first reproducible benchmark for machine translation in these tribal languages. All datasets, baseline models, and system outputs are publicly released to support continued research in India’s tribal language technologies.
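The ranking metric described above is a weighted sum, which can be sketched as follows. This is a minimal illustration that assumes both BLEU and chrF are reported on the same 0 to 100 scale; the function names, team names, and score values are hypothetical placeholders, not results from the shared task.

```python
def composite_score(bleu, chrf, w_bleu=0.6, w_chrf=0.4):
    """Weighted combination of BLEU and chrF (weights as stated in the abstract)."""
    if abs(w_bleu + w_chrf - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1")
    return w_bleu * bleu + w_chrf * chrf


def rank_systems(scores):
    """Sort systems by composite score, best first.

    scores maps a system name to a (bleu, chrf) tuple.
    """
    ranked = [(name, composite_score(bleu, chrf)) for name, (bleu, chrf) in scores.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical placeholder scores, invented for illustration only.
example_scores = {
    "team_a": (20.0, 50.0),  # 0.6 * 20 + 0.4 * 50 = 32
    "team_b": (25.0, 40.0),  # 0.6 * 25 + 0.4 * 40 = 31
}
leaderboard = rank_systems(example_scores)
```

With these placeholder values, the system with the lower BLEU still ranks first because its chrF advantage outweighs the BLEU gap under the 60/40 weighting.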
2012
Development of Text and Speech database for Hindi and Indian English specific to Mobile Communication environment
Shyam Agrawal | Shweta Sinha | Pooja Singh | Jesper Olson
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper describes the method and experience of collecting text and speech data for mobile communication in Hindi and Indian English. The primary data consist of a large number of messages exchanged in personal communication among native speakers of Hindi and Indian speakers of English. To capture the variety of mobile communication in Hindi and English, 12 domains were identified for collecting the text corpus from speakers of different age groups, sexes, and dialects. The raw text, which contained slang and unconventional grammar, was cleaned using language grammar rules and then tagged and expanded to explain the context-specific meaning of the words. Texts from 1163 participants from Hindi-speaking regions and 1405 English users were used to create 13 prompt sheets containing 630 phonetically rich sentences generated with special software. Each prompt sheet was recorded by at least 7 users simultaneously in three channels; in total, 100 speakers were recorded, and the recordings were annotated. The work is a step toward developing standards for mobile text and speech data collection for Indian languages. Keywords: speech database, text analysis, mobile communication, Hindi and Indian English speech, multilingual speech processing.