Challenge Track: Breaking Language Barriers: Adapting NLLB-200 and mBART for Bhilli, Gondi, Mundari, and Santali Without Source Language Proficiency

Paul Kamau


Abstract
This paper presents a language-agnostic approach to neural machine translation for low-resource Indian tribal languages: Bhilli, Gondi, Mundari, and Santali. Developed under the constraint of zero proficiency in the source languages, the methodology relies on the cross-lingual transfer capabilities of two foundation models, NLLB-200 and mBART-50. The approach employs a unified bidirectional fine-tuning strategy to maximize the utility of limited parallel corpora. A primary contribution of this work is a smart post-processing pipeline and a “conservative ensemble” mechanism. This mechanism integrates predictions from a secondary model specifically as a safety net to mitigate hallucinations and length-ratio artifacts generated by the primary model. The approach achieved a private leaderboard score of 179.49 in the MMLoSo 2025 Language Challenge. These findings demonstrate that effective translation systems for underrepresented languages can be engineered without native linguistic intuition by leveraging data-centric validation and the latent knowledge within massive multilingual models.
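The abstract's "conservative ensemble" can be pictured as a fallback rule: trust the primary model unless its output shows a symptom of degeneration, such as an implausible length ratio, and only then substitute the secondary model's prediction. The paper does not publish its exact criteria; the sketch below assumes a whitespace-token length-ratio test, and the threshold values are hypothetical placeholders.

```python
def length_ratio(source: str, hypothesis: str) -> float:
    """Ratio of hypothesis length to source length, in whitespace tokens."""
    src_len = max(len(source.split()), 1)
    return len(hypothesis.split()) / src_len


def conservative_ensemble(source: str, primary_out: str, secondary_out: str,
                          min_ratio: float = 0.4, max_ratio: float = 2.5) -> str:
    """Keep the primary model's output unless it looks degenerate.

    An implausible length ratio is a common symptom of hallucination or
    repetition loops; in that case, fall back to the secondary model.
    The 0.4 / 2.5 bounds are illustrative, not the paper's values.
    """
    ratio = length_ratio(source, primary_out)
    if min_ratio <= ratio <= max_ratio:
        return primary_out
    return secondary_out
```

The secondary model acts purely as a safety net: on well-formed outputs it never overrides the (typically stronger) primary model, which keeps the ensemble "conservative".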
Anthology ID:
2025.mmloso-1.11
Volume:
Proceedings of the 1st Workshop on Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo 2025)
Month:
December
Year:
2025
Address:
Mumbai, India
Editors:
Ankita Shukla, Sandeep Kumar, Amrit Singh Bedi, Tanmoy Chakraborty
Venues:
MMLoSo | WS
Publisher:
Association for Computational Linguistics
Pages:
106–108
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.mmloso-1.11/
Cite (ACL):
Paul Kamau. 2025. Challenge Track: Breaking Language Barriers: Adapting NLLB-200 and mBART for Bhilli, Gondi, Mundari, and Santali Without Source Language Proficiency. In Proceedings of the 1st Workshop on Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo 2025), pages 106–108, Mumbai, India. Association for Computational Linguistics.
Cite (Informal):
Challenge Track: Breaking Language Barriers: Adapting NLLB-200 and mBART for Bhilli, Gondi, Mundari, and Santali Without Source Language Proficiency (Kamau, MMLoSo 2025)
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.mmloso-1.11.pdf