MaitH 1.0: A Parallel Corpus and Baseline for Low-Resource Maithili-Hindi Translation

Kamanksha Prasad Dubey; Chandresh Kumar Maurya; Kumar Padmanabh

Abstract

Maithili is one of the 22 official languages recognized in the Indian Constitution. The literature of Maithili is rich; however, due to current socio-political changes, the language is on the verge of extinction. It is therefore crucial to develop corpora for low-resource Indic languages like Maithili so that the dream of “No Language Left Behind” (NLLB) is realized. With this in mind, we contribute a corpus of 1,05,600 sentences containing both manually curated and synthetically generated data. Additionally, we propose a strong baseline on the Maithili-Hindi pair using multilingual pretrained models such as IndicTrans2, mBART50, mT5, and NLLB-200 distilled. We evaluate the translation systems using standard performance metrics, including BLEU, CHRF2, TER, COMET, METEOR, and BERTScore. Comparative experiments against the existing NLLB dataset (5,50,300 sentence pairs) demonstrate that our proposed dataset consistently yields superior translation quality. These results demonstrate that, even with a smaller corpus, high-quality, task-specific data significantly enhance translation accuracy for low-resource Indian languages such as Maithili.

Anthology ID: 2026.lrec-main.676
Volume: Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month: May
Year: 2026
Address: Palma de Mallorca, Spain
Editors: Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue: LREC
Publisher: ELRA Language Resource Association
Pages: 8567–8576
URL: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.676/
Cite (ACL): Kamanksha Prasad Dubey, Chandresh Maurya, and Kumar Padmanabh. 2026. MaitH 1.0: A Parallel Corpus and Baseline for Low-Resource Maithili-Hindi Translation. International Conference on Language Resources and Evaluation, main:8567–8576.
Cite (Informal): MaitH 1.0: A Parallel Corpus and Baseline for Low-Resource Maithili-Hindi Translation (Dubey et al., LREC 2026)
PDF: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.676.pdf
