AssamLegalTrans: A Parallel Corpus, Benchmark and Analysis for English-Assamese Machine Translation of Legal Judgments

Telem Joyson Singh, Hemanta Baruah, Sanasam Ranbir Singh, Anindita Talukdar, Nasrin Shahnaz, Okram Jimmy Singh, Priyankoo Sarmah, Pallav Kumar Dutta, Sukumar Nandi, Pranab Duara


Abstract
In India, English is the official language for writing judgments in the higher courts, which creates a language barrier for citizens who are not proficient in English. Machine Translation (MT) offers a scalable solution, but progress for low-resource languages such as Assamese is limited by the lack of legal-domain data. To address this gap, we introduce the first English-Assamese parallel corpus for the translation of Indian court judgments. The dataset consists of over 55,000 manually translated and validated sentence pairs drawn from over 500 judgments of the Gauhati High Court and the Supreme Court of India. Using this dataset, we comprehensively evaluate state-of-the-art multilingual models, including NLLB-200 and Sarvam-Translate, in both zero-shot and fine-tuned settings, and compare their performance against commercial systems. Our experiments show that fine-tuning on our legal-domain dataset significantly improves translation quality. We also conduct a thorough error analysis that identifies key challenges in English-to-Assamese legal translation: translating legal terms precisely, transliterating named entities correctly, expanding abbreviations, and transforming sentence structure, such as changing passive voice to active voice. By releasing a publicly available dataset and examining these specific challenges, this work offers a reproducible foundation and a clear path toward more accurate and reliable legal machine translation systems, which will help improve access to justice for Assamese speakers.
Anthology ID:
2026.lrec-main.386
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
4921–4930
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.386/
Cite (ACL):
Telem Joyson Singh, Hemanta Baruah, Sanasam Ranbir Singh, Anindita Talukdar, Nasrin Shahnaz, Okram Jimmy Singh, Priyankoo Sarmah, Pallav Kumar Dutta, Sukumar Nandi, and Pranab Duara. 2026. AssamLegalTrans: A Parallel Corpus, Benchmark and Analysis for English-Assamese Machine Translation of Legal Judgments. International Conference on Language Resources and Evaluation, main:4921–4930.
Cite (Informal):
AssamLegalTrans: A Parallel Corpus, Benchmark and Analysis for English-Assamese Machine Translation of Legal Judgments (Singh et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.386.pdf