Hemanta Baruah
2026
AssamLegalTrans: A Parallel Corpus, Benchmark and Analysis for English-Assamese Machine Translation of Legal Judgments
Telem Joyson Singh | Hemanta Baruah | Sanasam Ranbir Singh | Anindita Talukdar | Nasrin Shahnaz | Okram Jimmy Singh | Priyankoo Sarmah | Pallav Kumar Dutta | Sukumar Nandi | Pranab Duara
Proceedings of the Fifteenth Language Resources and Evaluation Conference
In India, the official language for writing judgments in higher courts is English, which creates a language barrier for citizens not proficient in English. Machine Translation (MT) offers a scalable solution, but its progress for low-resource languages like Assamese is severely limited by the lack of legal-domain data. To address this gap, we introduce the first-of-its-kind English-Assamese parallel corpus for the translation of Indian court judgments. The dataset consists of over 55,000 manually translated and validated sentence pairs drawn from over 500 judgments of the Gauhati High Court and the Supreme Court of India. Using this dataset, we perform a comprehensive evaluation of state-of-the-art multilingual models, including NLLB-200 and Sarvam-Translate, in both zero-shot and fine-tuned settings, and compare their performance against commercial systems. Our experiments show that fine-tuning on our legal-domain dataset significantly improves translation quality. We also conduct a thorough error analysis that highlights key challenges in translating legal text from English to Assamese: rendering legal terminology precisely, transliterating named entities correctly, expanding abbreviations, and restructuring sentences, for example converting passive voice to active voice. By releasing a publicly available dataset and analyzing these specific challenges, this work offers a reproducible foundation and a clear path toward more accurate and reliable legal machine translation systems, improving access to justice for Assamese speakers.
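As a concrete illustration of the zero-shot setting evaluated in the paper, the sketch below translates one English legal sentence into Assamese with a publicly available NLLB-200 checkpoint through the Hugging Face transformers API. The checkpoint name, the FLORES-200 language codes (eng_Latn, asm_Beng), and the example sentence are assumptions for illustration; this is not the authors' exact evaluation pipeline.

```python
# Minimal zero-shot English->Assamese translation with NLLB-200
# (a sketch, not the paper's evaluation pipeline).
# Assumes: pip install transformers sentencepiece torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "facebook/nllb-200-distilled-600M"  # small public checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

sentence = "The appeal is dismissed and the judgment of the High Court is upheld."
inputs = tokenizer(sentence, return_tensors="pt")

# Force the decoder to start generating in Assamese (Bengali script),
# using the FLORES-200 language code asm_Beng.
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("asm_Beng"),
    max_new_tokens=128,
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

Fine-tuning on the legal corpus would follow the same setup, continuing training of this model on the 55,000+ sentence pairs before decoding.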
2024
AssameseBackTranslit: Back Transliteration of Romanized Assamese Social Media Text
Hemanta Baruah | Sanasam Ranbir Singh | Priyankoo Sarmah
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper presents a novel back-transliteration dataset capturing native-language text originally composed in the Roman/Latin script, harvested from popular social media platforms, along with its corresponding representation in the native Assamese script. Assamese, a low-resource Indo-Aryan language spoken predominantly in the north-east Indian state of Assam, faces a scarcity of linguistic resources. The dataset comprises a total of 60,312 Roman-native parallel transliterated sentences. Unlike conventional forward transliteration datasets, which consist mainly of named entities and technical terms, this dataset is drawn from three prominent social media platforms, Facebook, Twitter (now X), and YouTube, and covers the backward transliteration direction. The paper offers a comprehensive examination of ten state-of-the-art word-level transliteration models on this dataset, encompassing transliteration evaluation benchmarks, extensive performance assessments, and a discussion of the unique challenges encountered when processing transliterated social media content. We first apply two statistical transliteration models, then train two state-of-the-art neural network-based transliteration models, evaluate three publicly available pre-trained models, and finally fine-tune an existing state-of-the-art multilingual transliteration model along with two pre-trained large language models on the collected dataset. Notably, the Neural Transformer model outperforms all other baseline transliteration models, achieving the lowest Word Error Rate (WER) and Character Error Rate (CER) and the highest BLEU (up to 4-gram) score, at 55.05, 19.44, and 69.15, respectively.
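For readers unfamiliar with the reported metrics, the sketch below implements the standard edit-distance definitions of Word Error Rate (WER) and Character Error Rate (CER) used to score transliteration output. The helper names and the toy Assamese reference/hypothesis pair are illustrative assumptions, not the paper's evaluation code.

```python
# WER and CER via Levenshtein distance -- the standard definitions
# behind transliteration scores like those reported above.
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance(ref[:0], hyp[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # delete ref[i-1]
                dp[j - 1] + 1,                      # insert hyp[j-1]
                prev + (ref[i - 1] != hyp[j - 1]),  # substitute
            )
            prev = cur
    return dp[n]

def wer(ref, hyp):
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)

def cer(ref, hyp):
    return edit_distance(ref, hyp) / max(len(ref), 1)

# Toy example (illustrative only): one vowel-sign error in the hypothesis.
reference  = "মই আজি ঘৰলৈ যাম"
hypothesis = "মই আজি ঘৰলে যাম"
print(f"WER = {wer(reference, hypothesis):.2%}, CER = {cer(reference, hypothesis):.2%}")
```

BLEU (up to 4-gram) is computed analogously over n-gram overlap, typically with a standard toolkit such as sacrebleu.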