A thresholding method for Improving translation Quality for Indic MT task
Sudhansu Bala Das, Leo Raphael Rodrigues, Tapas Kumar Mishra, Bidyut Ku Patra
Abstract
The conversion of content from one language to another using a computer system is known as Machine Translation (MT). Various techniques have been used to ensure effective translations that retain the contextual and lexical interpretation of the source and target languages. One of these methods is end-to-end Neural Machine Translation (NMT), which is widely used in real-world MT systems. NMT requires large parallel datasets for effective translation. An MT system relies on these datasets during training to learn the linguistic patterns and structures of both languages. One such dataset is Samanantar, the largest publicly accessible parallel dataset for Indian languages (ILs). Because such datasets are gathered from various sources, they contain many incorrect or dissimilar translations. Hence, MT systems built using this dataset cannot perform to their full potential. This paper proposes an algorithm to remove dissimilar translations from the training dataset and evaluates the model’s efficiency. Two ILs, Hindi (HIN) and Odia (ODI), are chosen for the experiment. A baseline NMT system is built for these languages, and the effect of different dataset sizes is investigated. The quality of the translations in the experiment is evaluated using standard metrics. The results show that removing the dissimilar translations from the training dataset improves translation quality. It is also observed that, although the ILs-English and English-ILs systems are trained on the same dataset, ILs-English performs better across all the evaluation metrics.
- Anthology ID:
- 2025.lowresnlp-1.3
- Volume:
- Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages
- Month:
- September
- Year:
- 2025
- Address:
- Varna, Bulgaria
- Editors:
- Ernesto Luis Estevanell-Valladares, Alicia Picazo-Izquierdo, Tharindu Ranasinghe, Besik Mikaberidze, Simon Ostermann, Daniil Gurgurov, Philipp Mueller, Claudia Borg, Marián Šimko
- Venues:
- LowResNLP | WS
- Publisher:
- INCOMA Ltd., Shoumen, Bulgaria
- Pages:
- 12–20
- URL:
- https://preview.aclanthology.org/corrections-2026-01/2025.lowresnlp-1.3/
- Cite (ACL):
- Sudhansu Bala Das, Leo Raphael Rodrigues, Tapas Kumar Mishra, and Bidyut Ku Patra. 2025. A thresholding method for Improving translation Quality for Indic MT task. In Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages, pages 12–20, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
- Cite (Informal):
- A thresholding method for Improving translation Quality for Indic MT task (Das et al., LowResNLP 2025)
- PDF:
- https://preview.aclanthology.org/corrections-2026-01/2025.lowresnlp-1.3.pdf