2025
A Thresholding Method for Improving Translation Quality for Indic MT Task
Sudhansu Bala Das | Leo Raphael Rodrigues | Tapas Kumar Mishra | Bidyut Ku Patra
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages
The conversion of content from one language to another using a computer system is known as Machine Translation (MT). Various techniques have been used to ensure effective translations that retain the contextual and lexical interpretation of the source and target languages. One of these methods is end-to-end Neural Machine Translation (NMT), which is frequently utilized in real-world machine translation systems. NMT requires large parallel datasets for effective translation; during the training phase, an MT system relies on such data to learn the linguistic patterns and structures of both languages. One such dataset is Samanantar, the largest publicly accessible parallel dataset for Indian languages (ILs). Since it has been gathered from various sources, it contains many incorrect or dissimilar translations, and MT systems built on it cannot perform to their full potential. This paper proposes an algorithm to remove dissimilar translations from the training dataset and evaluates its effect on model performance. Two ILs, Hindi (HIN) and Odia (ODI), were chosen for the experiment. A baseline NMT system is built for these languages, and the effect of different dataset sizes is investigated. Translation quality is evaluated using standard metrics. The results show that removing dissimilar translations from the training dataset improves translation quality. It is also observed that, although the ILs-English and English-ILs systems are trained on the same dataset, ILs-English performs better across all evaluation metrics.
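
The kind of threshold-based filtering described in the abstract can be sketched as follows. This is a minimal illustration, assuming that each sentence pair is scored with the cosine similarity of multilingual sentence embeddings (here LaBSE via the sentence-transformers library) and that pairs scoring below a fixed threshold are discarded; the encoder choice, the filter_parallel_corpus helper, and the 0.75 threshold are assumptions for illustration, not necessarily the authors' exact method.

```python
# Hypothetical sketch: filter a parallel corpus by a similarity threshold.
# Encoder (LaBSE) and threshold value are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer


def filter_parallel_corpus(src_sentences, tgt_sentences, threshold=0.75):
    """Keep only sentence pairs whose cross-lingual cosine similarity
    meets the threshold; return the filtered pairs."""
    model = SentenceTransformer("sentence-transformers/LaBSE")
    src_emb = model.encode(src_sentences, normalize_embeddings=True)
    tgt_emb = model.encode(tgt_sentences, normalize_embeddings=True)
    # Embeddings are L2-normalized, so the dot product of aligned rows
    # equals the cosine similarity of each source-target pair.
    sims = np.sum(src_emb * tgt_emb, axis=1)
    return [(s, t) for s, t, sim in zip(src_sentences, tgt_sentences, sims)
            if sim >= threshold]


if __name__ == "__main__":
    src = ["The weather is nice today.", "This sentence has no counterpart."]
    tgt = ["आज मौसम अच्छा है।", "ଏହି ବାକ୍ୟଟି ସମ୍ପୂର୍ଣ୍ଣ ଭିନ୍ନ ବିଷୟରେ।"]
    for pair in filter_parallel_corpus(src, tgt, threshold=0.75):
        print(pair)
```

In practice the threshold would be tuned per language pair and dataset size, since an overly strict cutoff removes valid but free translations while a loose one retains the noisy pairs the filtering is meant to eliminate.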