A Preliminary Exploration of Phrase-Based SMT and Multi-BPE Segmentations through Concatenated Tokenised Corpora for Low-Resource Indian Languages

Saumitra Yadav; Manish Shrivastava

Abstract

This paper describes our methodology and findings in building Machine Translation (MT) systems for the WMT 2025 Shared Task on Low-Resource Indic Language Translation. Our primary aim was to evaluate a phrase-based Statistical Machine Translation (SMT) system combined with a less common subword segmentation strategy for languages with very limited parallel data. We applied multiple Byte Pair Encoding (BPE) merge operations to the parallel corpora and concatenated the outputs to improve vocabulary coverage. We built systems for the English–Nyishi, English–Khasi, and English–Assamese language pairs. Although the approach showed potential as a data augmentation method, its BLEU scores were not competitive with those of other shared task systems. This paper outlines our system architecture, data processing pipeline, and evaluation results, and analyses the challenges encountered, positioning our work as an exploratory benchmark for future research in this area.
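
The multi-BPE idea in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names (`learn_bpe`, `segment`, `multi_bpe_concat`), the merge counts, the `@@` continuation marker, and the choice to learn BPE separately per language side are all assumptions made for illustration, and the toy sentences are placeholders rather than shared task data. The sketch segments the same parallel corpus once per merge count and stacks the results, so the concatenated corpus is several times larger and contains the same sentences at several subword granularities.

```python
from collections import Counter

END = "</w>"  # end-of-word marker used while learning merges


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(lines, num_merges):
    """Learn up to `num_merges` BPE merges from whitespace-tokenised lines."""
    vocab = Counter(tuple(w) + (END,) for line in lines for w in line.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties broken lexicographically for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[merge_pair(symbols, best)] += freq
        vocab = merged
    return merges


def segment(line, merges):
    """Apply learned merges in order and mark non-final subwords with '@@'."""
    words = []
    for word in line.split():
        symbols = tuple(word) + (END,)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        symbols = list(symbols)
        if symbols[-1] == END:
            symbols.pop()
        else:
            symbols[-1] = symbols[-1][: -len(END)]
        words.append("@@ ".join(symbols))
    return " ".join(words)


def multi_bpe_concat(src_lines, tgt_lines, merge_counts):
    """Segment the parallel corpus once per merge count and concatenate the copies."""
    out_src, out_tgt = [], []
    for n in merge_counts:
        # Assumption: BPE is learned separately for each side of the corpus.
        src_merges = learn_bpe(src_lines, n)
        tgt_merges = learn_bpe(tgt_lines, n)
        out_src += [segment(s, src_merges) for s in src_lines]
        out_tgt += [segment(t, tgt_merges) for t in tgt_lines]
    return out_src, out_tgt


# Placeholder toy corpus; hypothetical merge counts.
toy_src = ["low lower lowest", "new newer newest"]
toy_tgt = ["wide wider widest", "old older oldest"]
cat_src, cat_tgt = multi_bpe_concat(toy_src, toy_tgt, merge_counts=[2, 10])
for s, t in zip(cat_src, cat_tgt):
    print(s, "|||", t)
```

The concatenated source and target lists stay line-aligned, so they could be passed to a phrase-based SMT training pipeline in place of a single segmented corpus; removing `@@ ` from a system's output restores whole words.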

Anthology ID: 2025.wmt-1.103
Volume: Proceedings of the Tenth Conference on Machine Translation
Month: November
Year: 2025
Address: Suzhou, China
Editors: Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue: WMT
Publisher: Association for Computational Linguistics
Pages: 1253–1258
URL: https://preview.aclanthology.org/ingest-emnlp/2025.wmt-1.103/
Cite (ACL): Saumitra Yadav and Manish Shrivastava. 2025. A Preliminary Exploration of Phrase-Based SMT and Multi-BPE Segmentations through Concatenated Tokenised Corpora for Low-Resource Indian Languages. In Proceedings of the Tenth Conference on Machine Translation, pages 1253–1258, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): A Preliminary Exploration of Phrase-Based SMT and Multi-BPE Segmentations through Concatenated Tokenised Corpora for Low-Resource Indian Languages (Yadav & Shrivastava, WMT 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.wmt-1.103.pdf
