TranssionMT’s Submission to the Indic MT Shared Task in WMT 2025
Zebiao Zhou | Hui Li | Xiangxun Zhu | Kangzhen Liu
Proceedings of the Tenth Conference on Machine Translation
This study addresses the low-resource Indian language translation tasks (English-Assamese and English-Manipuri) at WMT 2025, proposing a cross-iterative back-translation and data augmentation approach based on dual pre-trained models to enhance translation performance in low-resource scenarios. The methodology encompasses four aspects: (1) utilizing the open-source pre-trained models IndicTrans2_1B and NLLB_3.3B, fine-tuning them on the official bilingual data, then alternating back-translation and incremental training to generate high-quality pseudo-parallel corpora and optimize model parameters across multiple iterations; (2) employing the open-source semantic similarity model all-mpnet-base-v2 to filter out, from open-source corpora such as NLLB and BPCC, monolingual sentences with low semantic similarity to the test set, thereby improving the relevance of the monolingual data to the task; (3) cleaning the training data, including removing URLs and HTML markup, eliminating untranslated back-translation outputs, standardizing symbol formats, and normalizing the capitalization of the first letter; (4) during inference, combining the outputs generated by the fine-tuned IndicTrans2_1B and NLLB_3.3B models.
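The back-translation step in (1) can be illustrated with a minimal sketch, assuming the Hugging Face checkpoint facebook/nllb-200-3.3B stands in for NLLB_3.3B and the reverse Assamese-to-English direction; the generation settings are illustrative, not the authors' values, and the fine-tuning loop itself is elided.

```python
# Sketch of one back-translation pass: monolingual Assamese is translated
# into English to form pseudo-parallel (English, Assamese) training pairs.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CKPT = "facebook/nllb-200-3.3B"  # assumed checkpoint for NLLB_3.3B
tokenizer = AutoTokenizer.from_pretrained(CKPT, src_lang="asm_Beng")
model = AutoModelForSeq2SeqLM.from_pretrained(CKPT)

def back_translate(mono_asm):
    """Translate monolingual Assamese into English for pseudo-parallel pairs."""
    batch = tokenizer(mono_asm, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model.generate(
            **batch,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
            max_length=256,
        )
    english = tokenizer.batch_decode(out, skip_special_tokens=True)
    # Synthetic English sources paired with genuine Assamese targets.
    return list(zip(english, mono_asm))
```

In each iteration, pairs produced this way would be mixed with the official bilingual data for incremental fine-tuning, with the two models alternately supplying the synthetic data, per the cross-iterative scheme the abstract describes.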
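The filtering step in (2) maps onto the sentence-transformers API. A minimal sketch follows, assuming cosine similarity between sentence embeddings and an illustrative 0.5 threshold (the abstract does not state the actual selection criterion), and presumably applied on the English side, since all-mpnet-base-v2 is an English encoder.

```python
# Keep only monolingual sentences semantically close to the test set.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")

def filter_monolingual(mono, test_set, threshold=0.5):
    """Keep sentences whose closest test-set sentence is similar enough."""
    test_emb = encoder.encode(test_set, convert_to_tensor=True, normalize_embeddings=True)
    mono_emb = encoder.encode(mono, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(mono_emb, test_emb)   # shape: (len(mono), len(test_set))
    best = sims.max(dim=1).values             # best test-set match per sentence
    return [s for s, score in zip(mono, best) if score.item() >= threshold]
```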
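The cleaning rules in (3) might look like the sketch below; the regular expressions, the symbol map, and the equality test for untranslated back-translation outputs are assumptions, not the authors' exact rules.

```python
# Per-pair cleaning: strip URLs/HTML, normalize symbols and capitalization,
# and drop pairs the back-translation model left untranslated.
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<[^>]+>")
SYMBOLS = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}  # curly -> straight quotes

def clean_pair(src, tgt):
    """Apply the cleaning steps to one sentence pair; return None to drop it."""
    for pattern in (URL_RE, HTML_RE):
        src, tgt = pattern.sub("", src), pattern.sub("", tgt)
    for raw, norm in SYMBOLS.items():
        src, tgt = src.replace(raw, norm), tgt.replace(raw, norm)
    src, tgt = src.strip(), tgt.strip()
    # An untranslated output simply copies its input.
    if not src or not tgt or src == tgt:
        return None
    # Normalize capitalization of the first letter on the English side.
    return src[0].upper() + src[1:], tgt
```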