SJTU System Description for the WMT24 Low-Resource Languages of Spain Task

Tianxiang Hu, Haoxiang Sun, Ruize Gao, Jialong Tang, Pei Zhang, Baosong Yang, Rui Wang


Abstract
We participate in three translation directions: Spanish to Aragonese, Spanish to Aranese, and Spanish to Asturian. We first run preliminary experiments to assess the basic translation capabilities of various models and to evaluate the impact of fine-tuning with different data types. We then fine-tune the Qwen2-0.5B model on a forward-translated pseudo-corpus synthesized with the Apertium translation system, so that the model replicates Apertium's basic performance. Building on this distilled model, we explore three optimization strategies across the three language directions: (1) assembling the provided FLORES+ dev sets into a 5-shot translation training dataset and performing few-shot fine-tuning to enhance model performance; (2) using the FLORES+ dev sets as training data and applying Contrastive Preference Optimization (CPO) for further refinement; and (3) retrieving the 20 most similar translation examples from the FLORES+ dev sets with the BM25 algorithm and performing 20-shot translation with the Claude 3.5 Sonnet model. After evaluating these strategies, we select the best-performing approach for each language pair as our submission.
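Strategy (3) can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes whitespace tokenization, the commonly used BM25 defaults k1 = 1.5 and b = 0.75, and a simple "Spanish: ... / Target: ..." prompt layout; none of these details are given in the abstract. The names `BM25`, `top_k`, and `build_prompt` are hypothetical, and the call to the translation model itself is omitted.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper's tokenizer is not stated.
    return text.lower().split()


class BM25:
    """Small Okapi BM25 scorer over a list of source-side sentences."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        # k1 and b are common defaults, not values reported by the authors.
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in corpus]
        self.tf = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(t for d in self.docs for t in set(d))
        # Smoothed IDF (the +1 inside the log keeps every weight positive).
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1.0) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tf[i], len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            num = tf[t] * (self.k1 + 1)
            den = tf[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * num / den
        return total

    def top_k(self, query, k):
        # Indices of the k highest-scoring sentences, best first.
        order = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


def build_prompt(source, pairs, bm25, k=20, target_lang="Aragonese"):
    # pairs[i] = (spanish_sentence, target_sentence), aligned with the BM25 corpus.
    blocks = []
    for i in bm25.top_k(source, k):
        src, tgt = pairs[i]
        blocks.append(f"Spanish: {src}\n{target_lang}: {tgt}\n")
    # The sentence to translate goes last, with the target side left open.
    blocks.append(f"Spanish: {source}\n{target_lang}:")
    return "\n".join(blocks)
```

In use, `BM25` would be built over the Spanish side of the FLORES+ dev set, and `build_prompt(sentence, pairs, bm25, k=20)` would produce the 20-shot prompt sent to the model. The same prompt layout with k = 5 could serve as a template for the few-shot fine-tuning data in strategy (1), although the abstract does not say how those five examples were chosen.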
Anthology ID:
2024.wmt-1.92
Volume:
Proceedings of the Ninth Conference on Machine Translation
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
943–948
URL:
https://preview.aclanthology.org/build-pipeline-with-new-library/2024.wmt-1.92/
DOI:
10.18653/v1/2024.wmt-1.92
Cite (ACL):
Tianxiang Hu, Haoxiang Sun, Ruize Gao, Jialong Tang, Pei Zhang, Baosong Yang, and Rui Wang. 2024. SJTU System Description for the WMT24 Low-Resource Languages of Spain Task. In Proceedings of the Ninth Conference on Machine Translation, pages 943–948, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
SJTU System Description for the WMT24 Low-Resource Languages of Spain Task (Hu et al., WMT 2024)
PDF:
https://preview.aclanthology.org/build-pipeline-with-new-library/2024.wmt-1.92.pdf