Abstract
We present the results of our constrained submission to the WMT 2024 shared task, which focuses on translating from Spanish into two low-resource languages of Spain: Aranese (spa-arn) and Aragonese (spa-arg). Our system combines real data with synthetic data generated by large language models (e.g., BLOOMZ) and rule-based Apertium translation systems. Built upon the pre-trained NLLB system, our translation model uses a multistage approach, progressively refining the initial model through the sequential use of different datasets, starting with large-scale synthetic or crawled data and advancing to smaller, high-quality parallel corpora. This approach resulted in BLEU scores of 30.1 for Spanish to Aranese and 61.9 for Spanish to Aragonese.
- Anthology ID:
- 2024.wmt-1.82
- Volume:
- Proceedings of the Ninth Conference on Machine Translation
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
- Venue:
- WMT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 862–870
- URL:
- https://preview.aclanthology.org/add_missing_videos/2024.wmt-1.82/
- DOI:
- 10.18653/v1/2024.wmt-1.82
- Cite (ACL):
- Jonathan Mutal and Lucía Ormaechea. 2024. TIM-UNIGE Translation into Low-Resource Languages of Spain for WMT24. In Proceedings of the Ninth Conference on Machine Translation, pages 862–870, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- TIM-UNIGE Translation into Low-Resource Languages of Spain for WMT24 (Mutal & Ormaechea, WMT 2024)
- PDF:
- https://preview.aclanthology.org/add_missing_videos/2024.wmt-1.82.pdf