Abstract
This paper presents the Seed-CAT submission to the WMT24 Open Language Data Initiative shared task. We detail our data collection method, which involves a computer-aided translation tool developed explicitly for translating Seed corpora. We release a professionally translated Spanish corpus and a provenance dataset documenting the translation process. The quality of the data was validated on the FLORES+ benchmark with English-Spanish neural machine translation models, achieving an average chrF++ score of 34.9.
- Anthology ID:
- 2024.wmt-1.50
- Volume:
- Proceedings of the Ninth Conference on Machine Translation
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
- Venue:
- WMT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 624–635
- URL:
- https://aclanthology.org/2024.wmt-1.50
- DOI:
- 10.18653/v1/2024.wmt-1.50
- Cite (ACL):
- Jose Cols. 2024. Spanish Corpus and Provenance with Computer-Aided Translation for the WMT24 OLDI Shared Task. In Proceedings of the Ninth Conference on Machine Translation, pages 624–635, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- Spanish Corpus and Provenance with Computer-Aided Translation for the WMT24 OLDI Shared Task (Cols, WMT 2024)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2024.wmt-1.50.pdf