Abstract
We describe the AST submission for the CoCo4MT 2023 shared task. The aim of the task is to identify the best candidates for translation in a source data set, so that the translated parallel data can be used for fine-tuning the mBART-50 model. We experiment with three methods: scoring sentences based on n-gram coverage, using LaBSE to estimate semantic similarity, and identifying misalignments and mistranslations by comparing machine-translated source sentences to the corresponding manually translated segments in high-resource languages. We obtain the best results by combining these three methods, using LaBSE and machine translation for filtering and one of our n-gram scoring approaches for ordering sentences.
- Anthology ID:
- 2023.mtsummit-coco4mt.5
- Volume:
- Proceedings of the Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation
- Month:
- September
- Year:
- 2023
- Address:
- Macau SAR, China
- Venue:
- MTSummit
- Publisher:
- Asia-Pacific Association for Machine Translation
- Pages:
- 33–38
- URL:
- https://aclanthology.org/2023.mtsummit-coco4mt.5
- Cite (ACL):
- Steinþór Steingrímsson. 2023. The AST Submission for the CoCo4MT 2023 Shared Task on Corpus Construction for Low-Resource Machine Translation. In Proceedings of the Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation, pages 33–38, Macau SAR, China. Asia-Pacific Association for Machine Translation.
- Cite (Informal):
- The AST Submission for the CoCo4MT 2023 Shared Task on Corpus Construction for Low-Resource Machine Translation (Steingrímsson, MTSummit 2023)
- PDF:
- https://preview.aclanthology.org/emnlp-22-attachments/2023.mtsummit-coco4mt.5.pdf