Human evaluation of web-crawled parallel corpora for machine translation

Gema Ramírez-Sánchez, Marta Bañón, Jaume Zaragoza-Bernabeu, Sergio Ortiz Rojas


Abstract
Quality assessment has been an ongoing activity throughout the series of ParaCrawl efforts to crawl massive amounts of parallel data from multilingual websites for 29 languages. The goal of ParaCrawl is to obtain parallel data that is useful for machine translation. To verify this, both automatic (extrinsic) and human (intrinsic and extrinsic) evaluation tasks have been included in the project's quality assessment activity. We summarize the methods used to address these evaluation tasks for the web-crawled corpora produced, together with their results. We review their advantages and disadvantages with respect to the final goal of the ParaCrawl project and the related ongoing project MaCoCu.
Anthology ID:
2022.humeval-1.4
Volume:
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Venue:
HumEval
Publisher:
Association for Computational Linguistics
Pages:
32–41
URL:
https://aclanthology.org/2022.humeval-1.4
DOI:
10.18653/v1/2022.humeval-1.4
Cite (ACL):
Gema Ramírez-Sánchez, Marta Bañón, Jaume Zaragoza-Bernabeu, and Sergio Ortiz Rojas. 2022. Human evaluation of web-crawled parallel corpora for machine translation. In Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval), pages 32–41, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Human evaluation of web-crawled parallel corpora for machine translation (Ramírez-Sánchez et al., HumEval 2022)
PDF:
https://preview.aclanthology.org/nodalida-main-page/2022.humeval-1.4.pdf
Video:
https://preview.aclanthology.org/nodalida-main-page/2022.humeval-1.4.mp4
Data
ParaCrawl