A Post-Editing Dataset in the Legal Domain: Do we Underestimate Neural Machine Translation Quality?

Julia Ive, Lucia Specia, Sara Szoc, Tom Vanallemeersch, Joachim Van den Bogaert, Eduardo Farah, Christine Maroti, Artur Ventura, Maxim Khalilov

Abstract

We introduce a machine translation dataset for three pairs of languages in the legal domain with post-edited high-quality neural machine translation and independent human references. The data was collected as part of the EU APE-QUEST project and comprises crawled content from EU websites with translation from English into three European languages: Dutch, French and Portuguese. Altogether, the data consists of around 31K tuples including a source sentence, the respective machine translation by a neural machine translation system, a post-edited version of such translation by a professional translator, and - where available - the original reference translation crawled from parallel language websites. We describe the data collection process, provide an analysis of the resulting post-edits and benchmark the data using state-of-the-art quality estimation and automatic post-editing models. One interesting by-product of our post-editing analysis suggests that neural systems built with publicly available general domain data can provide high-quality translations, even though comparison to human references suggests that this quality is quite low. This makes our dataset a suitable candidate to test evaluation metrics. The data is freely available as an ELRC-SHARE resource.

Anthology ID: 2020.lrec-1.455
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 3692–3697
Language: English
URL: https://aclanthology.org/2020.lrec-1.455
Cite (ACL): Julia Ive, Lucia Specia, Sara Szoc, Tom Vanallemeersch, Joachim Van den Bogaert, Eduardo Farah, Christine Maroti, Artur Ventura, and Maxim Khalilov. 2020. A Post-Editing Dataset in the Legal Domain: Do we Underestimate Neural Machine Translation Quality?. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3692–3697, Marseille, France. European Language Resources Association.
Cite (Informal): A Post-Editing Dataset in the Legal Domain: Do we Underestimate Neural Machine Translation Quality? (Ive et al., LREC 2020)
PDF: https://preview.aclanthology.org/emnlp22-frontmatter/2020.lrec-1.455.pdf