Vivien Macketanz


A Linguistically Motivated Test Suite to Semi-Automatically Evaluate German–English Machine Translation Output
Vivien Macketanz | Eleftherios Avramidis | Aljoscha Burchardt | He Wang | Renlong Ai | Shushen Manakhimova | Ursula Strohriegel | Sebastian Möller | Hans Uszkoreit
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper presents a fine-grained test suite for the language pair German–English. The test suite is based on a number of linguistically motivated categories and phenomena and the semi-automatic evaluation is carried out with regular expressions. We describe the creation and implementation of the test suite in detail, providing a full list of all categories and phenomena. Furthermore, we present various exemplary applications of our test suite that have been implemented in the past years, like contributions to the Conference of Machine Translation, the usage of the test suite and MT outputs for quality estimation, and the expansion of the test suite to the language pair Portuguese–English. We describe how we tracked the development of the performance of various systems MT systems over the years with the help of the test suite and which categories and phenomena are prone to resulting in MT errors. For the first time, we also make a large part of our test suite publicly available to the research community.

Perceptual Quality Dimensions of Machine-Generated Text with a Focus on Machine Translation
Vivien Macketanz | Babak Naderi | Steven Schmidt | Sebastian Möller
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)

The quality of machine-generated text is a complex construct consisting of various aspects and dimensions. We present a study that aims to uncover relevant perceptual quality dimensions for one type of machine-generated text, that is, Machine Translation. We conducted a crowdsourcing survey in the style of a Semantic Differential to collect attribute ratings for German MT outputs. An Exploratory Factor Analysis revealed the underlying perceptual dimensions. As a result, we extracted four factors that operate as relevant dimensions for the Quality of Experience of MT outputs: precision, complexity, grammaticality, and transparency.

Linguistically Motivated Evaluation of the 2022 State-of-the-art Machine Translation Systems for Three Language Directions
Vivien Macketanz | Shushen Manakhimova | Eleftherios Avramidis | Ekaterina Lapshinova-koltunski | Sergei Bagdasarov | Sebastian Möller
Proceedings of the Seventh Conference on Machine Translation (WMT)

This document describes a fine-grained linguistically motivated analysis of 29 machine translation systems submitted at the Shared Task of the 7th Conference of Machine Translation (WMT22). This submission expands the test suite work of previous years by adding the language direction of English–Russian. As a result, evaluation takes place for the language directions of German–English, English–German, and English–Russian. We find that the German–English systems suffer in translating idioms, some tenses of modal verbs, and resultative predicates, the English–German ones in idioms, transitive-past progressive, and middle voice, whereas the English–Russian ones in pseudogapping and idioms.

Linguistically Motivated Evaluation of Machine Translation Metrics Based on a Challenge Set
Eleftherios Avramidis | Vivien Macketanz
Proceedings of the Seventh Conference on Machine Translation (WMT)

We employ a linguistically motivated challenge set in order to evaluate the state-of-the-art machine translation metrics submitted to the Metrics Shared Task of the 7th Conference for Machine Translation. The challenge set includes about 20,000 items extracted from 145 MT systems for two language directions (German-English, English-German), covering more than 100 linguistically-motivated phenomena organized in 14 categories. The best performing metrics are YiSi-1, BERTScore and COMET-22 for German-English, and UniTE, UniTE-ref, XL-DA and xxl-DA19 for English-German.Metrics in both directions are performing worst when it comes to named-entities & terminology and particularly measuring units. Particularly in German-English they are weak at detecting issues at punctuation, polar questions, relative clauses, dates and idioms. In English-German, they perform worst at present progressive of transitive verbs, future II progressive of intransitive verbs, simple present perfect of ditransitive verbs and focus particles.


Observing the Learning Curve of NMT Systems With Regard to Linguistic Phenomena
Patrick Stadler | Vivien Macketanz | Eleftherios Avramidis
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

In this paper we present our observations and evaluations by observing the linguistic performance of the system on several steps on the training process of various English-to-German Neural Machine Translation models. The linguistic performance is measured through a semi-automatic process using a test suite. Among several linguistic observations, we find that the translation quality of some linguistic categories decreased within the recorded iterations. Additionally, we notice some drops of the translation quality of certain categories when using a larger corpus.

Linguistic Evaluation for the 2021 State-of-the-art Machine Translation Systems for German to English and English to German
Vivien Macketanz | Eleftherios Avramidis | Shushen Manakhimova | Sebastian Möller
Proceedings of the Sixth Conference on Machine Translation

We are using a semi-automated test suite in order to provide a fine-grained linguistic evaluation for state-of-the-art machine translation systems. The evaluation includes 18 German to English and 18 English to German systems, submitted to the Translation Shared Task of the 2021 Conference on Machine Translation. Our submission adds up to the submissions of the previous years by creating and applying a wide-range test suite for English to German as a new language pair. The fine-grained evaluation allows spotting significant differences between systems that cannot be distinguished by the direct assessment of the human evaluation campaign. We find that most of the systems achieve good accuracies in the majority of linguistic phenomena but there are few phenomena with lower accuracy, such as the idioms, the modal pluperfect and the German resultative predicates. Two systems have significantly better test suite accuracy in macro-average in every language direction, Online-W and Facebook-AI for German to English and VolcTrans and Online-W for English to German. The systems show a steady improvement as compared to previous years.


Fine-grained linguistic evaluation for state-of-the-art Machine Translation
Eleftherios Avramidis | Vivien Macketanz | Ursula Strohriegel | Aljoscha Burchardt | Sebastian Möller
Proceedings of the Fifth Conference on Machine Translation

This paper describes a test suite submission providing detailed statistics of linguistic performance for the state-of-the-art German-English systems of the Fifth Conference of Machine Translation (WMT20). The analysis covers 107 phenomena organized in 14 categories based on about 5,500 test items, including a manual annotation effort of 45 person hours. Two systems (Tohoku and Huoshan) appear to have significantly better test suite accuracy than the others, although the best system of WMT20 is not significantly better than the one from WMT19 in a macro-average. Additionally, we identify some linguistic phenomena where all systems suffer (such as idioms, resultative predicates and pluperfect), but we are also able to identify particular weaknesses for individual systems (such as quotation marks, lexical ambiguity and sluicing). Most of the systems of WMT19 which submitted new versions this year show improvements.


Train, Sort, Explain: Learning to Diagnose Translation Models
Robert Schwarzenberg | David Harbecke | Vivien Macketanz | Eleftherios Avramidis | Sebastian Möller
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)

Evaluating translation models is a trade-off between effort and detail. On the one end of the spectrum there are automatic count-based methods such as BLEU, on the other end linguistic evaluations by humans, which arguably are more informative but also require a disproportionately high effort. To narrow the spectrum, we propose a general approach on how to automatically expose systematic differences between human and machine translations to human experts. Inspired by adversarial settings, we train a neural text classifier to distinguish human from machine translations. A classifier that performs and generalizes well after training should recognize systematic differences between the two classes, which we uncover with neural explainability methods. Our proof-of-concept implementation, DiaMaT, is open source. Applied to a dataset translated by a state-of-the-art neural Transformer model, DiaMaT achieves a classification accuracy of 75% and exposes meaningful differences between humans and the Transformer, amidst the current discussion about human parity.

Linguistic Evaluation of German-English Machine Translation Using a Test Suite
Eleftherios Avramidis | Vivien Macketanz | Ursula Strohriegel | Hans Uszkoreit
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

We present the results of the application of a grammatical test suite for German-to-English MT on the systems submitted at WMT19, with a detailed analysis for 107 phenomena organized in 14 categories. The systems still translate wrong one out of four test items in average. Low performance is indicated for idioms, modals, pseudo-clefts, multi-word expressions and verb valency. When compared to last year, there has been a improvement of function words, non verbal agreement and punctuation. More detailed conclusions about particular systems and phenomena are also presented.


TQ-AutoTest – An Automated Test Suite for (Machine) Translation Quality
Vivien Macketanz | Renlong Ai | Aljoscha Burchardt | Hans Uszkoreit
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Fine-grained evaluation of Quality Estimation for Machine translation based on a linguistically motivated Test Suite
Eleftherios Avramidis | Vivien Macketanz | Arle Lommel | Hans Uszkoreit
Proceedings of the AMTA 2018 Workshop on Translation Quality Estimation and Automatic Post-Editing

Fine-grained evaluation of German-English Machine Translation based on a Test Suite
Vivien Macketanz | Eleftherios Avramidis | Aljoscha Burchardt | Hans Uszkoreit
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We present an analysis of 16 state-of-the-art MT systems on German-English based on a linguistically-motivated test suite. The test suite has been devised manually by a team of language professionals in order to cover a broad variety of linguistic phenomena that MT often fails to translate properly. It contains 5,000 test sentences covering 106 linguistic phenomena in 14 categories, with an increased focus on verb tenses, aspects and moods. The MT outputs are evaluated in a semi-automatic way through regular expressions that focus only on the part of the sentence that is relevant to each phenomenon. Through our analysis, we are able to compare systems based on their performance on these categories. Additionally, we reveal strengths and weaknesses of particular systems and we identify grammatical phenomena where the overall performance of MT is relatively low.


pdf bib
CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Daniel Zeman | Martin Popel | Milan Straka | Jan Hajič | Joakim Nivre | Filip Ginter | Juhani Luotolahti | Sampo Pyysalo | Slav Petrov | Martin Potthast | Francis Tyers | Elena Badmaeva | Memduh Gokirmak | Anna Nedoluzhko | Silvie Cinková | Jan Hajič jr. | Jaroslava Hlaváčová | Václava Kettnerová | Zdeňka Urešová | Jenna Kanerva | Stina Ojala | Anna Missilä | Christopher D. Manning | Sebastian Schuster | Siva Reddy | Dima Taji | Nizar Habash | Herman Leung | Marie-Catherine de Marneffe | Manuela Sanguinetti | Maria Simi | Hiroshi Kanayama | Valeria de Paiva | Kira Droganova | Héctor Martínez Alonso | Çağrı Çöltekin | Umut Sulubacak | Hans Uszkoreit | Vivien Macketanz | Aljoscha Burchardt | Kim Harris | Katrin Marheinecke | Georg Rehm | Tolga Kayadelen | Mohammed Attia | Ali Elkahky | Zhuoran Yu | Emily Pitler | Saran Lertpradit | Michael Mandl | Jesse Kirchner | Hector Fernandez Alcalde | Jana Strnadová | Esha Banerjee | Ruli Manurung | Antonio Stella | Atsuko Shimada | Sookyoung Kwak | Gustavo Mendonça | Tatiana Lando | Rattima Nitisaroj | Josie Li
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.


DFKI’s system for WMT16 IT-domain task, including analysis of systematic errors
Eleftherios Avramidis | Aljoscha Burchardt | Vivien Macketanz | Ankit Srivastava
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Deeper Machine Translation and Evaluation for German
Eleftherios Avramidis | Vivien Macketanz | Aljoscha Burchardt | Jindrich Helcl | Hans Uszkoreit
Proceedings of the 2nd Deep Machine Translation Workshop