@inproceedings{popovic-2019-evaluating,
    title = "Evaluating Conjunction Disambiguation on {E}nglish-to-{G}erman and {F}rench-to-{G}erman {WMT} 2019 Translation Hypotheses",
    author = "Popovi{\'c}, Maja",
    editor = "Bojar, Ond{\v{r}}ej  and
      Chatterjee, Rajen  and
      Federmann, Christian  and
      Fishel, Mark  and
      Graham, Yvette  and
      Haddow, Barry  and
      Huck, Matthias  and
      Yepes, Antonio Jimeno  and
      Koehn, Philipp  and
      Martins, Andr{\'e}  and
      Monz, Christof  and
      Negri, Matteo  and
      N{\'e}v{\'e}ol, Aur{\'e}lie  and
      Neves, Mariana  and
      Post, Matt  and
      Turchi, Marco  and
      Verspoor, Karin",
    booktitle = "Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/iwcs-25-ingestion/W19-5353/",
    doi = "10.18653/v1/W19-5353",
    pages = "464--469",
    abstract = "We present a test set for evaluating an MT system{'}s capability to translate ambiguous conjunctions depending on the sentence structure. We concentrate on the English conjunction ``but'' and its French equivalent ``mais'' which can be translated into two different German conjunctions. We evaluate all English-to-German and French-to-German submissions to the WMT 2019 shared translation task. The evaluation is done mainly automatically, with additional fast manual inspection of unclear cases. All systems almost perfectly recognise the target conjunction ``aber'', whereas accuracies for the other target conjunction ``sondern'' range from 78{\%} to 97{\%}, and the errors are mostly caused by replacing it with the alternative conjunction ``aber''. The best performing system for both language pairs is a multilingual Transformer ``TartuNLP'' system trained on all WMT 2019 language pairs which use the Latin script, indicating that the multilingual approach is beneficial for conjunction disambiguation. As for other system features, such as using synthetic back-translated data, context-aware, hybrid, etc., no particular (dis)advantages can be observed. Qualitative manual inspection of translation hypotheses showed that highly ranked systems generally produce translations with high adequacy and fluency, meaning that these systems are not only capable of capturing the right conjunction whereas the rest of the translation hypothesis is poor. On the other hand, the low ranked systems generally exhibit lower fluency and poor adequacy."
}