Shades of BLEU, Flavours of Success: The Case of MultiWOZ

Tomáš Nekvinda; Ondřej Dušek

doi:10.18653/v1/2021.gem-1.4

Abstract

The MultiWOZ dataset (Budzianowski et al., 2018) is frequently used for benchmarking context-to-response abilities of task-oriented dialogue systems. In this work, we identify inconsistencies in data preprocessing and reporting of three corpus-based metrics used on this dataset, i.e., BLEU score and Inform & Success rates. We point out a few problems of the MultiWOZ benchmark such as unsatisfactory preprocessing, insufficient or under-specified evaluation metrics, or rigid database. We re-evaluate 7 end-to-end and 6 policy optimization models in as-fair-as-possible setups, and we show that their reported scores cannot be directly compared. To facilitate comparison of future systems, we release our stand-alone standardized evaluation scripts. We also give basic recommendations for corpus-based benchmarking in future works.

Anthology ID: 2021.gem-1.4
Volume: Proceedings of the First Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Month: August
Year: 2021
Address: Online
Editors: Antoine Bosselut, Esin Durmus, Varun Prashant Gangal, Sebastian Gehrmann, Yacine Jernite, Laura Perez-Beltrachini, Samira Shaikh, Wei Xu
Venue: GEM
SIG: SIGGEN
Publisher: Association for Computational Linguistics
Pages: 34–46
URL: https://preview.aclanthology.org/ingest-emnlp/2021.gem-1.4/
DOI: 10.18653/v1/2021.gem-1.4
Cite (ACL): Tomáš Nekvinda and Ondřej Dušek. 2021. Shades of BLEU, Flavours of Success: The Case of MultiWOZ. In Proceedings of the First Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 34–46, Online. Association for Computational Linguistics.
Cite (Informal): Shades of BLEU, Flavours of Success: The Case of MultiWOZ (Nekvinda & Dušek, GEM 2021)
PDF: https://preview.aclanthology.org/ingest-emnlp/2021.gem-1.4.pdf