Abstract
The MultiWOZ dataset (Budzianowski et al., 2018) is frequently used for benchmarking context-to-response abilities of task-oriented dialogue systems. In this work, we identify inconsistencies in data preprocessing and reporting of three corpus-based metrics used on this dataset, i.e., BLEU score and Inform & Success rates. We point out a few problems of the MultiWOZ benchmark such as unsatisfactory preprocessing, insufficient or underspecified evaluation metrics, or rigid database. We re-evaluate 7 end-to-end and 6 policy optimization models in as-fair-as-possible setups, and we show that their reported scores cannot be directly compared. To facilitate comparison of future systems, we release our stand-alone standardized evaluation scripts. We also give basic recommendations for corpus-based benchmarking in future works.
- Anthology ID:
- 2021.gem-1.4
- Volume:
- Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)
- Month:
- August
- Year:
- 2021
- Address:
- Online
- Editors:
- Antoine Bosselut, Esin Durmus, Varun Prashant Gangal, Sebastian Gehrmann, Yacine Jernite, Laura Perez-Beltrachini, Samira Shaikh, Wei Xu
- Venue:
- GEM
- SIG:
- SIGGEN
- Publisher:
- Association for Computational Linguistics
- Pages:
- 34–46
- URL:
- https://aclanthology.org/2021.gem-1.4
- DOI:
- 10.18653/v1/2021.gem-1.4
- Cite (ACL):
- Tomáš Nekvinda and Ondřej Dušek. 2021. Shades of BLEU, Flavours of Success: The Case of MultiWOZ. In Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021), pages 34–46, Online. Association for Computational Linguistics.
- Cite (Informal):
- Shades of BLEU, Flavours of Success: The Case of MultiWOZ (Nekvinda & Dušek, GEM 2021)
- PDF:
- https://preview.aclanthology.org/emnlp22-frontmatter/2021.gem-1.4.pdf
- Code:
- Tomiinek/MultiWOZ_Evaluation
- Data:
- MultiWOZ