Abstract
Evaluating the quality of text produced by a generative dialogue system is difficult. Currently, dialogue evaluation relies on human judges to label the quality of the generated text, which is not a reusable mechanism that gives system developers consistent evaluations. We believe it is easier to obtain consistent results when comparing the dialogues generated by two systems than when assigning a quality score to a single system at a time. In this paper, we propose a machine learning approach that reduces the effort of human evaluation by learning human judgments on comparisons of two dialogue systems. Trained on the human labeling results, the evaluation model learns which generative model is better in each dialogue context. It can therefore be used by system developers to compare fine-tuned models repeatedly without further human labor. In our experiment, the agreement between the learned model and the human judges is 70%. The experiment is conducted by comparing two attention-based GRU-RNN generative models.
- Anthology ID:
- 2020.lrec-1.198
- Volume:
- Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 1598–1602
- Language:
- English
- URL:
- https://aclanthology.org/2020.lrec-1.198
- Cite (ACL):
- Shih-Hung Wu and Sheng-Lun Chien. 2020. Learning the Human Judgment for the Automatic Evaluation of Chatbot. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1598–1602, Marseille, France. European Language Resources Association.
- Cite (Informal):
- Learning the Human Judgment for the Automatic Evaluation of Chatbot (Wu & Chien, LREC 2020)
- PDF:
- https://preview.aclanthology.org/auto-file-uploads/2020.lrec-1.198.pdf
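The pairwise evaluation idea in the abstract, learning from human labels which of two systems produced the better response in a given dialogue context, can be sketched as a preference classifier. The paper does not specify its model or features here, so everything below (the word-overlap features, the perceptron learner, the toy data) is an illustrative assumption, not the authors' method:

```python
# Hedged sketch of learning human pairwise judgments: given a dialogue
# context and two candidate responses, predict which one a human would
# prefer. Features and learner are toy stand-ins, not the paper's model.
from collections import Counter

def features(context, response):
    """Toy features: distinct-word overlap with the context, response length."""
    ctx = Counter(context.lower().split())
    rsp = Counter(response.lower().split())
    overlap = sum((ctx & rsp).values())  # shared distinct words
    return [overlap, len(rsp)]

def pair_features(context, resp_a, resp_b):
    """Difference of the two responses' feature vectors (A minus B)."""
    fa, fb = features(context, resp_a), features(context, resp_b)
    return [a - b for a, b in zip(fa, fb)]

def train(data, epochs=20):
    """Perceptron over pairwise labels: +1 if A was preferred, -1 if B."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for ctx, a, b, y in data:
            x = pair_features(ctx, a, b)
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:  # misclassified pair: perceptron update
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, context, resp_a, resp_b):
    """Return which response the learned model judges better."""
    x = pair_features(context, resp_a, resp_b)
    return "A" if sum(wi * xi for wi, xi in zip(w, x)) > 0 else "B"

# Hypothetical human-labeled comparisons (label +1: A preferred, -1: B).
data = [
    ("how is the weather today", "the weather is sunny today", "i like pizza", 1),
    ("do you like pizza", "the weather is sunny", "yes i like pizza a lot", -1),
]
w = train(data)
print(predict(w, "what is your name", "my name is bot", "bananas are yellow"))
```

Once trained, such a model plays the role of the reusable judge described in the abstract: developers can rerun it on new fine-tuned model pairs without repeating the human labeling.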