Learning Efficient Dialogue Policy from Demonstrations through Shaping

Huimin Wang, Baolin Peng, Kam-Fai Wong


Abstract
Training a task-oriented dialogue agent with reinforcement learning is prohibitively expensive since it requires a large volume of interactions with users. Human demonstrations can be used to accelerate learning progress. However, how to effectively leverage demonstrations to learn dialogue policy remains under-explored. In this paper, we present S²Agent, which efficiently learns dialogue policy from demonstrations through policy shaping and reward shaping. We use an imitation model to distill knowledge from demonstrations, based on which policy shaping estimates feedback on how the agent should act in policy space. Reward shaping is then incorporated to explicitly reward state-actions similar to demonstrations in value space, encouraging better exploration. The effectiveness of the proposed S²Agent is demonstrated in three dialogue domains and a challenging domain adaptation task, with both user simulator evaluation and human evaluation.
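The abstract's two mechanisms can be illustrated with a minimal sketch. All names below (`ImitationModel`, `shape_policy`, `shaped_reward`, the mixing weight and bonus scale) are hypothetical, and the frequency-table imitation model stands in for the learned model the paper actually uses; this is an assumption-laden illustration of the general idea, not the paper's implementation.

```python
from collections import defaultdict

class ImitationModel:
    """Distills demonstrations into per-state action frequencies
    (a stand-in for a learned imitation model)."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, demonstrations):
        # demonstrations: iterable of (state, action) pairs
        for state, action in demonstrations:
            self.counts[state][action] += 1

    def prob(self, state, action):
        total = sum(self.counts[state].values())
        return self.counts[state][action] / total if total else 0.0

def shape_policy(agent_probs, imitation, state, mix=0.5):
    """Policy shaping: mix the agent's action distribution with the
    imitation model's estimate in policy space, then renormalise."""
    mixed = {a: (1 - mix) * p + mix * imitation.prob(state, a)
             for a, p in agent_probs.items()}
    z = sum(mixed.values())
    return {a: p / z for a, p in mixed.items()}

def shaped_reward(env_reward, imitation, state, action, bonus=0.2):
    """Reward shaping: add a bonus in value space for state-actions
    that resemble the demonstrations."""
    return env_reward + bonus * imitation.prob(state, action)
```

For example, after fitting on demonstrations where `request` is the majority action in state `s0`, `shape_policy` tilts a uniform agent distribution toward `request`, and `shaped_reward` pays a larger bonus for choosing it.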
Anthology ID:
2020.acl-main.566
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Venue:
ACL
Publisher:
Association for Computational Linguistics
Note:
Pages:
6355–6365
URL:
https://aclanthology.org/2020.acl-main.566
DOI:
10.18653/v1/2020.acl-main.566
Cite (ACL):
Huimin Wang, Baolin Peng, and Kam-Fai Wong. 2020. Learning Efficient Dialogue Policy from Demonstrations through Shaping. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6355–6365, Online. Association for Computational Linguistics.
Cite (Informal):
Learning Efficient Dialogue Policy from Demonstrations through Shaping (Wang et al., ACL 2020)
PDF:
https://preview.aclanthology.org/paclic-22-ingestion/2020.acl-main.566.pdf
Video:
http://slideslive.com/38928995