[CASPI] Causal-aware Safe Policy Improvement for Task-oriented Dialogue

Govardana Sachithanandam Ramachandran, Kazuma Hashimoto, Caiming Xiong


Abstract
The recent success of reinforcement learning (RL) in solving complex tasks is often attributed to its capacity to explore and exploit an environment. Sample efficiency is usually not an issue for tasks with cheap simulators to sample data online. On the other hand, Task-oriented Dialogues (ToD) are usually learnt from offline data collected using human demonstrations. Collecting diverse demonstrations and annotating them is expensive. Unfortunately, RL policy trained on off-policy data are prone to issues of bias and generalization, which are further exacerbated by stochasticity in human response and non-markovian nature of annotated belief state of a dialogue management system. To this end, we propose a batch-RL framework for ToD policy learning: Causal-aware Safe Policy Improvement (CASPI). CASPI includes a mechanism to learn fine-grained reward that captures intention behind human response and also offers guarantee on dialogue policy’s performance against a baseline. We demonstrate the effectiveness of this framework on end-to-end dialogue task of the Multiwoz2.0 dataset. The proposed method outperforms the current state of the art. Further more we demonstrate sample efficiency, where our method trained only on 20% of the data, are comparable to current state of the art method trained on 100% data on two out of there evaluation metrics.
Anthology ID:
2022.acl-long.8
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
92–102
Language:
URL:
https://aclanthology.org/2022.acl-long.8
DOI:
10.18653/v1/2022.acl-long.8
Bibkey:
Cite (ACL):
Govardana Sachithanandam Ramachandran, Kazuma Hashimoto, and Caiming Xiong. 2022. [CASPI] Causal-aware Safe Policy Improvement for Task-oriented Dialogue. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 92–102, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
[CASPI] Causal-aware Safe Policy Improvement for Task-oriented Dialogue (Ramachandran et al., ACL 2022)
Copy Citation:
PDF:
https://preview.aclanthology.org/add_acl24_videos/2022.acl-long.8.pdf
Data
MultiWOZ