@inproceedings{nguyen-etal-2017-reinforcement,
title = "Reinforcement Learning for Bandit Neural Machine Translation with Simulated Human Feedback",
author = "Nguyen, Khanh and
Daum{\'e} III, Hal and
Boyd-Graber, Jordan",
editor = "Palmer, Martha and
Hwa, Rebecca and
Riedel, Sebastian",
booktitle = "Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing",
month = sep,
year = "2017",
address = "Copenhagen, Denmark",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/jlcl-multiple-ingestion/D17-1153/",
doi = "10.18653/v1/D17-1153",
pages = "1464--1474",
abstract = "Machine translation is a natural candidate problem for reinforcement learning from human feedback: users provide quick, dirty ratings on candidate translations to guide a system to improve. Yet, current neural machine translation training focuses on expensive human-generated reference translations. We describe a reinforcement learning algorithm that improves neural machine translation systems from simulated human feedback. Our algorithm combines the advantage actor-critic algorithm (Mnih et al., 2016) with the attention-based neural encoder-decoder architecture (Luong et al., 2015). This algorithm (a) is well-designed for problems with a large action space and delayed rewards, (b) effectively optimizes traditional corpus-level machine translation metrics, and (c) is robust to skewed, high-variance, granular feedback modeled after actual human behaviors."
}
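
As a rough illustration of the training setup the abstract describes, the sketch below performs one advantage actor-critic update for bandit NMT. This is not the authors' code: the model sizes, the BOS index, and the simulated_rating function are toy assumptions standing in for the attention-based encoder-decoder of Luong et al. (2015) and the paper's human-feedback simulator. The key points it shows are that the policy samples a translation (rather than decoding greedily), the only learning signal is one scalar rating per sentence, and the critic's value estimate serves as the baseline in the policy gradient.

    # Illustrative sketch only: one advantage actor-critic update from
    # simulated bandit feedback. All sizes and the reward model are toy
    # placeholders, not the paper's configuration.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB, HID, MAX_LEN = 100, 64, 10  # toy sizes (assumptions)

    class TinySeq2Seq(nn.Module):
        """Minimal encoder-decoder policy with an actor head (next-token
        logits) and a critic head (expected reward)."""
        def __init__(self):
            super().__init__()
            self.src_emb = nn.Embedding(VOCAB, HID)
            self.tgt_emb = nn.Embedding(VOCAB, HID)
            self.encoder = nn.GRU(HID, HID, batch_first=True)
            self.decoder = nn.GRUCell(HID, HID)
            self.out = nn.Linear(HID, VOCAB)   # actor: next-token logits
            self.value = nn.Linear(HID, 1)     # critic: expected reward

        def forward(self, src):
            _, h = self.encoder(self.src_emb(src))
            h = h.squeeze(0)
            tok = torch.zeros(src.size(0), dtype=torch.long)  # BOS = 0 (assumption)
            log_probs, values, tokens = [], [], []
            for _ in range(MAX_LEN):
                h = self.decoder(self.tgt_emb(tok), h)
                dist = torch.distributions.Categorical(logits=self.out(h))
                tok = dist.sample()                    # sample, don't argmax
                log_probs.append(dist.log_prob(tok))
                values.append(self.value(h).squeeze(-1))
                tokens.append(tok)
            return (torch.stack(tokens, 1),
                    torch.stack(log_probs, 1),
                    torch.stack(values, 1))

    def simulated_rating(hyp, ref):
        """Stand-in for simulated human feedback: a noisy, coarsely
        quantized per-sentence score (here, token overlap plus noise)."""
        overlap = (hyp == ref).float().mean(dim=1)
        noisy = overlap + 0.1 * torch.randn_like(overlap)
        return (noisy * 5).round() / 5                 # granular 0.2-step ratings

    model = TinySeq2Seq()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    src = torch.randint(1, VOCAB, (8, MAX_LEN))        # fake source batch
    ref = torch.randint(1, VOCAB, (8, MAX_LEN))        # references seen only by the simulator

    hyp, logp, val = model(src)
    reward = simulated_rating(hyp, ref).unsqueeze(1)   # one scalar per sentence
    advantage = reward - val.detach()                  # critic as baseline
    actor_loss = -(advantage * logp).mean()            # policy-gradient term
    critic_loss = F.mse_loss(val, reward.expand_as(val))
    opt.zero_grad()
    (actor_loss + critic_loss).backward()
    opt.step()

A full system would run many such updates over a bandit-structured corpus and evaluate with corpus-level MT metrics, as the paper reports; this sketch only isolates the shape of a single update.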