Mathieu Rita
2024
Countering Reward Over-Optimization in LLM with Demonstration-Guided Reinforcement Learning
Mathieu Rita | Florian Strub | Rahma Chaabouni | Paul Michel | Emmanuel Dupoux | Olivier Pietquin
Findings of the Association for Computational Linguistics: ACL 2024
While reinforcement learning (RL) has been proven essential for tuning large language models (LLMs), it can lead to reward over-optimization (ROO). Existing approaches address ROO by adding KL regularization, which requires computationally expensive hyperparameter tuning. Moreover, KL regularization focuses solely on regularizing the language policy, neglecting a potential source of regularization: the reward function itself. Inspired by demonstration-guided RL, we introduce Reward Calibration from Demonstration (RCfD), which leverages human demonstrations and a reward model to recalibrate the reward objective. Formally, given a prompt, the RCfD objective minimizes the distance between the demonstrations' and the LLM's rewards rather than directly maximizing the reward function. This objective shift avoids incentivizing the LLM to exploit the reward model and promotes more natural and diverse language generation. We show the effectiveness of RCfD on three RL language tasks, where it achieves comparable performance to carefully tuned baselines while mitigating ROO.
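The abstract states the RCfD objective only in words. The sketch below is a rough, hedged illustration (not the authors' implementation; the function names, tensor shapes, and the choice of an absolute-distance penalty are assumptions) contrasting a standard REINFORCE-style reward-maximization loss with an RCfD-style loss that pulls the reward of the policy's sample toward the reward the same reward model assigns to a human demonstration.

```python
import torch

def reward_maximization_loss(logprob_sum, reward):
    # Standard REINFORCE-style objective: push the policy toward higher
    # reward-model scores, which is what reward over-optimization exploits
    # when the reward model is imperfect.
    return -(reward.detach() * logprob_sum).mean()

def rcfd_style_loss(logprob_sum, reward_policy, reward_demo):
    # RCfD-style objective (sketch): penalize the distance between the
    # reward of the sampled completion and the reward of the human
    # demonstration for the same prompt, instead of maximizing the reward.
    distance = (reward_policy - reward_demo).abs()
    return (distance.detach() * logprob_sum).mean()

# Toy usage: random tensors stand in for per-prompt quantities.
logprob_sum = torch.randn(8, requires_grad=True)  # sum of token log-probs per sampled completion
reward_policy = torch.randn(8)                    # reward-model score of the policy samples
reward_demo = torch.randn(8)                      # reward-model score of the demonstrations
loss = rcfd_style_loss(logprob_sum, reward_policy, reward_demo)
loss.backward()
```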
2023
On the Correspondence between Compositionality and Imitation in Emergent Neural Communication
Emily Cheng | Mathieu Rita | Thierry Poibeau
Findings of the Association for Computational Linguistics: ACL 2023
Compositionality is a hallmark of human language that not only enables linguistic generalization, but also potentially facilitates acquisition. When simulating language emergence with neural networks, compositionality has been shown to improve communication performance; however, its impact on imitation learning has yet to be investigated. Our work explores the link between compositionality and imitation in a Lewis game played by deep neural agents. Our contributions are twofold: first, we show that the learning algorithm used to imitate is crucial: supervised learning tends to produce more average languages, while reinforcement learning introduces a selection pressure toward more compositional languages. Second, our study reveals that compositional languages are easier to imitate, which may induce the pressure toward compositional languages in RL imitation settings.
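The contrast between the two imitation procedures is easier to see in code. The following is a minimal, hedged sketch (the names and the exact-match reward are assumptions, not the paper's setup): supervised imitation trains a student speaker with token-level cross-entropy on the teacher's messages, while RL imitation samples a full message and applies REINFORCE with a reward for reproducing the teacher's message.

```python
import torch
import torch.nn.functional as F

def supervised_imitation_loss(student_logits, teacher_message):
    # Supervised imitation: token-level cross-entropy on the teacher's
    # message spreads probability mass over everything the teachers say.
    # student_logits: (msg_len, vocab_size); teacher_message: (msg_len,) symbol ids.
    return F.cross_entropy(student_logits, teacher_message)

def rl_imitation_loss(student_logprobs, student_message, teacher_message):
    # RL imitation: the student samples a whole message and is rewarded
    # only when it reproduces the teacher's message exactly, optimized
    # with REINFORCE; an all-or-nothing signal rather than an averaging one.
    reward = float(torch.equal(student_message, teacher_message))
    return -reward * student_logprobs.sum()

# Toy usage.
msg_len, vocab_size = 4, 10
student_logits = torch.randn(msg_len, vocab_size, requires_grad=True)
teacher_message = torch.randint(vocab_size, (msg_len,))
sl_loss = supervised_imitation_loss(student_logits, teacher_message)
```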
2020
“LazImpa”: Lazy and Impatient neural agents learn to communicate efficiently
Mathieu Rita | Rahma Chaabouni | Emmanuel Dupoux
Proceedings of the 24th Conference on Computational Natural Language Learning
Previous work has shown that artificial neural agents naturally develop surprisingly non-efficient codes. This is illustrated by the fact that, in a referential game in which speaker and listener neural networks optimize accurate transmission over a discrete channel, the emergent messages fail to achieve an optimal length. Furthermore, frequent messages tend to be longer than infrequent ones, a pattern contrary to the Zipf Law of Abbreviation (ZLA) observed in all natural languages. Here, we show that near-optimal and ZLA-compatible messages can emerge, but only if both the speaker and the listener are modified. We hence introduce a new communication system, “LazImpa”, where the speaker is made increasingly lazy, i.e., avoids long messages, and the listener impatient, i.e., seeks to guess the intended content as soon as possible.
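As a rough illustration of the two modifications (a hedged sketch under assumed names and schedules, not the authors' code), the impatient listener can be read as averaging its prediction loss over all message prefixes, and the lazy speaker as paying a length cost whose weight grows as communication succeeds.

```python
import torch
import torch.nn.functional as F

def impatient_listener_loss(prefix_logits, target):
    # "Impatient" listener (sketch): try to identify the target after every
    # symbol, averaging the prediction error over all message prefixes
    # instead of waiting for the full message.
    # prefix_logits: (message_length, num_candidates); target: scalar index.
    losses = [F.cross_entropy(logits.unsqueeze(0), target.view(1))
              for logits in prefix_logits]
    return torch.stack(losses).mean()

def lazy_speaker_cost(message_length, accuracy, alpha=1.0):
    # "Lazy" speaker (sketch): a length penalty whose weight grows with
    # communication accuracy, so long messages become costly only once the
    # task is being solved (the exact schedule here is an assumption).
    return alpha * accuracy * float(message_length)

# Toy usage: a 5-symbol message and 10 candidate objects.
prefix_logits = torch.randn(5, 10)
target = torch.tensor(3)
listener_loss = impatient_listener_loss(prefix_logits, target)
speaker_penalty = lazy_speaker_cost(message_length=5, accuracy=0.8)
```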