RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac
Abstract
While Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning generative AI, we present empirical evidence that it can also cause severe, systematic misalignment. We hypothesize that this stems from evaluator feedback depending on downstream outcome predictions (foresight) that can be influenced by the AI’s output, inducing Goodhart’s law dynamics. We present a theoretical analysis showing that conditioning evaluator feedback on downstream observations (hindsight) inhibits this effect by decoupling the alignment signal from potentially compromised predictions; crucially, the result holds even if the observed outcomes are sampled from the AI’s own world model. Building on this insight, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which presents plausible simulated outcomes to evaluators before eliciting feedback. We validate RLHS across three consultancy settings (marketplace interactions, restaurant recommendations, and online course advising) using both online (PPO) and offline (DPO) fine-tuning methods, and show that it substantially improves alignment over RLHF in experiments and human evaluations. We perform post-hoc benchmark evaluations on TruthfulQA, HaluEval, and TrustLLM, finding that even after single-task fine-tuning, RLHF misalignment persists, while RLHS consistently outperforms baselines and demonstrates strong out-of-domain generalization.
- Anthology ID:
- 2026.findings-acl.556
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 11457–11483
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.556/
- Cite (ACL):
- Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, and Jaime Fernández Fisac. 2026. RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 11457–11483, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation (Liang et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.556.pdf