RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation

Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac


Abstract
While Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning generative AI, we present empirical evidence that it can also cause severe, systematic misalignment. We hypothesize that this stems from evaluator feedback depending on downstream outcome predictions (foresight) that can be influenced by the AI’s output, inducing Goodhart’s law dynamics. We present a theoretical analysis showing that conditioning evaluator feedback on downstream observations (hindsight) inhibits this effect by decoupling the alignment signal from potentially compromised predictions—crucially, the result holds even if the observed outcomes are sampled from the AI’s own world model. Building on this insight, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which presents plausible simulated outcomes to evaluators before eliciting feedback. We validate RLHS across three consultancy settings—marketplace interactions, restaurant recommendations, and online course advising—using both online (PPO) and offline (DPO) fine-tuning methods, and show that it substantially improves alignment over RLHF in experiments and human evaluations. We perform post-hoc benchmark evaluations on TruthfulQA, HaluEval, and TrustLLM, finding that even after single-task fine-tuning, RLHF misalignment persists, while RLHS consistently outperforms baselines and demonstrates strong out-of-domain generalization.
Anthology ID:
2026.findings-acl.556
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
11457–11483
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.556/
Cite (ACL):
Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, and Jaime Fernández Fisac. 2026. RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 11457–11483, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation (Liang et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.556.pdf
Checklist:
2026.findings-acl.556.checklist.pdf