Debiasing Reward Models via Causally Motivated Inference-Time Intervention

Kazutoshi Shinoda, Kosuke Nishida, Kyosuke Nishida


Abstract
Reward models (RMs) play a central role in aligning large language models (LLMs) with human preferences. However, RMs are often sensitive to spurious features such as response length. Existing inference-time approaches for mitigating these biases typically focus exclusively on response length, resulting in performance trade-offs. In this paper, we propose causally motivated intervention for mitigating multiple types of biases in RMs at inference time. Our method first identifies neurons whose activations are strongly correlated with predefined bias attributes, and applies neuron-level intervention that suppresses these signals. We evaluate our method on RM benchmarks and observe reductions in sensitivity to spurious features across diverse bias types, without inducing performance trade-offs. Moreover, when used for preference annotation, small RMs (2B and 7B) with our method, which edits less than 2% of all the neurons in RMs, enable LLMs to improve alignment, achieving performance comparable to that of a state-of-the-art 70B RM on AlpacaEval and MT-Bench. Further analysis reveals that bias signals are primarily encoded by neurons in early layers, shedding light on the internal mechanisms of bias exploitation in RMs.
Anthology ID:
2026.acl-long.1029
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
22474–22489
Language:
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1029/
DOI:
Bibkey:
Cite (ACL):
Kazutoshi Shinoda, Kosuke Nishida, and Kyosuke Nishida. 2026. Debiasing Reward Models via Causally Motivated Inference-Time Intervention. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22474–22489, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Debiasing Reward Models via Causally Motivated Inference-Time Intervention (Shinoda et al., ACL 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1029.pdf
Checklist:
 2026.acl-long.1029.checklist.pdf