Tianning Chai
2026
Activation Reward Models for Few-Shot Model Alignment
Tianning Chai | Chancharik Mitra | Brandon Huang | Gautam Rajendrakumar Gare | Zhiqiu Lin | Assaf Arbelle | Leonid Karlinsky | Rogerio Feris | Trevor Darrell | Deva Ramanan | Roei Herzig
Findings of the Association for Computational Linguistics: ACL 2026
Aligning Large Language Models (LLMs) and Large Multimodal Models (LMMs) to human preferences is crucial for improving their real-world behavior. A common approach is to use reward models that enable reinforcement-learning post-training. However, traditional reward modeling requires finetuning on large preference datasets, limiting adaptability to new preferences. We introduce Activation Reward Models (Activation RMs)—the first mechanistic interpretability approach that steers LLM activations to align with few-shot preference data without finetuning. Our method combines activation denoising and output token likelihood scoring, achieving state-of-the-art performance on standard reward modeling benchmarks, surpassing zero-shot, few-shot, and voting-based baselines. We further demonstrate that Activation RMs mitigate reward hacking behaviors and remain robust to noisy exemplars and spurious reward signals. To evaluate this, we propose PreferenceHack, a novel few-shot benchmark testing reward models on reward hacking in a paired preference format, where Activation RMs achieve state-of-the-art performance, surpassing GPT-4o.
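The paired preference format mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation or the PreferenceHack data: `paired_preference_accuracy`, `length_reward`, and the toy pairs are hypothetical names and examples introduced here. It assumes only that a reward model can be treated as a function that scores a (prompt, response) pair, and that a pair counts as correct when the preferred response scores strictly higher than the rejected one.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical layout for one example: (prompt, preferred response, rejected response).
PreferencePair = Tuple[str, str, str]


def paired_preference_accuracy(
    score: Callable[[str, str], float],
    pairs: Sequence[PreferencePair],
) -> float:
    """Fraction of pairs where the preferred response gets the strictly higher score."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(
        1
        for prompt, chosen, rejected in pairs
        if score(prompt, chosen) > score(prompt, rejected)
    )
    return correct / len(pairs)


def length_reward(prompt: str, response: str) -> float:
    """Deliberately flawed stand-in scorer that rewards longer responses."""
    return float(len(response))


# Toy pairs (illustrative only): in the first, the rejected answer is longer but wrong.
toy_pairs = [
    ("What is 2 + 2?", "4", "The answer is most certainly 5, as everyone knows."),
    ("Name a primary color.", "Red is a primary color.", "Dog."),
]

accuracy = paired_preference_accuracy(length_reward, toy_pairs)
print(f"paired accuracy of the length-biased scorer: {accuracy:.2f}")
```

The length-biased scorer is fooled by the first pair, which is the kind of spurious signal a reward-hacking benchmark in paired form is meant to expose. Any real reward model could be passed in place of `length_reward`, provided it is wrapped to match the same two-argument signature.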