PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation

Yongfu Xue


Abstract
Reward models are pivotal for aligning Large Language Models (LLMs) with human preferences. Existing approaches face two key limitations. First, discriminative reward models require large-scale annotated data because they cannot exploit the preference instruction-following capability of LLMs that generative reward models draw on. Second, reward models are particularly prone to reward overoptimization, where LLMs exploit weaknesses in the reward function instead of genuinely improving alignment. We introduce PIRA, a training paradigm that combines three complementary strategies to address these challenges: (1) reformulating question–answer pairs into preference-task instructions to explicitly leverage LLMs’ preference instruction-following capability, (2) averaging the rewards obtained from diverse preference-task instructions for each sample, which mitigates task-specific bias and improves robustness across evaluation perspectives, and (3) averaging the value-head outputs computed under different dropout rates to stabilize reward estimation. Experiments on public datasets show that PIRA considerably improves performance, enhances generalization, and effectively mitigates reward overoptimization.
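The dual aggregation described in the abstract can be pictured with a short sketch. This is a minimal illustration under assumptions, not the authors' implementation: the instruction templates, dropout rates, and the toy encoder and value head below are placeholders chosen only so the example runs end to end.

```python
# Minimal sketch of PIRA-style dual aggregation (assumed, not the paper's code):
# score each (question, answer) pair under several preference-task instructions
# and several value-head dropout rates, then average all scores.
import torch
import torch.nn as nn

# Hypothetical preference-task instruction templates (illustrative only).
TEMPLATES = [
    "Rate how well the answer satisfies the user's preference.\nQuestion: {q}\nAnswer: {a}",
    "Judge the helpfulness of the answer.\nQuestion: {q}\nAnswer: {a}",
    "Assess whether the answer is harmless and honest.\nQuestion: {q}\nAnswer: {a}",
]

DROPOUT_RATES = [0.0, 0.1, 0.2]  # illustrative values, not from the paper


class ToyRewardModel(nn.Module):
    """Stand-in for an instruction-tuned LLM backbone with a scalar value head."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.embed = nn.EmbeddingBag(10_000, hidden)  # toy text encoder
        self.dropout = nn.Dropout(p=0.1)
        self.value_head = nn.Linear(hidden, 1)

    def encode(self, text: str) -> torch.Tensor:
        # Hash-based "tokenization" purely so the sketch is self-contained.
        ids = torch.tensor([[hash(tok) % 10_000 for tok in text.split()]])
        return self.embed(ids)

    def forward(self, text: str, dropout_p: float) -> torch.Tensor:
        h = self.encode(text)
        self.dropout.p = dropout_p  # vary dropout before the value head
        return self.value_head(self.dropout(h)).squeeze()


def pira_reward(model: ToyRewardModel, question: str, answer: str) -> float:
    """Dual aggregation: average over instruction templates and dropout rates."""
    model.train()  # keep dropout active so different rates actually differ
    scores = []
    for template in TEMPLATES:
        prompt = template.format(q=question, a=answer)
        for p in DROPOUT_RATES:
            with torch.no_grad():
                scores.append(model(prompt, dropout_p=p))
    return torch.stack(scores).mean().item()


if __name__ == "__main__":
    rm = ToyRewardModel()
    print(pira_reward(rm, "How do I sort a list in Python?", "Use sorted(my_list)."))
```

The averaging over templates targets task-specific bias, while the averaging over dropout rates targets noisy value-head estimates; both reduce the variance an RL policy could exploit, which is the stated motivation for mitigating reward overoptimization.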
Anthology ID:
2026.findings-eacl.117
Volume:
Findings of the Association for Computational Linguistics: EACL 2026
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2226–2234
URL:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.117/
Cite (ACL):
Yongfu Xue. 2026. PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation. In Findings of the Association for Computational Linguistics: EACL 2026, pages 2226–2234, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation (Xue, Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.117.pdf
Checklist:
 2026.findings-eacl.117.checklist.pdf