Multi-Domain Explainability of Preferences

Nitay Calderon, Liat Ein-Dor, Roi Reichart

Abstract

Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and reward models, are central to aligning and evaluating large language models (LLMs). Yet, the underlying concepts that drive these preferences remain poorly understood. In this work, we propose a fully automated method for generating local and global concept-based explanations of preferences across multiple domains. Our method utilizes an LLM to identify concepts (rubrics) that distinguish between chosen and rejected responses, and to represent them with concept-based vectors. To model the relationships between concepts and preferences, we propose a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects. To evaluate our method, we curate a dataset spanning eight diverse domains and explain twelve mechanisms. Our method achieves strong preference prediction performance, outperforming baselines while also being explainable. Additionally, we assess explanations in two application-driven settings. First, guiding LLM outputs with concepts from LaaJ explanations yields responses that those judges consistently prefer. Second, prompting LaaJs with concepts explaining humans improves their preference predictions. Together, our work establishes a new paradigm for explainability in the era of LLMs.
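The abstract describes a white-box regression that separates domain-general from domain-specific concept effects. The sketch below is one way to illustrate that idea and is not the authors' implementation. It assumes a logistic form in which each concept's effective weight is a shared weight plus a per-domain shift. The function names, the concept names, the +1/0/-1 concept encoding, and all weight values are hypothetical and chosen only for illustration.

```python
import math

def preference_probability(concept_vector, shared_weights, domain_weights, domain):
    """Probability that response A is preferred (assumed logistic form)."""
    # Unknown domains fall back to the shared (domain-general) weights only.
    specific = domain_weights.get(domain, [0.0] * len(shared_weights))
    # Effective weight = domain-general effect + domain-specific shift.
    logit = sum(x * (ws + wd)
                for x, ws, wd in zip(concept_vector, shared_weights, specific))
    return 1.0 / (1.0 + math.exp(-logit))

def local_explanation(concept_names, concept_vector, shared_weights,
                      domain_weights, domain):
    """Per-concept contributions to the logit, largest magnitude first."""
    specific = domain_weights.get(domain, [0.0] * len(shared_weights))
    contributions = [(name, x * (ws + wd))
                     for name, x, ws, wd in zip(concept_names, concept_vector,
                                                shared_weights, specific)]
    return sorted(contributions, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical concepts and weights (illustrative values, not from the paper).
concepts = ["accuracy", "clarity", "conciseness"]
shared = [1.0, 0.5, 0.0]
per_domain = {
    "code": [0.5, 0.0, -0.5],
    "chat": [0.0, 0.5, 0.5],
}

# Assumed encoding: +1 if response A is better on the concept,
# -1 if response B is better, 0 if they are tied.
x = [1, 0, -1]

p_code = preference_probability(x, shared, per_domain, "code")
p_chat = preference_probability(x, shared, per_domain, "chat")
explanation_code = local_explanation(concepts, x, shared, per_domain, "code")
```

In this toy setup the same concept vector yields a higher preference probability in the hypothetical "code" domain than in "chat", because the domain-specific shifts change how much each concept counts. Reading off the shared weights gives a global explanation, while the sorted contributions give a local explanation for a single pair.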

Anthology ID: 2025.emnlp-main.736
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 14553–14586
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.736/
Cite (ACL): Nitay Calderon, Liat Ein-Dor, and Roi Reichart. 2025. Multi-Domain Explainability of Preferences. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14553–14586, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Multi-Domain Explainability of Preferences (Calderon et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.736.pdf
Checklist: 2025.emnlp-main.736.checklist.pdf
