Yemin Wang


2025

F²Bench: An Open-ended Fairness Evaluation Benchmark for LLMs with Factuality Considerations
Tian Lan | Jiang Li | Yemin Wang | Xu Liu | Xiangdong Su | Guanglai Gao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

With the growing adoption of large language models (LLMs) in NLP tasks, concerns about their fairness have intensified. Yet, most existing fairness benchmarks rely on closed-ended evaluation formats, which diverge from real-world open-ended interactions. These formats are prone to position bias and introduce a “minimum score” effect, where models can earn partial credit simply by guessing. Moreover, such benchmarks often overlook factuality considerations rooted in historical, social, physiological, and cultural contexts, and rarely account for intersectional biases. To address these limitations, we propose F²Bench: an open-ended fairness evaluation benchmark for LLMs that explicitly incorporates factuality considerations. F²Bench comprises 2,568 instances across 10 demographic groups and two open-ended tasks. By integrating text generation, multi-turn reasoning, and factual grounding, F²Bench aims to reflect the complexities of real-world model usage more accurately. We conduct a comprehensive evaluation of several LLMs across different series and parameter sizes. Our results reveal that all evaluated models exhibit fairness issues to varying degrees. We further compare open-ended and closed-ended evaluations, analyze model-specific disparities, and provide actionable recommendations for future model development. Our code and dataset are publicly available at https://github.com/VelikayaScarlet/F2Bench.