F²Bench: An Open-ended Fairness Evaluation Benchmark for LLMs with Factuality Considerations

Tian Lan, Jiang Li, Yemin Wang, Xu Liu, Xiangdong Su, Guanglai Gao

Abstract
With the growing adoption of large language models (LLMs) in NLP tasks, concerns about their fairness have intensified. Yet, most existing fairness benchmarks rely on closed-ended evaluation formats, which diverge from real-world open-ended interactions. These formats are prone to position bias and introduce a “minimum score” effect, where models can earn partial credit simply by guessing. Moreover, such benchmarks often overlook factuality considerations rooted in historical, social, physiological, and cultural contexts, and rarely account for intersectional biases. To address these limitations, we propose F²Bench: an open-ended fairness evaluation benchmark for LLMs that explicitly incorporates factuality considerations. F²Bench comprises 2,568 instances across 10 demographic groups and two open-ended tasks. By integrating text generation, multi-turn reasoning, and factual grounding, F²Bench aims to more accurately reflect the complexities of real-world model usage. We conduct a comprehensive evaluation of several LLMs across different series and parameter sizes. Our results reveal that all models exhibit varying degrees of fairness issues. We further compare open-ended and closed-ended evaluations, analyze model-specific disparities, and provide actionable recommendations for future model development. Our code and dataset are publicly available at https://github.com/VelikayaScarlet/F2Bench.
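For readers who want to explore the released benchmark, the sketch below shows one plausible way to fetch the repository and inspect its instances. It is a minimal illustration under stated assumptions, not the authors' documented interface: the data path F2Bench/data/f2bench.json and the per-record "group" field are hypothetical guesses about the release layout; consult the repository README for the actual format.

    # Minimal exploration sketch for the F²Bench release.
    # Assumptions (not confirmed by the paper): the repo ships its
    # 2,568 instances as a single JSON file, and each record carries
    # a demographic "group" label. Verify against the repo README.
    import json
    import subprocess
    from collections import Counter

    REPO = "https://github.com/VelikayaScarlet/F2Bench"
    subprocess.run(["git", "clone", "--depth", "1", REPO], check=True)

    # Hypothetical data path; adjust to whatever the repository uses.
    with open("F2Bench/data/f2bench.json", encoding="utf-8") as f:
        instances = json.load(f)

    # Sanity-check the advertised size and the per-group breakdown.
    print(f"{len(instances)} instances (paper reports 2,568)")
    print(Counter(item.get("group", "unknown") for item in instances))

If the assumed layout holds, the Counter output should cover the 10 demographic groups described in the abstract; any mismatch means the file layout differs from this sketch's guess.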
Anthology ID:
2025.emnlp-main.105
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
2031–2046
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.105/
Cite (ACL):
Tian Lan, Jiang Li, Yemin Wang, Xu Liu, Xiangdong Su, and Guanglai Gao. 2025. F²Bench: An Open-ended Fairness Evaluation Benchmark for LLMs with Factuality Considerations. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2031–2046, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
F²Bench: An Open-ended Fairness Evaluation Benchmark for LLMs with Factuality Considerations (Lan et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.105.pdf
Checklist:
2025.emnlp-main.105.checklist.pdf