Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations

Jiho Jin, Woosung Kang, Junho Myung, Alice Oh


Abstract
Measuring social bias in large language models (LLMs) is crucial, but existing bias evaluation methods struggle to assess bias in long-form generation. We propose a Bias Benchmark for Generation (BBG), an adaptation of the Bias Benchmark for QA (BBQ), designed to evaluate social bias in long-form generation by having LLMs generate continuations of story prompts. Building our benchmark in English and Korean, we measure the probability of neutral and biased generations across ten LLMs. We also compare our long-form story generation evaluation results with multiple-choice BBQ evaluation, showing that the two approaches produce inconsistent results.
Anthology ID:
2025.findings-acl.585
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
11215–11228
URL:
https://preview.aclanthology.org/landing_page/2025.findings-acl.585/
Cite (ACL):
Jiho Jin, Woosung Kang, Junho Myung, and Alice Oh. 2025. Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations. In Findings of the Association for Computational Linguistics: ACL 2025, pages 11215–11228, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations (Jin et al., Findings 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.findings-acl.585.pdf