PoSum-Bench: Benchmarking Position Bias in LLM-based Conversational Summarization
Xu Sun | Lionel Delphin-Poulat | Christèle Tarnec | Anastasia Shimorina
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) are increasingly used for zero-shot conversation summarization, but they often exhibit positional bias—tending to overemphasize content from the beginning or end of a conversation while neglecting the middle. To address this issue, we introduce PoSum-Bench, a comprehensive benchmark for evaluating positional bias in conversational summarization, featuring diverse English and French conversational datasets spanning formal meetings, casual conversations, and customer service interactions. We propose a novel semantic similarity-based sentence-level metric that quantifies the direction and magnitude of positional bias in model-generated summaries, enabling systematic, reference-free evaluation across conversation positions, languages, and conversational contexts. Our benchmark and methodology thus provide the first systematic, cross-lingual framework for reference-free evaluation of positional bias in conversational summarization, laying the groundwork for developing more balanced and unbiased summarization models.
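The abstract does not give the metric's exact formulation, but the underlying idea—aligning summary sentences to source sentences by semantic similarity and inspecting where in the conversation they land—can be illustrated with a minimal sketch. The function name `positional_profile`, the use of the `sentence-transformers` library, and the `all-MiniLM-L6-v2` encoder are illustrative assumptions here, not the paper's implementation.

```python
# Illustrative sketch (not the PoSum-Bench metric): align each summary
# sentence to its most similar conversation sentence and record that
# sentence's relative position in the conversation (0 = start, 1 = end).
# A mean position far from 0.5 hints at lead or tail bias.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def positional_profile(conversation_sentences, summary_sentences):
    """Return the relative source position of the best match for each summary sentence."""
    conv_emb = model.encode(conversation_sentences, convert_to_tensor=True)
    summ_emb = model.encode(summary_sentences, convert_to_tensor=True)
    sims = util.cos_sim(summ_emb, conv_emb)       # shape: (n_summary, n_conversation)
    best = sims.argmax(dim=1).tolist()            # index of the most similar source sentence
    n = max(len(conversation_sentences) - 1, 1)
    return [i / n for i in best]

# Toy usage: both summary sentences align to the middle/end of the conversation.
profile = positional_profile(
    ["Hi, let's plan the release.", "QA found two blockers.",
     "Marketing wants a demo.", "We agreed to ship next Friday."],
    ["The team agreed to ship next Friday.", "QA reported two blockers."],
)
print(profile, sum(profile) / len(profile))
```

Aggregating such positional profiles over many conversations and model summaries is one plausible way to expose whether a summarizer systematically favors the start or end of its input.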