MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation

Chenghao Yang; Yinbo Luo; Zhoufutu Wen; Qi Chu; Tao Gong; Longxiang Liu; Kaiyuan Zhang; Jianpeng Jiao; Ge Zhang; Wenhao Huang; Nenghai Yu

doi:10.18653/v1/2025.findings-emnlp.314

MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation

Chenghao Yang, Yinbo Luo, Zhoufutu Wen, Qi Chu, Tao Gong, Longxiang Liu, Kaiyuan Zhang, Jianpeng Jiao, Ge Zhang, Wenhao Huang, Nenghai Yu

Abstract

Large Language Models (LLMs), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, LLMs’ robustness, especially in handling long complex dialogue sessions, including frequent motivation transfer, sophisticated cross-turn dependency, is criticized all along. Nevertheless, no existing benchmarks can fully reflect these weaknesses. We present MARS-Bench, a Multi-turn Athletic Real-world Scenario Dialogue Benchmark, designed to remedy the gap. MARS-Bench is constructed from play-by-play text commentary so to feature realistic dialogues specifically designed to evaluate three critical aspects of multi-turn conversations: ultra multi-turn, interactive multi-turn, and cross-turn tasks. Extensive experiments on MARS-Bench also reveal that closed-source LLMs significantly outperform open-source alternatives, explicit reasoning significantly boosts LLMs’ robustness on handling long complex dialogue sessions, and LLMs indeed face significant challenge when handling motivation transfer and sophisticated cross-turn dependency. Moreover, we provide mechanistic interpretability on how attention sinks due to special tokens lead to LLMs’ performance degradation when handling long complex dialogue sessions based on attention visualization experiment in Qwen2.5-7B-Instruction.

Anthology ID:: 2025.findings-emnlp.314
Volume:: Findings of the Association for Computational Linguistics: EMNLP 2025
Month:: November
Year:: 2025
Address:: Suzhou, China
Editors:: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:: Findings
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 5872–5898
Language:
URL:: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.314/
DOI:: 10.18653/v1/2025.findings-emnlp.314
Bibkey:
Cite (ACL):: Chenghao Yang, Yinbo Luo, Zhoufutu Wen, Qi Chu, Tao Gong, Longxiang Liu, Kaiyuan Zhang, Jianpeng Jiao, Ge Zhang, Wenhao Huang, and Nenghai Yu. 2025. MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5872–5898, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):: MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation (Yang et al., Findings 2025)
Copy Citation:
PDF:: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.314.pdf
Checklist:: 2025.findings-emnlp.314.checklist.pdf

PDF Cite Search Checklist Fix data