MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation

Xiaoyuan Li; Keqin Bao; Yubo Ma; Moxin Li; Wenjie Wang; Rui Men; Yichang Zhang; Fuli Feng; Dayiheng Liu

Abstract

Recent advances in Large Language Models (LLMs) have shown promising results in complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute this gap to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill these gaps, we present MTR-Bench for evaluating LLMs' Multi-Turn Reasoning. Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities and fine-grained difficulty levels, and requires multi-turn interaction with the environments. Moreover, MTR-Bench features a fully automated framework spanning both dataset construction and model evaluation, which enables scalable assessment without human intervention. Extensive experiments reveal that even cutting-edge reasoning models fall short on multi-turn, interactive reasoning tasks. Further analysis of these results offers valuable insights for future research on interactive AI systems.

Anthology ID: 2026.acl-long.984
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 21525–21577
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.984/
Cite (ACL): Xiaoyuan Li, Keqin Bao, Yubo Ma, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, and Dayiheng Liu. 2026. MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21525–21577, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation (Li et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.984.pdf
Checklist: 2026.acl-long.984.checklist.pdf
