Reasoning Gets Harder for LLMs Inside A Dialogue

Ivan Kartáč; Mateusz Lango; Ondřej Dušek

Abstract

Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In this setting, LLMs must perform reasoning inherently while generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models’ reasoning robustness in the TOD setting. We investigate how framing reasoning tasks within TOD affects LLM performance by introducing a new dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, temporal, and commonsense reasoning. Each problem is presented in both isolated and dialogue-based variants, enabling controlled comparison while mitigating data contamination. Experiments on nine LLMs reveal a substantial and consistent performance gap between isolated and dialogue settings. Through ablations and qualitative analysis, we show that this gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements. Our results highlight the need to evaluate LLM reasoning in realistic interactive scenarios.

Anthology ID: 2026.acl-long.560
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 12263–12303
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.560/
Cite (ACL): Ivan Kartáč, Mateusz Lango, and Ondřej Dušek. 2026. Reasoning Gets Harder for LLMs Inside A Dialogue. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12263–12303, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Reasoning Gets Harder for LLMs Inside A Dialogue (Kartáč et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.560.pdf
Checklist: 2026.acl-long.560.checklist.pdf
