Multi-Constraint State Tracking with Negation: A Diagnostic Benchmark for LLM World Modeling

Ayan Sar; Pranav Singh Puri; Sumit Aich; Anurag Kaushish; Tanupriya Choudhury; Ajith Abraham

Abstract

Large Language Models (LLMs) achieve strong performance on a wide range of reasoning benchmarks, yet it remains unclear whether they can reliably maintain and update internal representations of an evolving world described in natural language. In particular, existing evaluations inadequately probe state tracking under multiple interacting constraints and largely overlook the role of negated actions, despite their ubiquity in real-world language. We address this gap by introducing MCST, a diagnostic benchmark for multi-constraint state tracking that evaluates an LLM’s ability to maintain consistent world models across sequences of actions involving inventory changes, spatial movement, temporal ordering, and systematic negation. MCST comprises 100,847 questions spanning 12 real-world domains, with five calibrated difficulty levels, nine question types, and controlled integration of negated actions. The benchmark further incorporates culturally diverse entity names to enable analysis of cross-cultural robustness. We evaluate 14 state-of-the-art LLMs across multiple model families using a unified evaluation protocol. Our results reveal substantial limitations: even the strongest models exhibit sharp performance degradation as difficulty increases, with accuracy dropping below 35% at the highest level. Most notably, we identify negation as a dominant failure mode, causing accuracy reductions of 23-32% across models. We release MCST and the full evaluation framework on GitHub to support future research on state tracking and reasoning in language models.

Anthology ID: 2026.acl-srw.119
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 1317–1350
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-srw.119/
Cite (ACL): Ayan Sar, Pranav Singh Puri, Sumit Aich, Anurag Kaushish, Tanupriya Choudhury, and Ajith Abraham. 2026. Multi-Constraint State Tracking with Negation: A Diagnostic Benchmark for LLM World Modeling. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 1317–1350, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Multi-Constraint State Tracking with Negation: A Diagnostic Benchmark for LLM World Modeling (Sar et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-srw.119.pdf
