NeedleChain: Measuring Intact Context Comprehension Capability of Large Language Models

Hyeonseok Moon; Heui-Seok Lim

Abstract

Recent reports suggest that LLMs can handle increasingly long contexts. However, many existing benchmarks for context understanding embed substantial query-irrelevant content, which shifts evaluation toward retrieving relevant snippets rather than fully integrating all provided information. We argue that, under this setting, current benchmarks can overestimate the true context-understanding ability of LLMs. In particular, we demonstrate that when the context consists entirely of query-relevant text, even advanced models such as GPT-4o fail to reliably integrate inputs as short as 200 tokens. To evaluate this capability more rigorously, we introduce NeedleChain, a benchmark designed to test whether models can faithfully incorporate all given evidence. NeedleChain includes three variants that differ in the required order of comprehension, along with a parallel benchmark based on the needle-in-a-haystack (NIAH) paradigm. Comparing these variants enables a more comprehensive assessment of context understanding. We further propose ROPE contraction, a training-free strategy that encourages models to reflect all available information. Together, these contributions highlight the importance of full-context integration and point to new directions for improving reliable reasoning over context.

Anthology ID: 2026.findings-acl.1637
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 32718–32730
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1637/
Cite (ACL): Hyeonseok Moon and Heuiseok Lim. 2026. NeedleChain: Measuring Intact Context Comprehension Capability of Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 32718–32730, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): NeedleChain: Measuring Intact Context Comprehension Capability of Large Language Models (Moon & Lim, Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1637.pdf
Checklist: 2026.findings-acl.1637.checklist.pdf
