Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval over haystacks

Amey Hengle, Prasoon Bajpai, Soham Dan, Tanmoy Chakraborty


Abstract
Existing multilingual long-context benchmarks, often based on the popular needle-in-a-haystack test, primarily evaluate a model’s ability to locate specific information buried within irrelevant texts. However, such a retrieval-centric approach is myopic and inherently limited, as successful recall alone does not indicate a model’s capacity to reason over extended contexts. Moreover, these benchmarks are susceptible to data leakage and short-circuiting, and risk making the evaluation a priori identifiable. To address these limitations, we introduce MLRBench, a new synthetic benchmark for multilingual long-context reasoning. Unlike existing benchmarks, MLRBench goes beyond surface-level retrieval by including bAbI-style tasks that test multi-hop inference, aggregation, and epistemic reasoning. Spanning seven languages, we design MLRBench to be parallel, resistant to leakage, and scalable to arbitrary context lengths. Our extensive experiments with open-weight large language models (LLMs) reveal a pronounced gap between high- and low-resource languages, particularly on tasks that require the model to aggregate multiple facts or predict the absence of information. We also find that, in multilingual settings, LLMs effectively utilize less than 30% of their claimed context length. Although off-the-shelf Retrieval-Augmented Generation (RAG) alleviates this to a certain extent, it does not solve the long-context problem. We open-source MLRBench to enable future research on the improved evaluation and training of multilingual LLMs.
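For readers unfamiliar with the retrieval-centric setup the abstract critiques, a minimal sketch of a needle-in-a-haystack probe follows. The function name, the distractor source, and the word-count length heuristic are illustrative assumptions for exposition, not MLRBench's actual construction.

```python
import random

def build_haystack_prompt(needle: str, distractors: list[str],
                          target_tokens: int, question: str,
                          seed: int = 0) -> str:
    """Bury a single 'needle' fact at a random depth inside
    irrelevant 'haystack' text padded to roughly target_tokens.

    Illustrative sketch: length is approximated by counting
    whitespace-separated words rather than using a real tokenizer.
    """
    rng = random.Random(seed)
    haystack: list[str] = []
    n_tokens = 0
    # Pad with irrelevant sentences until the context is long enough.
    while n_tokens < target_tokens:
        sent = rng.choice(distractors)
        haystack.append(sent)
        n_tokens += len(sent.split())
    # Insert the needle at a uniformly random position.
    haystack.insert(rng.randrange(len(haystack) + 1), needle)
    context = " ".join(haystack)
    return f"{context}\n\nQuestion: {question}\nAnswer:"
```

A probe like this only checks whether the model can copy the needle back out of the context; it says nothing about multi-hop inference, aggregation, or recognizing that an answer is absent, which is precisely the gap MLRBench targets.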
Anthology ID: 2026.eacl-long.290
Volume: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Màrquez
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 6128–6152
URL: https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.290/
Cite (ACL): Amey Hengle, Prasoon Bajpai, Soham Dan, and Tanmoy Chakraborty. 2026. Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval over haystacks. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6128–6152, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval over haystacks (Hengle et al., EACL 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.290.pdf