MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering

Teng Lin, Yuyu Luo, Honglin Zhang, Jicheng Zhang, Chunlin Liu, Kaishun Wu, Nan Tang


Abstract
Cross-document multi-entity question answering (MEQA) demands the integration of information scattered across documents to resolve complex queries involving entities, relationships, and contextual dependencies. Although Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems show promise, their performance on cross-document MEQA remains underexplored due to the absence of tailored benchmarks. To address this gap, we introduce MEBench, a scalable multi-document, multi-entity benchmark designed to systematically evaluate LLMs’ capacity to retrieve, consolidate, and reason over scattered and dense information. Our benchmark comprises 4,780 questions, systematically categorized into three primary categories: Comparative Reasoning, Statistical Reasoning, and Relational Reasoning, which are further divided into eight distinct types, ensuring broad coverage of real-world multi-entity reasoning scenarios. Our experiments on state-of-the-art LLMs reveal critical limitations: even advanced models achieve only 59% accuracy on MEBench. Our benchmark emphasizes the importance of complete and factually precise information extraction in MEQA tasks, using the Entity-Attributed F1 (EA-F1) metric for granular evaluation of entity-level correctness and attribution validity. MEBench not only highlights systemic weaknesses in current LLM frameworks but also provides a foundation for advancing robust, entity-aware QA architectures.
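
The abstract evaluates answers with an entity-level F1 metric. As a minimal sketch only, the Python snippet below shows one way such a score could be computed over (entity, attribute) pairs; the exact EA-F1 definition, including how attribution validity is scored, is specified in the paper, and the answer format and normalization used here are assumptions for illustration.

# Minimal sketch of an entity-level F1 score (illustration only; not the
# paper's exact EA-F1 definition). Answers are assumed to be lists of
# (entity, attribute) pairs; lowercasing/stripping is an assumed normalization.

def entity_f1(predicted_pairs, gold_pairs):
    """Return (precision, recall, F1) over sets of (entity, attribute) pairs."""
    pred = {(e.strip().lower(), a.strip().lower()) for e, a in predicted_pairs}
    gold = {(e.strip().lower(), a.strip().lower()) for e, a in gold_pairs}
    if not pred or not gold:
        return 0.0, 0.0, 0.0
    true_positives = len(pred & gold)
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical example: a question whose answer lists entities with an attribute.
gold = [("Marie Curie", "Physics"), ("Max Planck", "Physics"), ("Linus Pauling", "Chemistry")]
pred = [("Marie Curie", "Physics"), ("Linus Pauling", "Chemistry"), ("Niels Bohr", "Physics")]
print(entity_f1(pred, gold))  # (0.666..., 0.666..., 0.666...)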
Anthology ID:
2025.emnlp-main.77
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1481–1494
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.77/
Cite (ACL):
Teng Lin, Yuyu Luo, Honglin Zhang, Jicheng Zhang, Chunlin Liu, Kaishun Wu, and Nan Tang. 2025. MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1481–1494, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering (Lin et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.77.pdf
Checklist:
2025.emnlp-main.77.checklist.pdf