@inproceedings{kang-xiong-2025-researcharena,
title = "{R}esearch{A}rena: Benchmarking Large Language Models' Ability to Collect and Organize Information as Research Agents",
author = "Kang, Hao and
Xiong, Chenyan",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.303/",
doi = "10.18653/v1/2025.findings-emnlp.303",
pages = "5653--5671",
ISBN = "979-8-89176-335-7",
abstract = "Large language models (LLMs) excel across many natural language processing tasks but face challenges in domain-specific, analytical tasks such as conducting research surveys. This study introduces ResearchArena, a benchmark designed to evaluate LLMs' capabilities in conducting academic surveys{---}a foundational step in academic research. ResearchArena models the process in three stages: (1) information discovery, identifying relevant literature; (2) information selection, evaluating papers' relevance and impact; and (3) information organization, structuring knowledge into hierarchical frameworks such as mind-maps. Notably, mind-map construction is treated as a bonus task, reflecting its supplementary role in survey-writing. To support these evaluations, we construct an offline environment of 12M full-text academic papers and 7.9K survey papers. To ensure ethical compliance, we do not redistribute copyrighted materials; instead, we provide code to construct the environment from the Semantic Scholar Open Research Corpus (S2ORC). Preliminary evaluations reveal that LLM-based approaches underperform compared to simpler keyword-based retrieval methods, though recent reasoning models such as DeepSeek-R1 show slightly better zero-shot performance. These results underscore significant opportunities for advancing LLMs in autonomous research. We open-source the code to construct the ResearchArena benchmark at https://github.com/cxcscmu/ResearchArena."
}
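The abstract notes that simple keyword-based retrieval still beats LLM-based approaches on the information-discovery stage. As a minimal sketch of what such a baseline looks like (not the paper's implementation; the corpus snippets and query here are hypothetical, and the `rank_bm25` package is assumed):

```python
# Minimal BM25 keyword-retrieval baseline sketch for the
# information-discovery stage (rank papers by lexical match).
# Assumes: pip install rank_bm25 ; documents/query are made up.
from rank_bm25 import BM25Okapi

corpus = [
    "dense retrieval with dual encoders for open-domain QA",
    "survey of large language model agents for scientific discovery",
    "keyword matching baselines for academic literature search",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "llm research agents literature survey".split()
scores = bm25.get_scores(query)

# Rank candidate papers by BM25 score, highest first.
for i in sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True):
    print(f"{scores[i]:.3f}  {corpus[i]}")
```

In the full benchmark the corpus would be the 12M-paper offline environment built from S2ORC rather than an in-memory list; the ranking step is otherwise the same idea.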