BrowseComp-Plus: A Fair and Disentangled Evaluation Benchmark for Deep Search Agents

Zijian Chen; Xueguang Ma; Shengyao Zhuang; Ping Nie; Kai Zou; Sahel Sharifymoghaddam; Andrew Liu; Joshua Green; Kshama Patel; Ruoxi Meng; Mingyi Su; Yanxi Li; Haoran Hong; Xinyu Shi; Xuye Liu; Hosna Oyarhoseini; Nandan Thakur; Crystina Zhang; Luyu Gao; Wenhu Chen; Jimmy Lin

Abstract

Deep search agents that combine large language models with retrieval tools excel at complex, multi-hop queries. Yet existing benchmarks such as BrowseComp rely on black-box web search APIs, which imposes two key limitations. (1) Fairness: dynamic and opaque web APIs hinder reproducibility and fair comparison across agents. (2) Disentanglement: without a fixed document corpus, it is impossible to isolate the retriever's contribution from end-to-end search agent accuracy. We introduce BrowseComp-Plus, a benchmark derived from BrowseComp that employs a fixed, human-verified corpus, enabling controlled retrieval for deep search agents. BrowseComp-Plus clearly distinguishes agent performance: with a BM25 retriever, the open-source Search-R1 achieves 3.86% accuracy, while GPT-5 achieves 55.9%. Additionally, BrowseComp-Plus makes retrieval gains explicit: pairing GPT-5 with the Qwen3-Embedding-8B retriever further improves accuracy to 70.1% while reducing search calls. Overall, BrowseComp-Plus provides a fair and disentangled testbed, advancing both deep search agent evaluation and retrieval research for agentic search. Code and data can be found at: https://texttron.github.io/BrowseComp-Plus/

Anthology ID: 2026.acl-long.1023
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 22349–22370
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1023/
Cite (ACL): Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Sahel Sharifymoghaddam, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Hosna Oyarhoseini, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, and Jimmy Lin. 2026. BrowseComp-Plus: A Fair and Disentangled Evaluation Benchmark for Deep Search Agents. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22349–22370, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): BrowseComp-Plus: A Fair and Disentangled Evaluation Benchmark for Deep Search Agents (Chen et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1023.pdf
Checklist: 2026.acl-long.1023.checklist.pdf
