Sahel Sharifymoghaddam
2026
BrowseComp-Plus: A Fair and Disentangled Evaluation Benchmark for Deep Search Agents
Zijian Chen | Xueguang Ma | Shengyao Zhuang | Ping Nie | Kai Zou | Sahel Sharifymoghaddam | Andrew Liu | Joshua Green | Kshama Patel | Ruoxi Meng | Mingyi Su | Yanxi Li | Haoran Hong | Xinyu Shi | Xuye Liu | Hosna Oyarhoseini | Nandan Thakur | Crystina Zhang | Luyu Gao | Wenhu Chen | Jimmy Lin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Deep search agents that combine large language models with retrieval tools excel at complex, multi-hop queries. Yet existing benchmarks such as BrowseComp rely on black-box web search APIs, which imposes two key limitations. (1) Fairness: for agents, dynamic and opaque web APIs hinder reproducibility and fair comparisons across agents. (2) Disentanglement: for retrieval, the lack of a fixed document corpus makes it impossible to isolate retriever contributions from end-to-end search agent accuracy. We introduce BrowseComp-Plus, a benchmark derived from BrowseComp that employs a fixed, human-verified corpus, enabling controlled retrieval for deep search agents. BrowseComp-Plus clearly distinguishes agent performance: with a BM25 retriever, the open-source Search-R1 achieves 3.86% accuracy, while GPT-5 achieves 55.9%. Additionally, BrowseComp-Plus makes retrieval gains explicit: pairing GPT-5 with the Qwen3-Embedding-8B retriever further improves accuracy to 70.1% while reducing search calls. Overall, BrowseComp-Plus provides a fair and disentangled testbed, advancing both deep search agent evaluation and retrieval research for agentic search. Code and data can be found at: https://texttron.github.io/BrowseComp-Plus/
Rerank Before You Reason: Analyzing Reranking Tradeoffs through Effective Token Cost in Deep Search Agents
Sahel Sharifymoghaddam | Jimmy Lin
Findings of the Association for Computational Linguistics: ACL 2026
Deep research agents rely on iterative retrieval and reasoning to answer complex queries, but scaling test-time computation raises significant efficiency concerns. We study how to allocate reasoning budget in deep search pipelines, focusing on the role of listwise reranking. Using the BrowseComp-Plus benchmark, we analyze tradeoffs between model scale, reasoning effort, reranking depth, and total token cost via a novel effective token cost (ETC) metric. Our results show that reranking consistently improves retrieval and end-to-end accuracy, and that moderate reranking often yields larger gains than increasing search-time reasoning, achieving comparable accuracy at substantially lower cost. All our code is available at https://github.com/sahel-sh/DeepHone.
2025
UniRAG: Universal Retrieval Augmentation for Large Vision Language Models
Sahel Sharifymoghaddam | Shivani Upadhyay | Wenhu Chen | Jimmy Lin
Findings of the Association for Computational Linguistics: NAACL 2025
Recently, Large Vision Language Models (LVLMs) have unlocked many complex use cases that require Multi-Modal (MM) understanding (e.g., image captioning or visual question answering) and MM generation (e.g., text-guided image generation or editing) capabilities. To further improve the output fidelity of LVLMs, we introduce UniRAG, a plug-and-play technique that adds relevant retrieved information to prompts as few-shot examples during inference. Contrary to the common belief that Retrieval Augmentation (RA) mainly improves generation or understanding of uncommon entities, our evaluation results on the MSCOCO dataset with common entities show that both proprietary models like GPT-4o and Gemini-Pro and smaller open-source models like LLaVA, LaVIT, and Emu2 significantly enhance their generation quality when their input prompts are augmented with relevant information retrieved by Vision-Language (VL) retrievers like UniIR models. All the necessary code to reproduce our results is available at https://github.com/castorini/UniRAG.
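As a rough illustration of the idea of adding retrieved items to a prompt as few-shot examples, the sketch below assembles a captioning prompt from the top-scoring retrieved candidates. It is not taken from the UniRAG repository: the `Candidate` type, the `build_few_shot_prompt` function, the prompt wording, and the image identifiers are all hypothetical, and the scores are assumed to come from some external vision-language retriever.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A retrieved (image, caption) pair; all fields are illustrative."""
    image_id: str
    caption: str
    score: float  # retriever similarity, higher means more relevant


def build_few_shot_prompt(query_image_id: str,
                          candidates: List[Candidate],
                          k: int = 2) -> str:
    """Place the k highest-scoring candidates before the query as examples."""
    # Keep only the k most relevant retrieved pairs, best first.
    top = sorted(candidates, key=lambda c: c.score, reverse=True)[:k]
    parts = ["Caption the final image, following the examples."]
    for cand in top:
        parts.append(f"Image: {cand.image_id}\nCaption: {cand.caption}")
    # The query goes last, with the caption left for the model to fill in.
    parts.append(f"Image: {query_image_id}\nCaption:")
    return "\n\n".join(parts)


retrieved = [
    Candidate("img_a", "A dog catching a frisbee in a park.", 0.9),
    Candidate("img_b", "A plate of pasta on a table.", 0.2),
    Candidate("img_c", "A puppy running across a lawn.", 0.7),
]
prompt = build_few_shot_prompt("img_query", retrieved, k=2)
print(prompt)
```

With `k=2`, only the two highest-scoring candidates are used as examples, so the low-scoring pasta caption is left out of the prompt.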