AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG

Qijie You; Wenkai Yu; Hao Liang; Zhen Hao Wong; Wentao Zhang

AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG

Qijie You, Wenkai Yu, Hao Liang, Zhen Hao Wong, Wentao Zhang

Abstract

With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction. Multi-hop reasoning, which requires models to engage in deliberate thinking and multi-step interaction, serves as a critical testbed for assessing such capabilities. However, existing benchmarks typically provide only final questions and answers, while lacking the intermediate hop-level questions that gradually connect atomic questions to the final multi-hop query. This limitation prevents researchers from analyzing at which step an agent fails and restricts more fine-grained evaluation of model capabilities. Moreover, most current benchmarks are manually constructed, which is both time-consuming and labor-intensive, while also limiting scalability and generalization. To address these challenges, we introduce AgenticRAGTracer, the first Agentic RAG benchmark that is primarily constructed automatically by large language models and designed to support step-by-step validation. Our benchmark spans multiple domains, contains 1,305 data points, and has no overlap with existing mainstream benchmarks. Extensive experiments demonstrate that even the best large language models perform poorly on our dataset. For instance, GPT-5 attains merely 22.6% EM accuracy on the hardest portion of our dataset. Hop-aware diagnosis reveals that failures are primarily driven by distorted reasoning chains—either collapsing prematurely or wandering into over-extension. This highlights a critical inability to allocate steps consistent with the task’s logical structure, providing a diagnostic dimension missing in traditional evaluations. Our code and data are available at https://github.com/YqjMartin/AgenticRAGTracer.

Anthology ID:: 2026.findings-acl.66
Volume:: Findings of the Association for Computational Linguistics: ACL 2026
Month:: July
Year:: 2026
Address:: San Diego, California, United States
Editors:: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:: Findings
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 1301–1335
Language:
URL:: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.66/
DOI:
Bibkey:
Cite (ACL):: Qijie You, Wenkai Yu, Hao Liang, Zhen Hao Wong, and Wentao Zhang. 2026. AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG. In Findings of the Association for Computational Linguistics: ACL 2026, pages 1301–1335, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):: AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG (You et al., Findings 2026)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.66.pdf
Checklist:: 2026.findings-acl.66.checklist.pdf

PDF Cite Search Checklist Fix data