Zhen Hao Wong
2026
Data-Centric Perspectives on Agentic Retrieval-Augmented Generation: A Survey
Jingwen Deng | Jihao Huang | Zhen Hao Wong | Hao Liang | Quanqing Xu | Bin Cui | Wentao Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) excel at natural language understanding and generation, yet their reliance on static pre-training corpora may lead to outdated knowledge, hallucinations, and limited adaptability. Retrieval-Augmented Generation (RAG) mitigates these issues by grounding model outputs with external retrieval, but conventional RAG remains constrained by a fixed retrieve-then-generate routine and struggles with multi-step reasoning and tool calls. **Agentic RAG** addresses these limitations by enabling LLM agents to actively decompose tasks, issue exploratory queries, and refine evidence through iterative retrieval. Despite growing interest, the development of Agentic RAG is impeded by *data scarcity*: unlike traditional RAG, it needs challenging tasks that demand planning, retrieval, and multiple reasoning decisions, along with correspondingly rich, interactive agent trajectories. This survey presents the first data-centric overview of Agentic RAG, framing its data lifecycle (data collection, data preprocessing and task formulation, task construction, data for evaluation, and data enhancement for training) and cataloging representative training datasets and benchmarks in different domains (e.g., question answering, web, software engineering). From a data perspective, we aim to guide the creation of scalable, high-quality datasets for the next generation of adaptive, knowledge-seeking LLM agents. The project page is at https://github.com/fatty-belly/Awesome-AgenticRAG-Data/.
AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG
Qijie You | Wenkai Yu | Hao Liang | Zhen Hao Wong | Wentao Zhang
Findings of the Association for Computational Linguistics: ACL 2026
With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction. Multi-hop reasoning, which requires models to engage in deliberate thinking and multi-step interaction, serves as a critical testbed for assessing such capabilities. However, existing benchmarks typically provide only final questions and answers, while lacking the intermediate hop-level questions that gradually connect atomic questions to the final multi-hop query. This limitation prevents researchers from analyzing at which step an agent fails and restricts more fine-grained evaluation of model capabilities. Moreover, most current benchmarks are manually constructed, which is both time-consuming and labor-intensive, while also limiting scalability and generalization. To address these challenges, we introduce AgenticRAGTracer, the first Agentic RAG benchmark that is primarily constructed automatically by large language models and designed to support step-by-step validation. Our benchmark spans multiple domains, contains 1,305 data points, and has no overlap with existing mainstream benchmarks. Extensive experiments demonstrate that even the best large language models perform poorly on our dataset. For instance, GPT-5 attains merely 22.6% EM accuracy on the hardest portion of our dataset. Hop-aware diagnosis reveals that failures are primarily driven by distorted reasoning chains—either collapsing prematurely or wandering into over-extension. This highlights a critical inability to allocate steps consistent with the task’s logical structure, providing a diagnostic dimension missing in traditional evaluations. Our code and data are available at https://github.com/YqjMartin/AgenticRAGTracer.
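As a rough illustration of the two measurements mentioned in this abstract (EM accuracy and locating the hop at which an agent fails), the sketch below may help. It is not taken from the AgenticRAGTracer code: the normalization rules follow the common SQuAD-style convention, and the function names `normalize`, `em_accuracy`, and `first_failed_hop` are hypothetical.

```python
import re
import string
from typing import List, Optional


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the benchmark may use other rules.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when prediction and gold answer are identical after normalization."""
    return normalize(prediction) == normalize(gold)


def em_accuracy(predictions: List[str], golds: List[str]) -> float:
    """Fraction of examples whose prediction exactly matches the gold answer."""
    if not golds:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)


def first_failed_hop(hop_predictions: List[str], hop_golds: List[str]) -> Optional[int]:
    """Return the 0-based index of the first hop answered incorrectly, or None.

    Hypothetical helper: assumes one gold answer per intermediate hop question.
    """
    for index, (pred, gold) in enumerate(zip(hop_predictions, hop_golds)):
        if not exact_match(pred, gold):
            return index
    return None
```

With hop-level gold answers available, `first_failed_hop` reports where a reasoning chain first breaks, which final-answer EM alone cannot show.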
2025
Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration
Tianyi Bai | Ling Yang | Zhen Hao Wong | Fupeng Sun | Xinlin Zhuang | Jiahui Peng | Chi Zhang | Lijun Wu | Qiu Jiantao | Wentao Zhang | Binhang Yuan | Conghui He
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Efficient data selection is crucial to accelerate the pretraining of language models (LMs). While various methods have been proposed to enhance data efficiency, limited research has addressed the inherent conflicts between these approaches to achieve optimal data selection for LM pretraining. To tackle this problem, we propose a multi-actor collaborative data selection mechanism. Each data selection method independently prioritizes data based on its specific criterion and updates its prioritization rules using the current state of the model, functioning as an independent actor for data selection. Additionally, a console is designed to adjust the impacts of different actors at various stages and dynamically integrate information from all actors throughout the LM pretraining process. We conduct extensive empirical studies to evaluate our multi-actor framework. The experimental results demonstrate that our approach significantly improves data efficiency, accelerates convergence in LM pretraining, and achieves an average relative performance gain of up to 10.5% across multiple language model benchmarks compared to state-of-the-art methods.
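The actor-and-console idea described in this abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the two scoring functions, the linear weight schedule in `console_weights`, and all names are hypothetical, and the step where each actor updates its rules from the current model state is omitted.

```python
from typing import Callable, Dict, List

# An actor maps a document to a priority score in [0, 1].
Actor = Callable[[str], float]


def quality_actor(doc: str) -> float:
    """Hypothetical criterion: lexical diversity, capped at 1.0."""
    return min(len(set(doc.split())) / 10.0, 1.0)


def domain_actor(doc: str) -> float:
    """Hypothetical criterion: 1.0 if the document mentions 'model', else 0.0."""
    return 1.0 if "model" in doc.lower() else 0.0


def console_weights(step: int, total_steps: int) -> Dict[str, float]:
    """Assumed schedule: shift influence from quality to domain as training proceeds."""
    progress = step / total_steps
    return {"quality": 1.0 - 0.5 * progress, "domain": 0.5 * progress}


def select(docs: List[str], actors: Dict[str, Actor],
           weights: Dict[str, float], k: int) -> List[str]:
    """Rank documents by the weighted sum of actor scores and keep the top k."""
    def combined(doc: str) -> float:
        return sum(weights[name] * actor(doc) for name, actor in actors.items())
    return sorted(docs, key=combined, reverse=True)[:k]


actors = {"quality": quality_actor, "domain": domain_actor}
docs = [
    "language model pretraining data",
    "cat cat cat cat",
    "a b c d e f g h i j k l",
]
early = select(docs, actors, console_weights(0, 100), k=1)
late = select(docs, actors, console_weights(100, 100), k=1)
```

In this toy setup the selected document changes between the early and late stages because the console reweights the actors, which is the kind of stage-dependent integration the abstract describes.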