DRAGON: Domain-specific Robust Automatic Data Generation for RAG Optimization
Haiyang Shen, Hang Yan, Zhongshi Xing, Mugeng Liu, Yue Li, Zhiyang Chen, Yuxiang Wang, Jiuzheng Wang, Yun Ma
Abstract
Retrieval-augmented generation (RAG) can substantially enhance the performance of LLMs on knowledge-intensive tasks. Various RAG paradigms, including vanilla, planning-based, and iterative RAG, all depend on a robust retriever, yet existing retrievers rely heavily on public knowledge and often falter when faced with domain-specific queries. To address these limitations, we introduce DRAGON, a framework that combines a data-construction modeling approach with a scalable synthetic data-generation pipeline, specifically designed to optimize domain-specific retrieval performance and bolster retriever robustness. To evaluate RAG performance in domain-specific settings, we propose DRAGONBench, a benchmark spanning 8 domain-specific document collections across 4 distinct fields and covering a wide range of query complexity, answerability, and hop counts. Leveraging DRAGON, we generate a large-scale synthetic dataset, encompassing both single-hop and multi-hop queries, to enrich retriever training. Extensive experiments demonstrate that retrievers trained on this data yield significant performance gains and exhibit strong cross-domain generalization. Moreover, when our optimized retrievers are integrated into vanilla, planning-based, and iterative RAG paradigms, we observe consistent end-to-end improvements in system accuracy.
- Anthology ID:
- 2026.findings-eacl.56
- Volume:
- Findings of the Association for Computational Linguistics: EACL 2026
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Marquez
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1065–1078
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.56/
- Cite (ACL):
- Haiyang Shen, Hang Yan, Zhongshi Xing, Mugeng Liu, Yue Li, Zhiyang Chen, Yuxiang Wang, Jiuzheng Wang, and Yun Ma. 2026. DRAGON: Domain-specific Robust Automatic Data Generation for RAG Optimization. In Findings of the Association for Computational Linguistics: EACL 2026, pages 1065–1078, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- DRAGON: Domain-specific Robust Automatic Data Generation for RAG Optimization (Shen et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.56.pdf