A Pipeline to Bootstrap the Evaluation of Retrieval-Augmented Generation for the Automation of Systematic Reviews in Computer Science
Pierre Achkar, Tim Gollub, Arno Simons, Harrisen Scells, Maik Fröbe, Martin Potthast
Abstract
Automating systematic reviews (SRs), i.e., evidence-driven analyses under explicit protocol constraints, is a natural target for retrieval-augmented generation and deep research agents, yet existing benchmarks evaluate isolated subtasks or assume fixed evidence inputs. We introduce RAG4SR-CS-200, a benchmark of 200 computer science systematic reviews designed for protocol-driven systematic review automation. Each instance comprises review objectives, research questions, eligibility criteria, cleaned full-text review structure, references, and extracted tables. These elements support evaluation across key tasks in systematic review creation such as literature retrieval, eligibility screening, citation-grounded review generation, and structured table generation, in both stage-wise and end-to-end settings. RAG4SR-CS-200 provides a foundation for developing more reliable and diagnosable deep research agents for scientific evidence synthesis. Code and data are publicly available (https://github.com/webis-de/rag4sr-cs-200).
- Anthology ID:
- 2026.rag4reports-1.8
- Volume:
- Proceedings of the 1st Workshop on Multilingual Report Generation via Retrieval Augmented Generation (RAG4Reports 2026)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, CA, USA
- Editors:
- Eugene Yang, Dawn Lawrie, Sean MacAvaney, James Mayfield, Luca Soldaini, Andrew Yates
- Venues:
- RAG4Reports | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 65–70
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.rag4reports-1.8/
- Cite (ACL):
- Pierre Achkar, Tim Gollub, Arno Simons, Harrisen Scells, Maik Fröbe, and Martin Potthast. 2026. A Pipeline to Bootstrap the Evaluation of Retrieval-Augmented Generation for the Automation of Systematic Reviews in Computer Science. In Proceedings of the 1st Workshop on Multilingual Report Generation via Retrieval Augmented Generation (RAG4Reports 2026), pages 65–70, San Diego, CA, USA. Association for Computational Linguistics.
- Cite (Informal):
- A Pipeline to Bootstrap the Evaluation of Retrieval-Augmented Generation for the Automation of Systematic Reviews in Computer Science (Achkar et al., RAG4Reports 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.rag4reports-1.8.pdf