CRAB: A Benchmark for Evaluating Curation of Retrieval-Augmented LLMs in Biomedicine

Hanmeng Zhong, Linqing Chen, Wentao Wu, Weilei Wang


Abstract
Recent developments in Retrieval-Augmented Large Language Models (LLMs) have shown great promise in biomedical applications. However, a critical gap persists in reliably evaluating their curation ability—the process by which models select and integrate relevant references while filtering out noise. To address this, we introduce the benchmark for Curation of Retrieval-Augmented LLMs in Biomedicine (CRAB), the first multilingual benchmark tailored for evaluating the biomedical curation of retrieval-augmented LLMs, available in English, French, German, and Chinese. By incorporating a novel citation-based evaluation metric, CRAB quantifies the curation performance of retrieval-augmented LLMs in biomedicine. Experimental results reveal significant discrepancies in the curation performance of mainstream LLMs, underscoring the urgent need to improve curation in the biomedical domain.
Anthology ID:
2025.emnlp-industry.3
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Saloni Potdar, Lina Rojas-Barahona, Sebastien Montella
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
34–49
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-industry.3/
Cite (ACL):
Hanmeng Zhong, Linqing Chen, Wentao Wu, and Weilei Wang. 2025. CRAB: A Benchmark for Evaluating Curation of Retrieval-Augmented LLMs in Biomedicine. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 34–49, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
CRAB: A Benchmark for Evaluating Curation of Retrieval-Augmented LLMs in Biomedicine (Zhong et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-industry.3.pdf