Training Biomedical Retrievers From Large-Scale Citation Contexts

Xing David Wang, Duy Le Thanh, Ulf Leser


Abstract
The MedCPT model has demonstrated that strong biomedical retrievers can be trained using proprietary PubMed search logs. In this work, we study whether freely available citation sentences are sufficient to train similarly effective models. We construct a large-scale training dataset of ~62 million citation sentence-abstract pairs extracted from PubMed Central. We train a lightweight BERT-based retriever-reranker model called CiteRec on this dataset and evaluate it across three benchmark settings: (a) the biomedical subset of BEIR for information retrieval, (b) SciRepEval for generalizable scientific document embeddings, and (c) CitancePlus, a new set of ~90 thousand citation sentence-abstract pairs for PubMed-scale citation recommendation. We show that CiteRec performs competitively with MedCPT on the biomedical BEIR subset and outperforms it on SciRepEval. On CitancePlus, CiteRec achieves strong performance for citation recommendation over the full PubMed corpus, outperforming both MedCPT and a substantially larger Qwen3-Embedding-8B retriever.
Anthology ID:
2026.bionlp-1.7
Volume:
BioNLP 2026
Month:
July
Year:
2026
Address:
San Diego, California
Editors:
Dina Demner-Fushman, Sophia Ananiadou, Kirk Roberts, Junichi Tsujii
Venues:
BioNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
75–83
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-1.7/
Cite (ACL):
Xing David Wang, Duy Le Thanh, and Ulf Leser. 2026. Training Biomedical Retrievers From Large-Scale Citation Contexts. In BioNLP 2026, pages 75–83, San Diego, California. Association for Computational Linguistics.
Cite (Informal):
Training Biomedical Retrievers From Large-Scale Citation Contexts (Wang et al., BioNLP 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-1.7.pdf