Biomed-Enriched: Data-Efficient Biomedical Pretraining via Paragraph-Level Annotation

Rian Touchent, Nathan Godey, Éric Villemonte de la Clergerie


Abstract
We annotate PubMed Central paragraphs for document type, domain, and educational quality using a two-stage pipeline: Llama-3.1-70B labels 400K paragraphs, then a fine-tuned XLM-RoBERTa propagates these annotations to the full corpus. This paragraph-level approach captures content diversity within scientific articles that document-level labels miss. The resulting Biomed-Enriched corpus contains 2M clinical case paragraphs, providing a publicly available alternative to restricted clinical datasets. For decoders, continual pretraining experiments show targeted improvements: clinical upsampling boosts performance by 4 points on MMLU ProfMed, and educational filtering improves MedQA and MedMCQA by ~1 point. Combining these techniques leads to faster convergence, reaching the same performance with a third of the training tokens. For encoders, our best recipe matches BioClinical-ModernBERT on 11 tasks (77.3% vs 77.1% F1) while using 2.5x fewer tokens and only public data.
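As a rough illustration of how the paragraph-level annotations described in the abstract could drive data selection, the sketch below applies educational filtering and clinical upsampling to a list of annotated paragraphs. It is not the authors' implementation: the `Paragraph` fields, the label value `"clinical_case"`, the score scale, the threshold `min_edu_score`, and the repetition factor `clinical_repeat` are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    text: str
    doc_type: str     # assumed label values, e.g. "clinical_case", "study", "review"
    domain: str       # assumed label values, e.g. "clinical", "biomedical"
    edu_score: float  # assumed educational-quality score (higher is better)


def build_mixture(paragraphs, min_edu_score=3.0, clinical_repeat=3):
    """Drop low-scoring paragraphs, then repeat clinical case paragraphs.

    Both default values are placeholders, not values taken from the paper.
    """
    mixture = []
    for p in paragraphs:
        if p.edu_score < min_edu_score:
            continue  # educational filtering
        copies = clinical_repeat if p.doc_type == "clinical_case" else 1
        mixture.extend([p] * copies)  # clinical upsampling by repetition
    return mixture


# Toy corpus with made-up annotations.
corpus = [
    Paragraph("A 54-year-old patient presented with ...", "clinical_case", "clinical", 4.0),
    Paragraph("Samples were centrifuged at ...", "study", "biomedical", 2.0),
    Paragraph("Recent work on this pathway suggests ...", "review", "biomedical", 3.5),
]
mixture = build_mixture(corpus)
```

With these placeholder settings, the low-scoring study paragraph is dropped and the clinical case appears three times in the mixture. In the pipeline the abstract describes, the labels would come from the fine-tuned classifier rather than being written by hand.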
Anthology ID:
2026.findings-acl.1713
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
34276–34287
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1713/
Cite (ACL):
Rian Touchent, Nathan Godey, and Éric Villemonte de la Clergerie. 2026. Biomed-Enriched: Data-Efficient Biomedical Pretraining via Paragraph-Level Annotation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 34276–34287, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Biomed-Enriched: Data-Efficient Biomedical Pretraining via Paragraph-Level Annotation (Touchent et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1713.pdf
Checklist:
 2026.findings-acl.1713.checklist.pdf