Biomed-Enriched: Data-Efficient Biomedical Pretraining via Paragraph-Level Annotation
Rian Touchent, Nathan Godey, Éric Villemonte de la Clergerie
Abstract
We annotate PubMed Central paragraphs for document type, domain, and educational quality using a two-stage pipeline: Llama-3.1-70B labels 400K paragraphs, then a fine-tuned XLM-RoBERTa propagates annotations to the full corpus. This paragraph-level approach captures content diversity within scientific articles that document-level labels miss. The resulting Biomed-Enriched corpus contains 2M clinical case paragraphs, providing a publicly available alternative to restricted clinical datasets. For decoders, continual pretraining experiments enable targeted improvements, with clinical upsampling boosting performance by 4 points on MMLU ProfMed and educational filtering improving MedQA and MedMCQA by ~1 point. Combinations of these techniques led to faster convergence, reaching the same performance with a third of training tokens. For encoders, our best recipe matches BioClinical-ModernBERT on 11 tasks (77.3% vs 77.1% F1) while using 2.5x fewer tokens and only public data.
- Anthology ID:
- 2026.findings-acl.1713
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 34276–34287
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1713/
- Cite (ACL):
- Rian Touchent, Nathan Godey, and Éric Villemonte de la Clergerie. 2026. Biomed-Enriched: Data-Efficient Biomedical Pretraining via Paragraph-Level Annotation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 34276–34287, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Biomed-Enriched: Data-Efficient Biomedical Pretraining via Paragraph-Level Annotation (Touchent et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1713.pdf