Citation-Aware Continual Pre-Training for Biomedical Language Models

Masaki Asada, Tomoki Tsujimura, Tatsuya Ishigaki, Shusaku Egami, Ken Fukuda, Hiroya Takamura


Abstract
The biomedical literature contains rich structured knowledge, including citation links that encode relationships between scientific studies, but such information is typically ignored in standard language model pre-training. We propose a citation-aware continual pre-training method for decoder-only language models that incorporates citation graph information from PubMed into next-token prediction by placing citation-linked abstract pairs within a shared context. We evaluate our method on multiple biomedical QA benchmarks using two model families. Results show that citation-aware continual pre-training achieves higher average accuracy than both the original base models and citation-unaware pre-training across biomedical tasks.
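As a rough illustration of what "placing citation-linked abstract pairs within a shared context" could look like, the sketch below joins each citing abstract with the abstract it cites into a single training text. The abstract does not specify the ordering, the separator, or how missing abstracts are handled, so the function name `build_paired_contexts`, the blank-line separator, and the citing-then-cited order are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List, Tuple


def build_paired_contexts(
    abstracts: Dict[str, str],
    citations: List[Tuple[str, str]],
    separator: str = "\n\n",  # assumed separator; not specified in the abstract
) -> List[str]:
    """Join each citing abstract with the abstract it cites into one text.

    `abstracts` maps a document ID to its abstract text.
    `citations` is a list of (citing_id, cited_id) edges.
    """
    contexts = []
    for citing_id, cited_id in citations:
        # Skip edges whose endpoints are missing from the corpus.
        if citing_id not in abstracts or cited_id not in abstracts:
            continue
        # Assumed order: citing abstract first, then the cited abstract.
        contexts.append(abstracts[citing_id] + separator + abstracts[cited_id])
    return contexts
```

Each returned string would then be tokenized and used for ordinary next-token prediction, so that tokens of the second abstract are predicted with the first abstract in context.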
Anthology ID:
2026.bionlp-1.32
Volume:
BioNLP 2026
Month:
July
Year:
2026
Address:
San Diego, California
Editors:
Dina Demner-Fushman, Sophia Ananiadou, Kirk Roberts, Junichi Tsujii
Venues:
BioNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
407–412
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-1.32/
Cite (ACL):
Masaki Asada, Tomoki Tsujimura, Tatsuya Ishigaki, Shusaku Egami, Ken Fukuda, and Hiroya Takamura. 2026. Citation-Aware Continual Pre-Training for Biomedical Language Models. In BioNLP 2026, pages 407–412, San Diego, California. Association for Computational Linguistics.
Cite (Informal):
Citation-Aware Continual Pre-Training for Biomedical Language Models (Asada et al., BioNLP 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-1.32.pdf