LinkBERT: Pretraining Language Models with Document Links

Michihiro Yasunaga, Jure Leskovec, Percy Liang


Abstract
Language model (LM) pretraining captures various knowledge from text corpora, helping downstream tasks. However, existing methods such as BERT model a single document, and do not capture dependencies or knowledge that span across documents. In this work, we propose LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks. Given a text corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then pretrain the LM with two joint self-supervised objectives: masked language modeling and our new proposal, document relation prediction. We show that LinkBERT outperforms BERT on various downstream tasks across two domains: the general domain (pretrained on Wikipedia with hyperlinks) and biomedical domain (pretrained on PubMed with citation links). LinkBERT is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA), and our biomedical LinkBERT sets new states of the art on various BioNLP tasks (+7% on BioASQ and USMLE). We release our pretrained models, LinkBERT and BioLinkBERT, as well as code and data.
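The abstract describes the core recipe: treat the corpus as a document graph, pack linked documents into the same LM context, and train with masked language modeling plus document relation prediction (DRP). The sketch below is a minimal, illustrative construction of such input pairs with three-way DRP labels (contiguous, random, linked), under the assumption that the second segment is sampled uniformly from these options; the names (build_segment_pair, docs, doc_graph) are hypothetical and not the released implementation.

```python
import random

# Minimal sketch of LinkBERT-style input construction (illustrative only).
# `docs` maps a document ID to its list of text segments, and
# `doc_graph` maps a document ID to the IDs of documents it links to.

def build_segment_pair(doc_id, seg_idx, docs, doc_graph, rng=random):
    """Return (segment_a, segment_b, drp_label) for one training example.

    segment_b is drawn from one of three options:
      - "contiguous": the next segment of the same document,
      - "random":     a segment of an unrelated document,
      - "linked":     a segment of a document linked from doc_id.
    """
    segment_a = docs[doc_id][seg_idx]
    choice = rng.choice(["contiguous", "random", "linked"])

    if choice == "contiguous" and seg_idx + 1 < len(docs[doc_id]):
        segment_b = docs[doc_id][seg_idx + 1]
        label = "contiguous"
    elif choice == "linked" and doc_graph.get(doc_id):
        linked_id = rng.choice(doc_graph[doc_id])
        segment_b = rng.choice(docs[linked_id])
        label = "linked"
    else:
        other_id = rng.choice([d for d in docs if d != doc_id])
        segment_b = rng.choice(docs[other_id])
        label = "random"

    return segment_a, segment_b, label


if __name__ == "__main__":
    docs = {
        "A": ["Tidal forces arise from gravity.", "They explain ocean tides."],
        "B": ["The Moon orbits the Earth."],
        "C": ["Basalt is a volcanic rock."],
    }
    doc_graph = {"A": ["B"], "B": ["A"], "C": []}  # hyperlink structure
    seg_a, seg_b, drp_label = build_segment_pair("A", 0, docs, doc_graph)
    print(drp_label, "|", seg_a, "||", seg_b)
```

Each pair would then be packed into a single context ([CLS] segment_a [SEP] segment_b [SEP]) and trained jointly: masked language modeling over the tokens, and a classifier over the pooled representation predicting the DRP label, corresponding to the two self-supervised objectives named in the abstract.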
Anthology ID:
2022.acl-long.551
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
8003–8016
URL:
https://aclanthology.org/2022.acl-long.551
DOI:
10.18653/v1/2022.acl-long.551
Cite (ACL):
Michihiro Yasunaga, Jure Leskovec, and Percy Liang. 2022. LinkBERT: Pretraining Language Models with Document Links. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8003–8016, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
LinkBERT: Pretraining Language Models with Document Links (Yasunaga et al., ACL 2022)
PDF:
https://preview.aclanthology.org/add_acl24_videos/2022.acl-long.551.pdf
Code
 michiyasunaga/LinkBERT
Data
BC2GM, BC5CDR, BIOSSES, BLURB, BioASQ, BookCorpus, ChemProt, DDI, GAD, GLUE, HOC, HotpotQA, JNLPBA, MMLU, MRQA, MedQA, NCBI Disease, Natural Questions, NewsQA, PubMedQA, SQuAD, SearchQA, TriviaQA