CiteBART: Learning to Generate Citations for Local Citation Recommendation
Ege Yiğit Çelik | Selma Tekir
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Local citation recommendation (LCR) suggests a set of papers for a citation placeholder within a given context. This paper introduces CiteBART, a citation-specific pre-training scheme within an encoder-decoder architecture, in which author-date citation tokens are masked and the model learns to reconstruct them, thereby performing LCR generatively. The global version (CiteBART-Global) extends the local context with the citing paper's title and abstract to enrich the learning signal. CiteBART-Global achieves state-of-the-art performance on LCR benchmarks except for the FullTextPeerRead dataset, which is too small for the advantage of generative pre-training to show. The effect is significant on the larger benchmarks, e.g., Refseer and ArXiv, with the Refseer pre-trained model emerging as the best-performing model. We perform comprehensive experiments, including an ablation study, a qualitative analysis, and a taxonomy of hallucinations with detailed statistics. Our analyses confirm that CiteBART-Global generalizes across datasets; the macro hallucination rate (MaHR) at the top-3 predictions is 4%, and when the ground truth is in the top-k prediction list, the hallucination tendency in the other predictions drops significantly. We publicly share our code, base datasets, global datasets, and pre-trained models to support reproducibility.
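To make the pre-training objective concrete, here is a minimal sketch of masked citation reconstruction, assuming the HuggingFace transformers API with a facebook/bart-base checkpoint; the example context and citation string are invented for illustration and are not taken from the authors' released code.

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Local context with the author-date citation replaced by BART's mask token.
# Per the abstract, CiteBART-Global would additionally prepend the citing
# paper's title and abstract to this context (not shown here).
context = (
    "Transformer architectures have become the dominant approach to "
    f"sequence modeling {tokenizer.mask_token} , and subsequent work "
    "has scaled them to billions of parameters."
)
# Target: the author-date citation the model learns to reconstruct.
citation = "(Vaswani et al., 2017)"

inputs = tokenizer(context, return_tensors="pt")
labels = tokenizer(citation, return_tensors="pt").input_ids

# Seq2seq denoising loss: generate the citation from the masked context.
loss = model(**inputs, labels=labels).loss
loss.backward()

At inference time, the same masked context would be fed to model.generate to produce ranked citation candidates via beam search, matching the generative formulation of LCR described above.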