Studying the Role of Input-Neighbor Overlap in Retrieval-Augmented Language Models Training Efficiency

Ehsan Doostmohammadi, Marco Kuhlmann


Abstract
Retrieval-augmented language models have demonstrated performance comparable to much larger models while requiring fewer computational resources. The effectiveness of these models crucially depends on the overlap between the query and the retrieved context, but the optimal degree of this overlap remains unexplored. In this paper, we systematically investigate how varying levels of query–context overlap affect model performance during both training and inference. Our experiments reveal that increased overlap initially has minimal effect, but above a critical threshold it substantially improves test-time perplexity and accelerates model learning. Building on these findings, we demonstrate that deliberately increasing overlap through synthetic context, generated by paraphrasing queries, can enhance data efficiency and reduce training time by approximately 40% without compromising performance. We validate our perplexity-based findings on question-answering tasks, confirming that the benefits of retrieval-augmented language modeling extend to practical applications. Our results provide empirical evidence of significant optimization potential for retrieval mechanisms in language model pretraining.
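The abstract turns on quantifying input-neighbor overlap between a query and its retrieved context. As a minimal sketch of one plausible such measure, assuming token-level unigram overlap (the paper's exact metric is not specified here, and the `token_overlap` function is purely illustrative):

```python
# Minimal sketch: token-level unigram overlap between a query chunk and a
# retrieved neighbor. The metric and whitespace tokenization are illustrative
# assumptions; the paper's actual overlap measure may differ.

def token_overlap(query: str, neighbor: str) -> float:
    """Fraction of the query's unique tokens that also appear in the neighbor."""
    query_tokens = set(query.lower().split())
    neighbor_tokens = set(neighbor.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & neighbor_tokens) / len(query_tokens)

# A paraphrased neighbor yields high overlap with the query, mimicking the
# synthetic-context setup the abstract describes.
query = "retrieval augmented language models require fewer resources"
paraphrase = "retrieval augmented language models need fewer computational resources"
print(f"overlap = {token_overlap(query, paraphrase):.2f}")  # ~0.86
```

Under this reading, raising overlap above the critical threshold would amount to retrieving (or synthesizing) neighbors that score high on such a measure against the training query.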
Anthology ID:
2025.emnlp-main.1363
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
26835–26844
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1363/
Cite (ACL):
Ehsan Doostmohammadi and Marco Kuhlmann. 2025. Studying the Role of Input-Neighbor Overlap in Retrieval-Augmented Language Models Training Efficiency. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 26835–26844, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Studying the Role of Input-Neighbor Overlap in Retrieval-Augmented Language Models Training Efficiency (Doostmohammadi & Kuhlmann, EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1363.pdf
Checklist:
 2025.emnlp-main.1363.checklist.pdf