Evaluating the Effectiveness and Scalability of LLM-Based Data Augmentation for Retrieval
Pranjal A Chitale, Bishal Santra, Yashoteja Prabhu, Amit Sharma
Abstract
Compact dual-encoder models are widely used for retrieval owing to their efficiency and scalability. However, such models often underperform their Large Language Model (LLM)-based retrieval counterparts, likely due to limited world knowledge. While LLM-based data augmentation has been proposed as a strategy to bridge this performance gap, there is insufficient understanding of its effectiveness and scalability to real-world retrieval problems. Existing research does not systematically explore key factors such as the optimal augmentation scale, the necessity of using large augmentation models, and whether diverse augmentations improve generalization, particularly in out-of-distribution (OOD) settings. This work presents a comprehensive study of the effectiveness of LLM augmentation for retrieval, comprising over 100 distinct experimental settings of retrieval models, augmentation models, and augmentation strategies. We find that, while augmentation enhances retrieval performance, its benefits diminish beyond a certain scale, even with diverse augmentation strategies. Surprisingly, we observe that augmentation with smaller LLMs can achieve performance competitive with larger augmentation models. Moreover, we examine how augmentation effectiveness varies with retrieval model pre-training, revealing that augmentation provides the most benefit to models that are not well pre-trained. Our insights pave the way for more judicious and efficient augmentation strategies, enabling informed decisions that maximize retrieval performance while remaining cost-effective.
- Anthology ID: 2025.emnlp-main.888
- Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 17592–17628
- URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.888/
- Cite (ACL): Pranjal A Chitale, Bishal Santra, Yashoteja Prabhu, and Amit Sharma. 2025. Evaluating the Effectiveness and Scalability of LLM-Based Data Augmentation for Retrieval. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17592–17628, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): Evaluating the Effectiveness and Scalability of LLM-Based Data Augmentation for Retrieval (Chitale et al., EMNLP 2025)
- PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.888.pdf
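To make the setup the abstract studies concrete, here is a minimal, hypothetical sketch of an LLM-based augmentation pipeline for retrieval: an LLM writes a synthetic query for each unlabeled passage, and the resulting (query, passage) pairs fine-tune a compact dual encoder with an in-batch-negative contrastive (InfoNCE) loss. The model names, prompt, and loss choice below are illustrative assumptions, not the paper's actual configuration, which spans many retrieval models, augmentation models, and strategies.

```python
# Hypothetical sketch (not the paper's setup): LLM query generation
# followed by contrastive dual-encoder fine-tuning.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Step 1: synthetic query generation with a (small) augmentation LLM.
gen_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder augmentation model
gen_tok = AutoTokenizer.from_pretrained(gen_name)
gen_lm = AutoModelForCausalLM.from_pretrained(gen_name)

def synth_query(passage: str) -> str:
    """Ask the LLM for one search query answerable by `passage`."""
    prompt = f"Write one short search query answered by this passage:\n{passage}\nQuery:"
    ids = gen_tok(prompt, return_tensors="pt").input_ids
    out = gen_lm.generate(ids, max_new_tokens=32, do_sample=True, top_p=0.9)
    return gen_tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True).strip()

# Step 2: contrastive fine-tuning of a compact dual encoder.
enc_name = "sentence-transformers/all-MiniLM-L6-v2"  # placeholder retriever
enc_tok = AutoTokenizer.from_pretrained(enc_name)
encoder = AutoModel.from_pretrained(enc_name)

def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pooled, L2-normalized embeddings from the shared encoder."""
    batch = enc_tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)
    return F.normalize(pooled, dim=-1)

def info_nce_loss(queries: list[str], passages: list[str], tau: float = 0.05):
    """In-batch negatives: passage i is the positive for query i."""
    q, p = embed(queries), embed(passages)
    logits = q @ p.T / tau  # cosine similarities scaled by temperature
    labels = torch.arange(len(queries))
    return F.cross_entropy(logits, labels)

passages = ["Dual encoders embed queries and documents independently ..."]
queries = [synth_query(p) for p in passages]
loss = info_nce_loss(queries, passages)
loss.backward()  # a real training loop would follow with an optimizer step
```

Under this framing, the paper's axes of study map onto the sketch's knobs: the augmentation scale is how many synthetic pairs are generated, the augmentation model is `gen_name`, and the retriever's pre-training quality is the choice of `enc_name`.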