Jacob Mitchell Springer


2025

Mitigating Bias in RAG: Controlling the Embedder
Taeyoun Kim | Jacob Mitchell Springer | Aditi Raghunathan | Maarten Sap
Findings of the Association for Computational Linguistics: ACL 2025

In retrieval augmented generation (RAG) systems, each individual component—the LLM, embedder, and corpus—could introduce biases in the form of skews towards certain genders or political leanings. In this work, we study the conflict between biases of each component and their relationship to the overall bias of the RAG system, which we call bias conflict. Examining both gender and political biases as case studies, we show that bias conflict can be characterized through a linear relationship among components despite its complexity. Through fine-tuning, we demonstrate how to control the bias of the embedder while maintaining utility and reveal the importance of reverse-biasing the embedder to mitigate bias in the overall system. Additionally, we find that LLMs and tasks exhibit varying sensitivities to bias, a crucial factor to consider for debiasing. Our results underscore that a fair RAG system can be better achieved by carefully controlling the bias of the embedder rather than increasing its fairness.
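A minimal sketch of the kind of linear relationship the abstract describes, written in LaTeX. The symbols and coefficients here are illustrative assumptions, not the paper's exact formulation: B_sys is the overall system bias, B_llm and B_emb the component biases, and alpha, beta, gamma fitted sensitivity coefficients.

    % Hypothetical linear model of overall RAG bias (placeholder symbols):
    % B_sys: bias of the full RAG system; B_llm, B_emb: component biases;
    % alpha, beta: model/task-dependent sensitivities; gamma: residual offset.
    B_{\mathrm{sys}} \approx \alpha\, B_{\mathrm{llm}} + \beta\, B_{\mathrm{emb}} + \gamma

Under this assumed model, driving B_sys to zero requires B_emb ≈ -(alpha B_llm + gamma) / beta, i.e., an embedder biased in the opposite direction of the rest of the system—one way to read the abstract's point that the embedder should be reverse-biased rather than simply made fair.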

Understanding the Influence of Synthetic Data for Text Embedders
Jacob Mitchell Springer | Vaibhav Adlakha | Siva Reddy | Aditi Raghunathan | Marius Mosbach
Findings of the Association for Computational Linguistics: ACL 2025

Recent progress in developing general purpose text embedders has been driven by training on ever-growing corpora of synthetic LLM-generated data. Nonetheless, no such synthetic dataset is publicly available, which poses a barrier to studying its role in generalization. To address this issue, we first reproduce and publicly release the synthetic data proposed by Wang et al. (2024) (Mistral-E5). Our synthetic data is high quality and leads to consistent improvements in performance. Next, we critically examine where exactly synthetic data improves model generalization. Our analysis reveals that benefits from synthetic data are sparse and highly localized to individual datasets. Moreover, we observe trade-offs across task categories: data that benefits one task can degrade performance on another. Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders and challenge the notion that training on synthetic data leads to more robust embedding models across tasks.
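One way to make the "sparse and highly localized" claim concrete is to compare per-dataset scores of an embedder trained with and without the synthetic data. A minimal sketch in Python; the dataset names and all scores below are hypothetical placeholders, not results from the paper:

    # Hypothetical per-dataset scores for two embedders: a baseline and one
    # trained with additional synthetic data. All numbers are placeholders.
    baseline   = {"ArguAna": 55.1, "FiQA": 40.2, "SciFact": 71.0, "NFCorpus": 33.8}
    with_synth = {"ArguAna": 55.3, "FiQA": 45.9, "SciFact": 70.4, "NFCorpus": 33.9}

    # Per-dataset deltas: localized benefits show up as a few large positive
    # deltas amid many near-zero or negative ones.
    deltas = {name: with_synth[name] - baseline[name] for name in baseline}
    improved = {name: d for name, d in deltas.items() if d > 0.5}
    degraded = {name: d for name, d in deltas.items() if d < -0.5}

    print(f"mean delta: {sum(deltas.values()) / len(deltas):+.2f}")
    print("clearly improved:", improved)  # gains concentrated in a few datasets
    print("clearly degraded:", degraded)  # trade-offs against other tasks

An aggregate average can look flat while individual datasets swing sharply in both directions, which is the pattern the abstract describes.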