Puxuan Yu
2026
Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers
Qingcheng Zeng | Yuheng Lu | Zeqi Zhou | Heli Qi | Puxuan Yu | Fuheng Zhao | Hitomi Yanaka | Weihao Xuan | Naoto Yokoya
Findings of the Association for Computational Linguistics: ACL 2026
Code-switching is a pervasive linguistic phenomenon in global communication, yet modern information retrieval systems remain predominantly designed for, and evaluated within, monolingual contexts. To bridge this critical disconnect, we present a holistic study dedicated to code-switching IR. We introduce CSR-L (Code-Switching Retrieval benchmark-Lite), constructing a dataset via human annotation to capture the authentic naturalness of mixed-language queries. Our evaluation across statistical, dense, and late-interaction paradigms reveals that code-switching acts as a fundamental performance bottleneck, degrading the effectiveness of even robust multilingual models. We demonstrate that this failure stems from substantial divergence in the embedding space between pure and code-switched text. Scaling this investigation, we propose CS-MTEB, a comprehensive benchmark covering 11 diverse tasks, where we observe performance declines of up to 27%. Finally, we show that standard multilingual techniques like vocabulary expansion are insufficient to resolve these deficits completely. These findings underscore the fragility of current systems and establish code-switching as a crucial frontier for future IR optimization.
2025
Explain then Rank: Scale Calibration of Neural Rankers Using Natural Language Explanations from LLMs
Puxuan Yu | Daniel Cohen | Hemank Lamba | Joel R. Tetreault | Alejandro Jaimes
Findings of the Association for Computational Linguistics: ACL 2025
In search settings, calibrating the scores during the ranking process to quantities such as click-through rates or relevance levels enhances a system’s usefulness and trustworthiness for downstream users. While previous research has improved this notion of calibration for low-complexity learning-to-rank models, the larger data demands and parameter count specific to modern neural text rankers produce unique obstacles that hamper the efficacy of methods intended for the learning-to-rank setting. This paper proposes exploiting large language models (LLMs) to provide relevance and uncertainty signals for these neural text rankers to produce scale-calibrated scores through Monte Carlo sampling of natural language explanations (NLEs). Our approach transforms the neural ranking task from ranking textual query-document pairs to ranking corresponding synthesized NLEs. Comprehensive experiments on two popular document ranking datasets show that the NLE-based calibration approach consistently outperforms past calibration methods and LLM-based methods for ranking, calibration, and query performance prediction tasks.
2024
Language Concept Erasure for Language-invariant Dense Retrieval
Zhiqi Huang | Puxuan Yu | Shauli Ravfogel | James Allan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Multilingual models aim for language-invariant representations but still prominently encode language identity. This, along with the scarcity of high-quality parallel retrieval data, limits their performance in retrieval. We introduce LANCER, a multi-task learning framework that improves language-invariant dense retrieval by reducing language-specific signals in the embedding space. Leveraging the notion of linear concept erasure, we design a loss function that penalizes cross-correlation between representations and their language labels. LANCER leverages only English retrieval data and general multilingual corpora, training models to focus on language-invariant retrieval by semantic similarity without necessitating a vast parallel corpus. Experimental results on various datasets show our method consistently improves over baselines, with extensive analyses demonstrating greater language agnosticism.