Sercan O Arik

Also published as: Sercan Arik

2025

pdf bib abs
Embedding-Converter: A Unified Framework for Cross-Model Embedding Transformation
Jinsung Yoon | Sercan O Arik
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Embedding models play a crucial role in machine learning. However, the continuous development of new models presents a major challenge: migrating to a potentially superior model often requires the computationally expensive process of re-embedding entire datasets—without any guarantee of performance improvement. This paper presents Embedding-Converter, a novel framework for efficiently transforming embeddings between different models, thus avoiding costly ‘re-embedding’. The proposed approach achieves 100 times faster and cheaper computations in real-world applications. Experiments show that Embedding-Converter not only streamlines transitions to new models, but can also improve upon the source model’s performance, approaching that of the target model. This facilitates efficient evaluation and broader adoption of new embedding models by significantly reducing the overhead of model switching. Furthermore, Embedding-Converter addresses latency limitations by enabling the use of smaller models for online tasks while still benefiting from the performance of larger models offline. By promoting the release of converters alongside new embedding models, Embedding-Converter fosters a more dynamic and accessible ecosystem for embedding model development and deployment.

pdf bib abs
Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models
Fei Wang | Xingchen Wan | Ruoxi Sun | Jiefeng Chen | Sercan O Arik
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Retrieval augmented generation (RAG), while effectively integrating external knowledge to address the inherent limitations of large language models (LLMs), can be hindered by imperfect retrieval that contain irrelevant, misleading, or even malicious information. Previous studies have rarely connected the behavior of RAG through joint analysis, particularly regarding error propagation coming from imperfect retrieval and potential conflicts between LLMs’ internal knowledge and external sources. Through comprehensive and controlled analyses under realistic conditions, we find that imperfect retrieval augmentation is inevitable, common, and harmful. We identify the knowledge conflicts between LLM-internal and external knowledge from retrieval as a bottleneck to overcome imperfect retrieval in the post-retrieval stage of RAG. To address this, we propose Astute RAG, a novel RAG approach designed to be resilient to imperfect retrieval augmentation. It adaptively elicits essential information from LLMs’ internal knowledge, iteratively consolidates internal and external knowledge with source-awareness, and finalizes the answer according to information reliability. Our experiments with Gemini and Claude demonstrate the superior performance of Astute RAG compared to previous robustness-enhanced RAG approaches. Specifically, Astute RAG is the only RAG method that achieves performance comparable to or even surpassing conventional use of LLMs under the worst-case scenario. Further analysis reveals the effectiveness of Astute RAG in resolving knowledge conflicts, thereby improving the trustworthiness of RAG.

pdf bib abs
Data-Centric Improvements for Enhancing Multi-Modal Understanding in Spoken Conversation Modeling
Maximillian Chen | Ruoxi Sun | Sercan O Arik
Findings of the Association for Computational Linguistics: ACL 2025

Conversational assistants are increasingly popular across diverse real-world applications, highlighting the need for advanced multimodal speech modeling. Speech, as a natural mode of communication, encodes rich user-specific characteristics such as speaking rate and pitch, making it critical for effective interaction. Our work introduces a data-centric customization approach for efficiently enhancing multimodal understanding in conversational speech modeling. Central to our contributions is a novel multi-task learning paradigm that involves designing auxiliary tasks to utilize a small amount of speech data. Our approach achieves state-of-the-art performance on the Spoken-SQuAD benchmark, using only 10% of the training data with open-weight models, establishing a robust and efficient framework for audio-centric conversational modeling. We also introduce ASK-QA, the first dataset for multi-turn spoken dialogue with ambiguous user requests and dynamic evaluation inputs.

pdf bib abs
SQUARE: Unsupervised Retrieval Adaptation via Synthetic Data
Jinsung Yoon | Junhao Zeng | Sercan O Arik
Findings of the Association for Computational Linguistics: EMNLP 2025

Pre-trained retrieval models often face challenges in zero-shot retrieval for knowledge-based question answering, as different tasks rely on different corpora. We introduce SQUARE (Synthetic QUery-based Adaptive REtrieval), a novel method for corpus-specific unsupervised retrieval customization. SQUARE leverages LLMs to generate grounded synthetic question-answer pairs from the corpus, which are then used to fine-tune the retriever. A filtering mechanism based on the synthetic answers is employed to ensure high quality of tuning data. Extensive experiments on various datasets demonstrate superior performance of SQUARE compared to zero-shot retrieval and other customization methods, highlighting the value of corpus adaptation for effective retrieval.

pdf bib abs
FLAIRR-TS - Forecasting LLM-Agents with Iterative Refinement and Retrieval for Time Series
Gunjan Jalori | Preetika Verma | Sercan O Arik
Findings of the Association for Computational Linguistics: EMNLP 2025

Time series Forecasting with large language models (LLMs) requires bridging numerical patterns and natural language. Effective forecasting on LLM often relies on extensive pre-processing and fine-tuning. Recent studies show that a frozen LLM can rival specialized forecasters when supplied with a carefully engineered natural-language prompt, but crafting such a prompt for each task is itself onerous and ad-hoc. We introduce FLAIRR-TS, a test-time prompt optimization framework that utilizes an agentic system: a Forecaster-agent generates forecasts using an initial prompt, which is then refined by a refiner agent, informed by past outputs and retrieved analogs. This adaptive prompting generalizes across domains using creative prompt templates and generates high-quality forecasts without intermediate code generation. Experiments on benchmark datasets show FLAIRR-TS improves forecasting over static prompting and retrieval-augmented baselines, approaching the performance of specialized prompts.FLAIRR-TS provides a practical alternative to fine-tuning, achieving strong performance via its agentic approach to adaptive prompt refinement and retrieval.

2024

pdf bib abs
Search-Adaptor: Embedding Customization for Information Retrieval
Jinsung Yoon | Yanfei Chen | Sercan Arik | Tomas Pfister
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Embeddings extracted by pre-trained Large Language Models (LLMs) have significant potential to improve information retrieval and search. Beyond the zero-shot setup in which they are being conventionally used, being able to take advantage of the information from the relevant query-corpus paired data can further boost the LLM capabilities. In this paper, we propose a novel method, Search-Adaptor, for customizing LLMs for information retrieval in an efficient and robust way. Search-Adaptor modifies the embeddings generated by pre-trained LLMs, and can be integrated with any LLM, including those only available via prediction APIs. On multiple English, multilingual, and multimodal retrieval datasets, we show consistent and significant performance benefits for Search-Adaptor – e.g., more than 5% improvements for Google Embedding APIs in nDCG@10 averaged over 14 BEIR datasets.

pdf bib abs
Matryoshka-Adaptor: Unsupervised and Supervised Tuning for Smaller Embedding Dimensions
Jinsung Yoon | Rajarishi Sinha | Sercan O Arik | Tomas Pfister
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Embeddings from Large Language Models (LLMs) have emerged as critical components in various applications, particularly for information retrieval. While high-dimensional embeddings generally demonstrate superior performance as they contain more salient information, their practical application is frequently hindered by elevated computational latency and the associated higher cost. To address these challenges, we propose Matryoshka-Adaptor, a novel tuning framework designed for the customization of LLM embeddings. Matryoshka-Adaptor facilitates substantial dimensionality reduction while maintaining comparable performance levels, thereby achieving a significant enhancement in computational efficiency and cost-effectiveness. Our framework directly modifies the embeddings from pre-trained LLMs which is designed to be seamlessly integrated with any LLM architecture, encompassing those accessible exclusively through black-box APIs. Also, it exhibits efficacy in both unsupervised and supervised learning settings. A rigorous evaluation conducted across a diverse corpus of English, multilingual, and multimodal datasets consistently reveals substantial gains with Matryoshka-Adaptor. Notably, with Google and OpenAI Embedding APIs, Matryoshka-Adaptor achieves a reduction in dimensionality ranging from two- to twelve-fold without compromising performance across multiple BEIR datasets.

Large language models (LLMs) have attracted great interest in many real-world applications; however, their “black-box” nature necessitates scalable and faithful explanations. Shapley values have matured as an explainability method for deep learning, but extending them to LLMs is difficult due to long input contexts and autoregressive output generation. We introduce , an efficient post-hoc explanation method incorporating LLM-specific techniques, which leads to significant runtime improvements: token-level explanations in minutes not hours, and document-level explanations within seconds. We demonstrate how such explanations can improve end-to-end performance of retrieval augmented generation by localizing important words within long documents and reranking passages collected by retrieval systems. On various open-domain question answering benchmarks, we show TextGenSHAP improves the retrieval recall and prediction accuracy significantly.

pdf bib abs
Effective Large Language Model Adaptation for Improved Grounding and Citation Generation
Xi Ye | Ruoxi Sun | Sercan Arik | Tomas Pfister
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large language models (LLMs) have achieved remarkable advancements in natural language understanding and generation. However, one major issue towards their widespread deployment in the real world is that they can generate “hallucinated” answers that are not factual.Towards this end, this paper focuses on improving LLMs by grounding their responses in retrieved passages and by providing citations. We propose a new framework, AGREE, Adaptation for GRounding EnhancEment, that improves the grounding from a holistic perspective. Our framework tunes LLMs to self-ground the claims in their responses and provide accurate citations to retrieved documents. This tuning on top of the pre-trained LLMs requires well-grounded responses (with citations) for paired queries, for which we introduce a method that can automatically construct such data from unlabeled queries. The self-grounding capability of tuned LLMs further grants them a test-time adaptation (TTA) capability that can actively retrieve passages to support the claims that have not been grounded, which iteratively improves the responses of LLMs. Across five datasets and two LLMs, our results show that the proposed tuning-based framework generates superior grounded responses with more accurate citations compared to prompting-based approaches and post-hoc citing-based approaches.

2023

A hallmark of modern large language models (LLMs) is their impressive general zero-shot and few-shot abilities, often elicited through in-context learning (ICL) via prompting. However, while highly coveted and being the most general, zero-shot performances in LLMs are still typically weaker due to the lack of guidance and the difficulty of applying existing automatic prompt design methods in general tasks when ground-truth labels are unavailable. In this study, we address this by presenting Universal Self-Adaptive Prompting (USP), an automatic prompt design approach specifically tailored for zero-shot learning (while compatible with few-shot). Requiring only a small amount of unlabeled data and an inference-only LLM, USP is highly versatile: to achieve universal prompting, USP categorizes a possible NLP task into one of the three possible task types and then uses a corresponding selector to select the most suitable queries and zero-shot model-generated responses as pseudo-demonstrations, thereby generalizing ICL to the zero-shot setup in a fully automated way. We evaluate USP with PaLM and PaLM 2 models and demonstrate performances that are considerably stronger than standard zero-shot baselines and often comparable to or even superior to few-shot baselines across more than 40 natural language understanding, natural language generation, and reasoning tasks.

pdf bib abs
Better Zero-Shot Reasoning with Self-Adaptive Prompting
Xingchen Wan | Ruoxi Sun | Hanjun Dai | Sercan Arik | Tomas Pfister
Findings of the Association for Computational Linguistics: ACL 2023

Modern large language models (LLMs) have demonstrated impressive capabilities at sophisticated tasks, often through step-by-step reasoning similar to humans. This is made possible by their strong few- and zero-shot abilities – they can effectively learn from a handful of handcrafted, completed responses (“in-context examples”), or are prompted to reason spontaneously through specially designed triggers. Nonetheless, some limitations have been observed. First, performance in the few-shot setting is sensitive to the choice of the examples, whose design requires significant human effort. Moreover, given the diverse downstream tasks of LLMs, it may be difficult or laborious to handcraft per-task labels. Second, while the zero-shot setting does not require handcrafting, its performance is limited due to the lack of guidance to the LLMs. To address these limitations, we propose Consistency-based Self-adaptive Prompting (COSP), a novel prompt design method for LLMs. Requiring neither handcrafted responses nor ground-truth labels, COSP selects and builds the set of examples from the LLM zero-shot outputs via carefully designed criteria combining consistency, diversity and repetition. In the zero-shot setting for three different LLMs, we show that using only LLM predictions, COSP significantly improves performance up to 15% compared to zero-shot baselines and matches or exceeds few-shot baselines at a range of reasoning tasks.

Text-to-SQL aims to automate the process of generating SQL queries on a database from natural language text. In this work, we propose “SQLPrompt”, tailored to improve the few-shot prompting capabilities of Text-to-SQL for Large Language Models (LLMs). Our methods include innovative prompt design, execution-based consistency decoding strategy which selects the SQL with the most consistent execution outcome among other SQL proposals, and a method that aims to improve performance by diversifying the SQL proposals during consistency selection with different prompt designs (“MixPrompt”) and foundation models (“MixLLMs”). We show that SQLPrompt outperforms previous approaches for in-context learning with zero labeled data by a large margin, closing the gap with finetuning state-of-the-art with thousands of labeled data.

Large language models (LLMs) have recently shown great advances in a variety of tasks, including natural language understanding and generation. However, their use in high-stakes decision-making scenarios is still limited due to the potential for errors. *Selective prediction* is a technique that can be used to improve the reliability of the LLMs by allowing them to abstain from making predictions when they are unsure of the answer. In this work, we propose a novel framework for adaptation with self-evaluation to improve the selective prediction performance of LLMs. Our framework is based on the idea of using parameter-efficient tuning to adapt the LLM to the specific task at hand while improving its ability to perform self-evaluation. We evaluate our method on a variety of question-answering (QA) datasets and show that it outperforms state-of-the-art selective prediction methods. For example, on the CoQA benchmark, our method improves the AUACC from 91.23% to 92.63% and improves the AUROC from 74.61% to 80.25%.