Yueyang Su


2026

Large embedding models have become the backbone of modern retrieval systems, offering strong semantic representations at the cost of substantial storage and computation. While recent work explores quantizing embeddings into discrete document identifiers for generative retrieval, most existing approaches rely on Euclidean quantization, which is poorly aligned with the angular geometry induced by contrastive embedding training and often requires long identifier sequences to preserve semantic fidelity. In this work, we propose Hyperspherical Householder Quantization (HHQ), a geometry-aware distillation method that compresses large embeddings into short discrete representations via iterative Householder transformations on the unit hypersphere. By explicitly preserving cosine similarity at each step, HHQ distills semantic structure into compact identifiers that remain faithful to the original embedding space. To support reliable generation of these identifiers, we introduce constrained supervised fine-tuning and tree-aware dynamic masking to enforce structural validity during training and inference. Experiments on NQ and MS MARCO show that HHQ achieves competitive or superior retrieval performance using only five tokens per document, substantially reducing decoding cost while retaining strong semantic retrieval accuracy.
Long context large language models exhibit the “lost in the middle” problem, where models struggle to effectively utilize information located in the middle of long contexts. Although existing workflow-based long context methods (e.g., RAG) alleviate this problem and perform well on specific datasets, can their effectiveness generalize to all types of datasets? In this work, we systematically investigate the cross-dataset generalization of long context methods. Our evaluation reveals that these methods are not universally effective. Such substantial performance variability underscores the risks of performance degradation associated with the indiscriminate application of long context methods. We investigated the reason for the failure of long context methods. We found that the intrinsic decomposition mechanisms of long context methods hinder context dependency modeling, causing these methods to suffer performance declines on documents with strong context dependency. To address this issue, We propose CoDaR (**Co**ntext **D**ependency-**a**ware **R**outing), a training-free adaptive routing strategy. By analyzing the context dependency strength of documents, CoDaR adaptively invokes long context methods, thereby significantly enhancing their overall robustness across different types of datasets.

2025

Generative large language models ( LLMs) have achieved remarkable success in various industrial applications, owing to their promising In-Context Learning capabilities. However, the issue of long context in complex tasks poses a significant barrier to their wider adoption, manifested in two main aspects: (i) The excessively long context leads to high costs and inference delays. (ii) A substantial amount of task-irrelevant information introduced by long contexts exacerbates the “lost in the middle” problem. Existing methods compress context by removing redundant tokens using metrics such as self-information or perplexity ( PPL ), which is inconsistent with the objective of retaining the most important tokens when conditioning on a given query. In this study, we introduce information bottleneck theory (IB) to model the problem, offering a novel perspective that thoroughly addresses the essential properties required for context compression. Additionally, we propose a cross-attention-based approach to approximate mutual information in IB, which can be flexibly replaced with suitable alternatives in different scenarios. Extensive experiments on four datasets demonstrate that our method achieves a 25% increase in compression rate compared to the state-of-the-art, while maintaining question answering performance. In particular, the context compressed by our method even outperform the full context in some cases.
With the extensive use of large language models, automatically generating QA datasets for domain-specific fine-tuning has become crucial. However, considering the multifaceted demands for readability, diversity, and comprehensiveness of QA data, current methodologies fall short in producing high-quality QA datasets. Moreover, the dependence of existing evaluation metrics on ground truth labels further exacerbates the challenges associated with the selection of QA data. In this paper, we introduce a novel method for QA data generation, denoted as MDPO. We proposes a set of unsupervised evaluation metrics for QA data, enabling multidimensional assessment based on the relationships among context,question and answer. Furthermore, leveraging these metrics, we implement a customized direct preference optimization process that guides large language models to produce high-quality and domain-specific QA pairs. Empirical results on public datasets indicate that MDPO’s performance substantially surpasses that of state-of-the-art methods.