Zhanshuo Cao

2025

Length-Induced Embedding Collapse in PLM-based Models
Yuqi Zhou | Sunhao Dai | Zhanshuo Cao | Xiao Zhang | Jun Xu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Text embeddings from PLM-based models enable a wide range of applications, yet their performance often degrades on longer texts. In this paper, we introduce a phenomenon we call Length Collapse, where embeddings of longer texts tend to cluster together. This clustering results in a distributional inconsistency between the embeddings of short and long texts. We further investigate how these differences contribute to the performance decline observed with longer texts across various downstream tasks. Through a rigorous theoretical analysis of the self-attention mechanism, which acts as a low-pass filter in PLM-based models, we demonstrate that as text length increases, the strength of low-pass filtering intensifies, causing embeddings to retain more low-frequency components. As a result, input token features become more similar, leading to clustering and ultimately the collapse of embeddings for longer texts. To address this issue, we propose a simple method, TempScale, which mitigates the Length Collapse phenomenon. By narrowing the gap in low-pass filtering rates between long and short texts, TempScale ensures more consistent embeddings across different text lengths. This approach leads to performance improvements of 0.94% on MTEB and 1.10% on LongEmbed, which focuses specifically on long-context retrieval, providing strong evidence for the validity of our analysis. The source code is available at https://github.com/Yuqi-Zhou/Length_Collapse.

Unlike traditional search engines that present ranked lists of webpages, generative search engines rely solely on in-line citations as the key gateway to original real-world webpages, making it crucial to examine whether LLM-generated citations have biases—particularly for politically sensitive queries. To investigate this, we first construct AllSides-2024, a new dataset comprising the latest real-world news articles (Jan. 2024 - Dec. 2024) labeled with left- or right-leaning stances. Through systematic evaluations, we find that LLMs exhibit a consistent tendency to cite left-leaning sources at notably higher rates compared to traditional retrieval systems (e.g., BM25 and dense retrievers). Controlled experiments further reveal that this bias arises from a preference for media outlets identified as left-leaning, rather than for left-oriented content itself. Meanwhile, our findings show that while LLMs struggle to infer political bias from news content alone, they can almost perfectly recognize the political orientation of media outlets based on their names. These insights highlight the risk that, in the era of generative search engines, information exposure may be disproportionately shaped by specific media outlets, potentially shaping public perception and decision-making.

Co-authors

Wenjie Wang 1

Xiao Zhang (张晓) 1

Yuqi Zhou 1

Venues

ACL 1
EMNLP 1
