Training LLMs to be Better Text Embedders through Bidirectional Reconstruction

Chang Su, Dengliang Shi, Siyuan Huang, Jintao Du, Changhua Meng, Yu Cheng, Weiqiang Wang, Zhouhan Lin


Abstract
Large language models (LLMs) have increasingly been explored as powerful text embedders. Existing LLM-based text embedding approaches often leverage the embedding of the final token, typically a reserved special token such as [EOS]. However, these tokens have not been intentionally trained to capture the semantics of the whole context, limiting their capacity as text embeddings, especially for retrieval and re-ranking tasks. We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), which interleave to anchor the [EOS] embedding and reconstruct either side of Query-Document pairs. Experimental results demonstrate that our additional training stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.
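The abstract only names the two reconstruction objectives; the sketch below illustrates one plausible reading of EBQ2D/EBD2Q in PyTorch, assuming the [EOS] hidden state is injected as a single soft-prefix token via `inputs_embeds`. The base model name, the prefixing mechanism, and the loss combination are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of the bidirectional reconstruction stage (EBQ2D / EBD2Q).
# Assumptions: the [EOS] embedding conditions generation as a single soft
# prefix token; model choice and loss weighting are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3-8B"  # placeholder base LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def eos_embedding(text: str) -> torch.Tensor:
    """Encode `text` and return the final-token ([EOS]) hidden state."""
    ids = tokenizer(text + tokenizer.eos_token, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    return out.hidden_states[-1][:, -1, :]            # shape (1, hidden_dim)

def reconstruction_loss(anchor_text: str, target_text: str) -> torch.Tensor:
    """Anchor one side's [EOS] embedding and reconstruct the other side."""
    prefix = eos_embedding(anchor_text).unsqueeze(1)   # (1, 1, hidden_dim)
    tgt = tokenizer(target_text, return_tensors="pt")
    tgt_embeds = model.get_input_embeddings()(tgt.input_ids)
    inputs_embeds = torch.cat([prefix, tgt_embeds], dim=1)
    # Mask the prefix position out of the language-modeling loss.
    labels = torch.cat(
        [torch.full((1, 1), -100, dtype=torch.long), tgt.input_ids], dim=1
    )
    return model(inputs_embeds=inputs_embeds, labels=labels).loss

# Interleave the two directions for each (query, document) pair,
# then follow with the usual contrastive-learning stage.
query = "what is a text embedding?"
document = "A text embedding is a dense vector representation of text."
loss = reconstruction_loss(query, document) + reconstruction_loss(document, query)
loss.backward()
```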
Anthology ID:
2025.emnlp-main.216
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
4351–4369
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.216/
Cite (ACL):
Chang Su, Dengliang Shi, Siyuan Huang, Jintao Du, Changhua Meng, Yu Cheng, Weiqiang Wang, and Zhouhan Lin. 2025. Training LLMs to be Better Text Embedders through Bidirectional Reconstruction. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4351–4369, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Training LLMs to be Better Text Embedders through Bidirectional Reconstruction (Su et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.216.pdf
Checklist:
 2025.emnlp-main.216.checklist.pdf