@inproceedings{li-etal-2025-conan,
title = "Conan-Embedding-v2: Training an {LLM} from Scratch for Text Embeddings",
author = "Li, Shiyu and
Tang, Yang and
Liu, Ruijie and
Chen, Shi-Zhe and
Chen, Xi",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-luhme/2025.emnlp-main.758/",
doi = "10.18653/v1/2025.emnlp-main.758",
pages = "15011--15027",
ISBN = "979-8-89176-332-6",
abstract = "Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. Previous work usually use LoRA to fine-tune existing LLMs, which are limited by the data and training gap between LLMs and embedding models. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. First, we add news data and multilingual pairs for LLM pretraining to bridge the data gap. Based on this, we propose a cross-lingual retrieval dataset that enables the LLM to better integrate embeddings across different languages. Second, whereas LLMs use a causal mask with token-level loss, embedding models use a bidirectional mask with sentence-level loss. This training gap makes full fine-tuning less effective than LoRA. We introduce a soft-masking mechanism to gradually transition between these two types of masks, enabling the model to learn more comprehensive representations. Based on this, we propose a dynamic hard negative mining method that exposes the model to more difficult negative examples throughout the training process. Being intuitive and effective, with only approximately 1.4B parameters, Conan-embedding-v2 achieves SOTA performance on both the Massive Text Embedding Benchmark (MTEB) and Chinese MTEB (May 19, 2025)."
}