Diffusion-Pretrained Dense and Contextual Embeddings

Sedigheh Eslami, Maksim Gaiduk, Markus Krimmel, Louis Mark Milliken, Bo Wang, Denis Bykov


Abstract
We introduce pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval. By leveraging bidirectional attention through diffusion-based pretraining, our models capture comprehensive bidirectional context within passages, enabling the use of mean pooling to better preserve global context across long documents. We release pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark.
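The mean pooling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mean_pool`, the list-based inputs, and the toy values are assumptions chosen for illustration. It only shows the general idea of averaging the embeddings of non-padding tokens into a single passage vector.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]],
              attention_mask: List[int]) -> List[float]:
    """Average the embeddings of tokens whose mask value is non-zero.

    Hypothetical helper for illustration; real models operate on batched
    tensors, but the arithmetic is the same.
    """
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("embeddings and mask must have the same length")

    # Keep only real tokens; padding positions carry mask value 0.
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("at least one token must be unmasked")

    dim = len(kept[0])
    # Component-wise average over the kept token vectors.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: two real tokens and one padding token (assumed values).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Because every unmasked token contributes equally to the average, this kind of pooling relies on each token representation already reflecting context from both directions, which is the property the abstract attributes to bidirectional attention.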
Anthology ID:
2026.acl-industry.69
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Yunyao Li, Georg Rehm, Mei Tu
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
990–1004
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-industry.69/
Cite (ACL):
Sedigheh Eslami, Maksim Gaiduk, Markus Krimmel, Louis Mark Milliken, Bo Wang, and Denis Bykov. 2026. Diffusion-Pretrained Dense and Contextual Embeddings. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 990–1004, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Diffusion-Pretrained Dense and Contextual Embeddings (Eslami et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-industry.69.pdf