Hai An Vu
2026
MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining
Phung Gia Huy | Hai An Vu | Minh-Phuc Truong | Thang Duc Tran | Linh Ngo Van | Thanh Hong Nguyen | Trung Le
Findings of the Association for Computational Linguistics: ACL 2026
Phung Gia Huy | Hai An Vu | Minh-Phuc Truong | Thang Duc Tran | Linh Ngo Van | Thanh Hong Nguyen | Trung Le
Findings of the Association for Computational Linguistics: ACL 2026
Representation learning is fundamental to NLP, but building embeddings that work well at different computational budgets is challenging. Matryoshka Representation Learning (MRL) offers a flexible inference paradigm through nested embeddings; however, learning such structures requires explicit coordination of how information is arranged across embedding dimensionality and model depth. In this work, we propose MIPIC (Matryoshka Representation Learning via Self-Distilled Intra-Relational Alignment and Progressive Information Chaining), a unified training framework designed to produce structurally coherent and semantically compact Matryoshka representations. MIPIC promotes cross-dimensional structural consistency through Self-Distilled Intra-Relational Alignment (SIA), which aligns token-level geometric and attention-driven relations between full and truncated representations using top-k CKA self-distillation. Complementarily, it enables depth-wise semantic consolidation via Progressive Information Chaining (PIC), a scaffolded alignment strategy that incrementally transfers mature task semantics from deeper layers into earlier layers. Extensive experiments on STS, NLI, and classification benchmarks (spanning models from TinyBERT to BGEM3, Qwen3) demonstrate that MIPIC yields Matryoshka representations that are highly competitive across all capacities, with significant performance advantages observed under extreme low-dimensional.
2025
EMO: Embedding Model Distillation via Intra-Model Relation and Optimal Transport Alignments
Minh-Phuc Truong | Hai An Vu | Tu Vu | Nguyen Thi Ngoc Diep | Linh Ngo Van | Thien Huu Nguyen | Trung Le
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Minh-Phuc Truong | Hai An Vu | Tu Vu | Nguyen Thi Ngoc Diep | Linh Ngo Van | Thien Huu Nguyen | Trung Le
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Knowledge distillation (KD) is crucial for compressing large text embedding models, but faces challenges when teacher and student models use different tokenizers (Cross-Tokenizer KD - CTKD). Vocabulary mismatches impede the transfer of relational knowledge encoded in deep representations, such as hidden states and attention matrices, which are vital for producing high-quality embeddings. Existing CTKD methods often focus on direct output alignment, neglecting this crucial structural information. We propose a novel framework tailored for CTKD embedding model distillation. We first map tokens one-to-one via Minimum Edit Distance (MinED). Then, we distill intra-model relational knowledge by aligning attention matrix patterns using Centered Kernel Alignment, focusing on the top-m most important tokens of the directly mapped tokens. Simultaneously, we align final hidden states via Optimal Transport with Importance-Scored Mass Assignment, which emphasizes semantically important token representations, based on importance scores derived from attention weights. We evaluate distillation from state-of-the-art embedding models (e.g., LLM2Vec, BGE) to a Bert-base-uncased model on embedding-reliant tasks such as text classification, sentence pair classification, and semantic textual similarity. Our proposed framework significantly outperforms existing CTKD baselines. By preserving attention structure and prioritizing key representations, our approach yields smaller, high-fidelity embedding models despite tokenizer differences.