Jinsong Ni
2025
Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval
Hao Sun | Yingyan Hou | Jiayan Guo | Bo Wang | Chunyu Yang | Jinsong Ni | Yan Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Document retrieval in real-world scenarios faces significant challenges due to diverse document formats and modalities. Traditional text-based approaches rely on tailored parsing techniques that disregard layout information and are prone to errors, while recent parsing-free visual methods often struggle to capture fine-grained textual semantics in text-rich scenarios. To address these limitations, we propose Unveil, a novel visual-textual embedding framework that effectively integrates textual and visual features for robust document representation. Through knowledge distillation, we transfer the semantic understanding capabilities from the visual-textual embedding model to a purely visual model, enabling efficient parsing-free retrieval while preserving semantic fidelity. Experimental results demonstrate that our visual-textual embedding method surpasses existing approaches, while knowledge distillation successfully bridges the performance gap between visual-textual and visual-only methods, improving both retrieval accuracy and efficiency.