Ziqi Dai
2025
Towards Text-Image Interleaved Retrieval
Xin Zhang | Ziqi Dai | Yongqi Li | Yanzhao Zhang | Dingkun Long | Pengjun Xie | Meishan Zhang | Jun Yu | Wenjie Li | Min Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Current multimodal information retrieval studies mainly focus on single-image inputs, which limits real-world applications involving multiple images and text-image interleaved content. In this work, we introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences, and the model is required to understand the semantics of the interleaved context for effective retrieval. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. To explore the task, we adapt several off-the-shelf retrievers and build a dense baseline with an interleaved multimodal large language model (MLLM). We then propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities, to address the challenge of excessive visual tokens in MLLM-based TIIR models. Experiments demonstrate that simple adaptation of existing models does not consistently yield effective results. Our MME achieves significant improvements over the baseline with substantially fewer visual tokens. We provide extensive analysis and will release the dataset and code to facilitate future research.
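The abstract gives no implementation details; as a rough illustration only, the snippet below sketches one way to compress a sequence of visual token embeddings to nested granularities in the spirit of a Matryoshka-style embedder. All names, shapes, and the pooling choice are assumptions, not the paper's actual method.

# Hypothetical sketch: pooling a visual token sequence to several nested
# granularities. Shapes and token counts are illustrative assumptions.
import torch
import torch.nn.functional as F

def pool_visual_tokens(visual_tokens: torch.Tensor, num_tokens: int) -> torch.Tensor:
    """Average-pool a (seq_len, dim) visual token sequence down to num_tokens tokens."""
    x = visual_tokens.T.unsqueeze(0)               # (1, dim, seq_len)
    pooled = F.adaptive_avg_pool1d(x, num_tokens)  # (1, dim, num_tokens)
    return pooled.squeeze(0).T                     # (num_tokens, dim)

# Example: 576 visual tokens compressed to progressively coarser granularities.
visual_tokens = torch.randn(576, 1024)
for k in (256, 64, 16, 4):
    print(k, pool_visual_tokens(visual_tokens, k).shape)

Each coarser granularity trades retrieval fidelity for fewer tokens, which is the general tension the abstract describes for MLLM-based TIIR models.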
2024
mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval
Xin Zhang | Yanzhao Zhang | Dingkun Long | Wen Xie | Ziqi Dai | Jialong Tang | Huan Lin | Baosong Yang | Pengjun Xie | Fei Huang | Meishan Zhang | Wenjie Li | Min Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
We present systematic efforts in building a long-context multilingual text representation model (TRM) and reranker from scratch for text retrieval. We first introduce a text encoder (base size) enhanced with RoPE and unpadding, pre-trained with a native 8192-token context (longer than the 512 tokens of previous multilingual encoders). We then construct a hybrid TRM and a cross-encoder reranker by contrastive learning. Evaluations show that our text encoder outperforms the same-sized previous state-of-the-art XLM-R. Meanwhile, our TRM and reranker match the performance of the large-sized state-of-the-art BGE-M3 models and achieve better results on long-context retrieval benchmarks. Further analysis demonstrates that our proposed models exhibit higher efficiency during both training and inference. We believe their efficiency and effectiveness could benefit various research and industrial applications.
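For readers unfamiliar with the training recipe the abstract refers to, the snippet below is a minimal, generic sketch of contrastive learning with in-batch negatives for a text embedding model; the temperature, normalization, and scoring details are assumptions rather than mGTE's actual configuration.

# Generic in-batch-negative contrastive loss sketch (assumed setup, not mGTE's code).
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q: torch.Tensor, d: torch.Tensor, temperature: float = 0.02) -> torch.Tensor:
    """q, d: (batch, dim) query/document embeddings; the positive pairs share the same index."""
    q = F.normalize(q, dim=-1)
    d = F.normalize(d, dim=-1)
    logits = q @ d.T / temperature           # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))         # i-th query matches i-th document
    return F.cross_entropy(logits, labels)

# Example usage with random embeddings standing in for encoder outputs.
loss = in_batch_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())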