Tan Yu


2025

EKRAG: Benchmark RAG for Enterprise Knowledge Question Answering
Tan Yu | Wenfei Zhou | Leiyang Leiyang | Aaditya Shukla | Mmadugula Mmadugula | Pritam Gundecha | Nicholas Burnett | Anbang Xu | Viseth Viseth | Tbar Tbar | Rama Akkiraju | Vivienne Zhang
Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing

Retrieval-augmented generation (RAG) offers a robust solution for developing enterprise internal virtual assistants by leveraging domain-specific knowledge and utilizing information from frequently updated corporate document repositories. In this work, we introduce the Enterprise-Knowledge RAG (EKRAG) dataset to benchmark RAG for enterprise knowledge question-answering (QA) across a diverse range of corporate documents, such as product releases, technical blogs, and financial reports. Using EKRAG, we systematically evaluate various retrieval models and strategies tailored for corporate content. We propose novel embedding-model (EM)-as-judge and ranking-model (RM)-as-judge approaches to assess answer quality in the context of enterprise information. Combining these with the existing LLM-as-judge method, we then comprehensively evaluate the correctness, relevance, and faithfulness of generated answers to corporate queries. Our extensive experiments shed light on optimizing RAG pipelines for enterprise knowledge QA, providing valuable guidance for practitioners. This work contributes to enhancing information retrieval and question-answering capabilities in corporate environments that demand high degrees of factuality and context-awareness.
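As a rough illustration of the EM-as-judge idea described above, the sketch below scores a generated answer against a reference answer by cosine similarity in an embedding space. The use of sentence-transformers and the all-MiniLM-L6-v2 model are assumptions for illustration, not the embedding model or scoring protocol used in the paper.

```python
# Hypothetical sketch of an embedding-model (EM)-as-judge score: compare the
# generated answer with a reference answer via cosine similarity of embeddings.
# Model choice is an assumption, not the paper's setup.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf embedder

def em_as_judge(generated_answer: str, reference_answer: str) -> float:
    """Return a rough correctness proxy in [-1, 1]: cosine similarity of answer embeddings."""
    gen_vec, ref_vec = embedder.encode([generated_answer, reference_answer],
                                       convert_to_tensor=True)
    return float(util.cos_sim(gen_vec, ref_vec))

score = em_as_judge(
    "RAG grounds the answer in documents retrieved from the corporate repository.",
    "Retrieval-augmented generation answers queries using retrieved enterprise documents.")
print(f"EM-as-judge similarity: {score:.3f}")
```

An RM-as-judge score could be obtained analogously by feeding the query-answer pair to a cross-encoder ranking model and using its relevance score; the exact judging protocol is described in the paper.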

2022

Cross-Lingual Cross-Modal Consolidation for Effective Multilingual Video Corpus Moment Retrieval
Jiaheng Liu | Tan Yu | Hanyu Peng | Mingming Sun | Ping Li
Findings of the Association for Computational Linguistics: NAACL 2022

Existing multilingual video corpus moment retrieval (mVCMR) methods are mainly based on a two-stream structure. The visual stream utilizes the visual content in the video to estimate the query-visual similarity, and the subtitle stream exploits the query-subtitle similarity. The final query-video similarity ensembles similarities from the two streams. In our work, we propose a simple and effective strategy termed Cross-lingual Cross-modal Consolidation (C3) to improve mVCMR accuracy. We adopt the ensemble similarity as the teacher to guide the training of each stream, leading to a more powerful ensemble similarity. Meanwhile, we use the teacher for a specific language to guide the student for another language to exploit the complementary knowledge across languages. Extensive experiments on the mTVR dataset demonstrate the effectiveness of our C3 method.
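A minimal sketch of the consolidation idea, assuming an in-batch distillation setup: the ensemble query-video similarity acts as a teacher distribution that each individual stream is trained to match. The KL-based loss form, equal stream weighting, and temperature are assumptions, not necessarily the paper's exact objective.

```python
# Hedged sketch: distill each stream's query-video similarity toward the
# ensemble (visual + subtitle) similarity, treated as a fixed teacher.
import torch
import torch.nn.functional as F

def consolidation_loss(sim_visual: torch.Tensor,
                       sim_subtitle: torch.Tensor,
                       temperature: float = 1.0) -> torch.Tensor:
    """sim_* are (num_queries, num_videos) similarity matrices from each stream."""
    sim_ensemble = 0.5 * (sim_visual + sim_subtitle)               # teacher similarity
    teacher = F.softmax(sim_ensemble.detach() / temperature, dim=-1)
    loss_v = F.kl_div(F.log_softmax(sim_visual / temperature, dim=-1),
                      teacher, reduction="batchmean")
    loss_s = F.kl_div(F.log_softmax(sim_subtitle / temperature, dim=-1),
                      teacher, reduction="batchmean")
    return loss_v + loss_s
```

The cross-lingual part of C3 would apply the same recipe across languages, using the teacher computed for one language to guide the student stream of another.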

2021

Inflate and Shrink: Enriching and Reducing Interactions for Fast Text-Image Retrieval
Haoliang Liu | Tan Yu | Ping Li
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

By exploiting the cross-modal attention, cross-BERT methods have achieved state-of-the-art accuracy in cross-modal retrieval. Nevertheless, the heavy text-image interactions in the cross-BERT model are prohibitively slow for large-scale retrieval. Late-interaction methods trade off retrieval accuracy and efficiency by exploiting cross-modal interaction only in the late stage, attaining a satisfactory retrieval speed. In this work, we propose an inflating and shrinking approach to further boost the efficiency and accuracy of late-interaction methods. The inflating operation plugs several codes into the input of the encoder to exploit the text-image interactions more thoroughly for higher retrieval accuracy. Then the shrinking operation gradually reduces the text-image interactions through knowledge distillation for higher efficiency. Through an inflating operation followed by a shrinking operation, both the efficiency and accuracy of a late-interaction model are boosted. Systematic experiments on public benchmarks demonstrate the effectiveness of our inflating and shrinking approach.
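The sketch below illustrates only the "inflate" step, under the assumption that a small number of learnable code vectors are prepended to the token embeddings before they enter the encoder; the module name, dimensions, and encoder interface are hypothetical and not taken from the paper.

```python
# Illustrative sketch of the "inflate" step: prepend learnable code vectors to
# the token embeddings so the encoder's interactions can also attend over them.
import torch
import torch.nn as nn

class InflatedEncoder(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int = 768, num_codes: int = 8):
        super().__init__()
        self.encoder = encoder                       # any sequence encoder (e.g. a transformer)
        self.codes = nn.Parameter(torch.randn(num_codes, hidden_dim) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, hidden_dim)
        batch = token_embeddings.size(0)
        inflated = torch.cat([self.codes.unsqueeze(0).expand(batch, -1, -1),
                              token_embeddings], dim=1)
        return self.encoder(inflated)                # (batch, num_codes + seq_len, hidden_dim)
```

The "shrink" step would then distill this inflated model into one with fewer text-image interactions, along the lines of the distillation loss sketched for C3 above.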

Cross-lingual Cross-modal Pretraining for Multimodal Retrieval
Hongliang Fei | Tan Yu | Ping Li
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Recent pretrained vision-language models have achieved impressive performance on cross-modal retrieval tasks in English. Their success, however, heavily depends on the availability of many annotated image-caption datasets for pretraining, where the texts are not necessarily in English. Although we can utilize machine translation (MT) tools to translate non-English text to English, the performance still largely relies on MT’s quality and may suffer from high latency problems in real-world applications. This paper proposes a new approach to learn cross-lingual cross-modal representations for matching images and their relevant captions in multiple languages. We seamlessly combine cross-lingual pretraining objectives and cross-modal pretraining objectives in a unified framework to learn image and text in a joint embedding space from available English image-caption data, monolingual and parallel corpora. We show that our approach achieves SOTA performance in retrieval tasks on two multimodal multilingual image caption benchmarks: Multi30k with German captions and MSCOCO with Japanese captions.
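As a hedged sketch of how cross-modal and cross-lingual objectives can share one joint embedding space, the code below combines an image-caption contrastive loss with a caption-translation contrastive loss. The InfoNCE form, temperature, and equal weighting are illustrative assumptions, not the paper's exact pretraining objectives.

```python
# Sketch: tie images, English captions, and translated captions together in a
# single embedding space via two contrastive terms (assumed loss form).
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric in-batch contrastive loss over paired embeddings of shape (batch, dim)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def joint_pretraining_loss(image_emb, en_caption_emb, translated_caption_emb):
    cross_modal = info_nce(image_emb, en_caption_emb)                  # image <-> English caption
    cross_lingual = info_nce(en_caption_emb, translated_caption_emb)   # English <-> other language
    return cross_modal + cross_lingual
```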