Shi Yu


2024

Fusion-in-T5: Unifying Variant Signals for Simple and Effective Document Ranking with Attention Fusion
Shi Yu | Chenghao Fan | Chenyan Xiong | David Jin | Zhiyuan Liu | Zhenghao Liu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Common document ranking pipelines in search systems are cascade systems that involve multiple ranking layers to integrate different information step by step. In this paper, we propose a novel re-ranker, Fusion-in-T5 (FiT5), which integrates text matching information, ranking features, and global document information into a single unified model via template-based input and global attention. Experiments on the passage ranking benchmarks MS MARCO and TREC DL show that FiT5, as a single model, significantly improves ranking performance over complex cascade pipelines. Analysis finds that, through attention fusion, FiT5 jointly utilizes various forms of ranking information by gradually attending to related documents and ranking features, and improves the detection of subtle nuances. Our code is open-sourced at https://github.com/OpenMatch/FiT5.
Keywords: document ranking, attention, fusion
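To make the template-based input concrete, here is a minimal sketch of how a query, one candidate document, and a ranking feature might be packed into a single string for the T5 encoder. The template wording, function name, and feature discretization are illustrative assumptions, not the authors' implementation (which lives in the linked repository).

```python
def build_templated_input(query: str, document: str, retrieval_score: float) -> str:
    """Pack the query, one candidate document, and a ranking feature
    (a discretized first-stage retrieval score) into one string, so a
    single T5 encoder sees all signals at once; global attention then
    lets the encoded candidates interact across the ranked list."""
    score_bin = max(0, min(int(retrieval_score), 100))  # coarse bucketing (assumed)
    return f"Query: {query} Feature: {score_bin} Document: {document}"

candidates = [("Tides are driven by the Moon's gravity ...", 31.8),
              ("A tide table lists times of high water ...", 27.3)]
batch = [build_templated_input("what causes tides", d, s) for d, s in candidates]
```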

2023

Structure-Aware Language Model Pretraining Improves Dense Retrieval on Structured Data
Xinze Li | Zhenghao Liu | Chenyan Xiong | Shi Yu | Yu Gu | Zhiyuan Liu | Ge Yu
Findings of the Association for Computational Linguistics: ACL 2023

This paper presents the Structure Aware Dense Retrieval (SANTA) model, which encodes user queries and structured data in one universal embedding space for retrieving structured data. SANTA proposes two pretraining methods to make language models structure-aware and to learn effective representations for structured data: 1) Structured Data Alignment, which exploits the natural alignment between structured and unstructured data for structure-aware pretraining; it contrastively trains language models to represent multi-modal text data and teaches them to distinguish the matched structured data for unstructured texts. 2) Masked Entity Prediction, which designs an entity-oriented masking strategy and asks language models to fill in the masked entities. Our experiments show that SANTA achieves state-of-the-art performance on code search and product search and yields convincing results in the zero-shot setting. SANTA learns tailored representations for multi-modal text data by aligning structured and unstructured data pairs and captures structural semantics by masking and predicting entities in the structured data. All code is available at https://github.com/OpenMatch/OpenMatch.
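The Structured Data Alignment objective is standard enough to sketch: a contrastive loss over aligned structured/unstructured pairs with in-batch negatives. The tensor shapes and temperature below are assumptions for illustration, not SANTA's actual training code (which is in the linked OpenMatch repository).

```python
import torch
import torch.nn.functional as F

def alignment_loss(text_emb: torch.Tensor, struct_emb: torch.Tensor,
                   temperature: float = 0.05) -> torch.Tensor:
    """text_emb, struct_emb: (batch, dim) embeddings of aligned
    unstructured/structured pairs; every other row in the batch
    serves as a negative for the contrastive objective."""
    text_emb = F.normalize(text_emb, dim=-1)
    struct_emb = F.normalize(struct_emb, dim=-1)
    logits = text_emb @ struct_emb.T / temperature   # (batch, batch) similarities
    labels = torch.arange(logits.size(0))            # matched pairs on the diagonal
    return F.cross_entropy(logits, labels)
```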

Augmentation-Adapted Retriever Improves Generalization of Language Models as Generic Plug-In
Zichun Yu | Chenyan Xiong | Shi Yu | Zhiyuan Liu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Retrieval augmentation can aid language models (LMs) in knowledge-intensive tasks by supplying them with external information. Prior works on retrieval augmentation usually jointly fine-tune the retriever and the LM, making them closely coupled. In this paper, we explore the scheme of generic retrieval plug-in: the retriever is to assist target LMs that may not be known beforehand or are unable to be fine-tuned together. To retrieve useful documents for unseen target LMs, we propose augmentation-adapted retriever (AAR), which learns LM’s preferences obtained from a known source LM. Experiments on the MMLU and PopQA datasets demonstrate that our AAR trained with a small source LM is able to significantly improve the zero-shot generalization of larger target LMs ranging from 250M Flan-T5 to 175B InstructGPT. Further analysis indicates that the preferences of different LMs overlap, enabling AAR trained with a single source LM to serve as a generic plug-in for various target LMs. Our code is open-sourced at https://github.com/OpenMatch/Augmentation-Adapted-Retriever.
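A hedged sketch of the core idea, mining retriever training positives from a source LM's preferences: the preference signal used here (gold-answer log-likelihood with the document in context) is an illustrative assumption, and the released code linked above is canonical.

```python
from typing import Callable, List

def source_lm_positives(query: str, answer: str, docs: List[str],
                        answer_loglik: Callable[[str, str], float],
                        k: int = 2) -> List[str]:
    """Keep the k documents the source LM 'prefers', i.e. those that most
    raise the likelihood of the gold answer; these become positives when
    contrastively training the augmentation-adapted retriever."""
    scored = sorted(docs,
                    key=lambda d: answer_loglik(f"{d}\n{query}", answer),
                    reverse=True)
    return scored[:k]
```

Because the positives are defined by what helps the source LM rather than by human relevance labels, the same mined data can transfer to unseen target LMs, which is the overlap the abstract's analysis points to.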

2022

Speech Aerodynamics Database, Tools and Visualisation
Shi Yu | Clara Ponchard | Roland Trouville | Sergio Hassid | Didier Demolin
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Aerodynamic processes underlie the characteristics of the acoustic signal of speech sounds. The aerodynamics of speech give insight into the acoustic outcome and help explain the mechanisms of speech production. This database was designed during an ARC project, "Dynamique des systèmes phonologiques" (Dynamics of Phonological Systems), in which the study of aerodynamic constraints on speech production was an important target. Data were recorded between 1996 and 1999 at the Erasmus Hospital (Hôpital Erasme) of the Université Libre de Bruxelles, Belgium, and constitute one of the few datasets available with direct measurements of subglottal pressure and other aerodynamic parameters. The goal was to obtain a substantial amount of data with simultaneous recording, in various contexts, of the speech acoustic signal, subglottal pressure (Ps), intraoral pressure (Po), oral airflow (Qo), and nasal airflow (Qn). The database contains recordings of 2 English, 1 Amharic, and 7 French speakers and is provided with data conversion and visualisation tools. Another aim of the project was to obtain reference values for the aerodynamics of speech production for female and male speakers uttering different types of segments and sentences in French.

MIC: A Multi-task Interactive Curation Tool
Shi Yu | Mingfeng Yang | Jerrod Parker | Stephen Brock
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

This paper introduces MIC, a Multi-task Interactive Curation tool: a human-machine collaborative curation tool for multiple NLP tasks. The tool borrows recent advances from the literature to solve pain points in real NLP tasks. First, it supports multiple projects with multiple users, enabling collaborative annotation. Second, MIC allows easy integration of pre-trained models, rules, and dictionaries to auto-label text and speed up the labeling process. Third, MIC supports annotation at different scales (spans of characters and words, tokens and lines, or whole documents) and of different types (free text, sentence labels, entity labels, and relationship triplets) with easy GUI operations.
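As one example of the auto-labeling the abstract describes, here is a hedged sketch of dictionary-based pre-annotation. The function name, label format, and gazetteer are illustrative assumptions, not MIC's actual API.

```python
import re
from typing import Dict, List, Tuple

def dictionary_prelabel(text: str, gazetteer: Dict[str, str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) character spans for every dictionary hit,
    which a human annotator can then confirm or correct in the GUI."""
    spans = []
    for surface, label in gazetteer.items():
        for m in re.finditer(re.escape(surface), text):
            spans.append((m.start(), m.end(), label))
    return sorted(spans)

print(dictionary_prelabel("Aspirin treats headaches.",
                          {"Aspirin": "DRUG", "headaches": "SYMPTOM"}))
# [(0, 7, 'DRUG'), (15, 24, 'SYMPTOM')]
```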

2021

Named Entity Recognition through Deep Representation Learning and Weak Supervision
Jerrod Parker | Shi Yu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2018

Sign Languages and the Online World: Online Dictionaries & Lexicostatistics
Shi Yu | Carlo Geraci | Natasha Abner
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)