2023
RobustQA: Benchmarking the Robustness of Domain Adaptation for Open-Domain Question Answering
Rujun Han | Peng Qi | Yuhao Zhang | Lan Liu | Juliette Burger | William Yang Wang | Zhiheng Huang | Bing Xiang | Dan Roth
Findings of the Association for Computational Linguistics: ACL 2023
Open-domain question answering (ODQA) is a crucial task in natural language processing. A typical ODQA system relies on a retriever module to select relevant contexts from a large corpus for a downstream reading comprehension model. Existing ODQA datasets are built mainly from the Wikipedia corpus and are insufficient for studying models’ generalizability across diverse domains, as models are trained and evaluated on the same genre of data. We propose **RobustQA**, a novel benchmark consisting of datasets from 8 different domains, which facilitates the evaluation of ODQA’s domain robustness. To build **RobustQA**, we annotate QA pairs in retrieval datasets with rigorous quality control. We further examine improving QA performance by incorporating unsupervised learning methods with target-domain corpora and by adopting large generative language models. These methods can effectively improve model performance on **RobustQA**. However, experimental results demonstrate a significant gap from in-domain training, suggesting that **RobustQA** is a challenging benchmark for evaluating ODQA domain robustness.
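The retrieve-then-read pipeline described in the abstract can be sketched in a few lines; this is a minimal illustration assuming the rank_bm25 and transformers packages, with a toy corpus and the default SQuAD-tuned reader standing in for the paper's actual setup.

```python
# Minimal retrieve-then-read ODQA sketch (illustrative, not RobustQA's code).
import numpy as np
from rank_bm25 import BM25Okapi          # sparse lexical retriever
from transformers import pipeline        # extractive reading-comprehension model

# Toy corpus standing in for a large domain-specific collection.
corpus = [
    "Aspirin is commonly used to reduce fever and relieve mild pain.",
    "The Amazon rainforest spans nine countries in South America.",
    "Photosynthesis converts light energy into chemical energy in plants.",
]
retriever = BM25Okapi([doc.lower().split() for doc in corpus])

question = "What is aspirin used for?"
# Stage 1: the retriever selects the most relevant context for the question.
scores = retriever.get_scores(question.lower().split())
context = corpus[int(np.argmax(scores))]

# Stage 2: the reader extracts an answer span from the retrieved context.
reader = pipeline("question-answering")  # defaults to a SQuAD-tuned model
print(reader(question=question, context=context)["answer"])
```

Domain robustness in this setting means both stages must transfer: the retriever must rank passages well, and the reader must extract answers accurately, on genres unseen during training.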
Improving Cross-task Generalization of Unified Table-to-text Models with Compositional Task Configurations
Jifan Chen | Yuhao Zhang | Lan Liu | Rui Dong | Xinchi Chen | Patrick Ng | William Yang Wang | Zhiheng Huang
Findings of the Association for Computational Linguistics: ACL 2023
There has been great progress in unifying various table-to-text tasks using a single encoder-decoder model trained via multi-task learning (Xie et al., 2022). However, existing methods typically encode task information with a simple dataset name as a prefix to the encoder. This not only limits the effectiveness of multi-task learning, but also hinders the model’s ability to generalize to new domains or tasks that were not seen during training, which is crucial for real-world applications. In this paper, we propose compositional task configurations, a set of prompts prepended to the encoder to improve cross-task generalization of unified models. We design the task configurations to explicitly specify the task type, as well as its input and output types. We show that this not only allows the model to better learn shared knowledge across different tasks at training time, but also allows us to control the model by composing new configurations that apply novel input-output combinations in a zero-shot manner. We demonstrate via experiments over ten table-to-text tasks that our method outperforms the UnifiedSKG baseline by noticeable margins in both in-domain and zero-shot settings, with average improvements of +0.5 and +12.6, respectively, using a T5-large backbone.
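The configuration idea can be made concrete with a small helper that composes the encoder prefix from its parts; the prompt syntax and table linearization below are hypothetical stand-ins, not the paper's exact format.

```python
# Hypothetical sketch of compositional task configurations: instead of a bare
# dataset name, the prefix spells out the task type and its input/output types.
def build_encoder_input(task_type: str, input_type: str, output_type: str,
                        linearized_table: str, question: str = "") -> str:
    config = f"task: {task_type} ; input: {input_type} ; output: {output_type}"
    query = f" ; question: {question}" if question else ""
    return f"{config}{query} ; table: {linearized_table}"

# A combination seen at training time (table question answering) ...
seen = build_encoder_input(
    "qa", "table", "answer",
    "col: city | population row: Berlin | 3.6M",
    question="What is the population of Berlin?",
)

# ... and a novel composition of the same primitives, usable zero-shot.
novel = build_encoder_input(
    "summarization", "table", "text",
    "col: team | wins row: Ajax | 36",
)
print(seen)
print(novel)
```

Because each configuration decomposes into reusable pieces, new input-output combinations can be assembled at inference time without retraining, which is what enables the zero-shot control described above.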
Hybrid Hierarchical Retrieval for Open-Domain Question Answering
Manoj Ghuhan Arivazhagan | Lan Liu | Peng Qi | Xinchi Chen | William Yang Wang | Zhiheng Huang
Findings of the Association for Computational Linguistics: ACL 2023
Retrieval accuracy is crucial to the performance of open-domain question answering (ODQA) systems. Recent work has demonstrated that dense hierarchical retrieval (DHR), which first retrieves document candidates and then retrieves relevant passages from the refined document set, can significantly outperform the single-stage dense passage retriever (DPR). While effective, this approach requires document structure information to learn document representations and is hard to adapt to domains where this information is unavailable. Additionally, dense retrievers tend to generalize poorly to out-of-domain data compared with sparse retrievers such as BM25. In this paper, we propose Hybrid Hierarchical Retrieval (HHR) to address these limitations. Instead of relying solely on dense retrievers, we apply a sparse retriever, a dense retriever, or a combination of the two at both the document and passage retrieval stages. We perform extensive experiments on ODQA benchmarks and observe that our framework not only brings in-domain gains, but also generalizes better to the zero-shot TriviaQA and Web Questions datasets, with an average improvement of 4.69% on recall@100 over DHR. We also offer practical insights on trading off retrieval accuracy, latency, and storage cost. The code is available on GitHub.
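The two-stage hybrid scheme can be illustrated with a short score-fusion sketch; the min-max normalization, the interpolation weight, and the toy scores below are assumptions for illustration rather than the paper's exact fusion rule.

```python
# Hybrid hierarchical retrieval sketch: stage 1 narrows the corpus to candidate
# documents, stage 2 ranks passages within them; each stage fuses sparse and
# dense scores. The fusion rule and all scores here are illustrative.
import numpy as np

def min_max(x: np.ndarray) -> np.ndarray:
    """Rescale scores to [0, 1] so sparse and dense scales are comparable."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def hybrid_top_k(sparse: np.ndarray, dense: np.ndarray,
                 alpha: float = 0.5, k: int = 2) -> np.ndarray:
    """Linearly interpolate normalized scores and return the top-k indices."""
    fused = alpha * min_max(sparse) + (1 - alpha) * min_max(dense)
    return np.argsort(-fused)[:k]

# Stage 1: document retrieval (toy BM25-like and DPR-like scores).
top_docs = hybrid_top_k(np.array([2.1, 0.3, 1.7, 0.9]),
                        np.array([0.64, 0.12, 0.71, 0.55]), k=2)

# Stage 2: passage retrieval, restricted to passages of the retained documents.
passage_doc = np.array([0, 0, 1, 2, 2, 3])   # owning document of each passage
keep = np.isin(passage_doc, top_docs)
pas_sparse = np.array([1.9, 0.4, 0.2, 1.1, 1.5, 0.8])[keep]
pas_dense = np.array([0.58, 0.33, 0.10, 0.66, 0.72, 0.41])[keep]
print(hybrid_top_k(pas_sparse, pas_dense, k=2))  # indices within the kept subset
```

Choosing sparse, dense, or fused scoring independently at each stage is also where the accuracy, latency, and storage trade-offs arise: sparse indexes are cheap to store and tend to be domain-robust, while dense indexes capture semantic matches at higher cost.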