On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks

Stephen Mussmann; Robin Jia; Percy Liang

doi:10.18653/v1/2020.findings-emnlp.305

Abstract

Many pairwise classification tasks, such as paraphrase detection and open-domain question answering, naturally have extreme label imbalance (e.g., 99.99% of examples are negatives). In contrast, many recent datasets heuristically choose examples to ensure label balance. We show that these heuristics lead to trained models that generalize poorly: state-of-the-art models trained on QQP and WikiQA each have only 2.4% average precision when evaluated on realistically imbalanced test data. We instead collect training data with active learning, using a BERT-based embedding model to efficiently retrieve uncertain points from a very large pool of unlabeled utterance pairs. By creating balanced training data with more informative negative examples, active learning greatly improves average precision to 32.5% on QQP and 20.1% on WikiQA.
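
As an illustration of the two ideas the abstract relies on, the following minimal Python sketch computes average precision over a ranked list and picks the most uncertain items from a pool. It is not the authors' implementation: the function names `average_precision` and `most_uncertain`, the toy scores, and the "closest to 0.5" uncertainty rule are assumptions chosen for illustration, and the embedding-based retrieval described in the abstract is not reproduced here.

```python
def average_precision(labels, scores):
    """Mean of precision@k taken at the rank k of each positive example."""
    # Rank examples from highest to lowest score (stable for ties).
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    # By convention here, return 0.0 when there are no positives.
    return precision_sum / hits if hits else 0.0


def most_uncertain(probs, k):
    """Indices of the k pool items whose predicted probability is closest to 0.5.

    This is generic uncertainty sampling (an assumption for illustration),
    done by brute force over the whole pool.
    """
    order = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return order[:k]


if __name__ == "__main__":
    # Toy imbalanced pool: one positive among 1000 negatives. The positive
    # outscores 998 of the negatives, yet two negatives rank above it.
    labels = [0, 0, 1] + [0] * 998
    scores = [0.99, 0.98, 0.97] + [0.10] * 998
    print(average_precision(labels, scores))  # single positive at rank 3

    # Hypothetical model probabilities for five unlabeled pairs.
    print(most_uncertain([0.01, 0.48, 0.95, 0.60, 0.50], k=2))
```

The toy pool shows why average precision is a demanding metric under extreme imbalance: a handful of highly scored negatives is enough to pull the score down sharply, even when almost all negatives are ranked correctly.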

Anthology ID: 2020.findings-emnlp.305
Volume: Findings of the Association for Computational Linguistics: EMNLP 2020
Month: November
Year: 2020
Address: Online
Editors: Trevor Cohn, Yulan He, Yang Liu
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3400–3413
URL: https://preview.aclanthology.org/fix-sig-urls/2020.findings-emnlp.305/
DOI: 10.18653/v1/2020.findings-emnlp.305
Cite (ACL): Stephen Mussmann, Robin Jia, and Percy Liang. 2020. On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3400–3413, Online. Association for Computational Linguistics.
Cite (Informal): On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks (Mussmann et al., Findings 2020)
PDF: https://preview.aclanthology.org/fix-sig-urls/2020.findings-emnlp.305.pdf
Code: worksheets/0x39ba5559
Data: GLUE, WikiQA
