TopGuNN: Fast NLP Training Data Augmentation using Large Corpora

Rebecca Iglesias-Flores; Megha Mishra; Ajay Patel; Akanksha Malhotra; Reno Kriz; Martha Palmer; Chris Callison-Burch

doi:10.18653/v1/2021.dash-1.14

Abstract

Acquiring training data for natural language processing systems can be expensive and time-consuming. Given a few training examples crafted by experts, large corpora can be mined for thousands of semantically similar examples that provide useful variability to improve model generalization. We present TopGuNN, a fast contextualized k-NN retrieval system that can efficiently index and search over contextual embeddings generated from large corpora. TopGuNN is demonstrated for a training data augmentation use case over the Gigaword corpus. Using approximate k-NN and an efficient architecture, TopGuNN performs queries over an embedding space of 4.63TB (approximately 1.5B embeddings) in less than a day.
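
The retrieval idea in the abstract can be illustrated with a minimal sketch. It is not the TopGuNN implementation: it uses exact brute-force cosine search rather than approximate k-NN, and the three-dimensional vectors, the labels, and the names `cosine` and `top_k` are hypothetical stand-ins for contextual embeddings and an index built over a large corpus.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query, index, k):
    """Return the k (score, label) pairs most similar to the query.

    `index` is a list of (label, vector) pairs. This is an exact scan;
    a system at corpus scale would use an approximate index instead.
    """
    scored = ((cosine(query, vec), label) for label, vec in index)
    return heapq.nlargest(k, scored)


# Toy "embeddings" (assumed values, for illustration only).
index = [
    ("doc1:attack", [0.9, 0.1, 0.0]),
    ("doc2:assault", [0.8, 0.2, 0.1]),
    ("doc3:banana", [0.0, 0.1, 0.9]),
]
query = [1.0, 0.0, 0.0]

results = top_k(query, index, k=2)
for score, label in results:
    print(f"{label}\t{score:.3f}")
```

In an augmentation setting, the query vector would come from a word in an expert-written example, and the retrieved neighbors would point back to corpus sentences that could be reviewed as new training examples.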

Anthology ID: 2021.dash-1.14
Volume: Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances
Month: June
Year: 2021
Address: Online
Editors: Eduard Dragut, Yunyao Li, Lucian Popa, Slobodan Vucetic
Venue: DaSH
Publisher: Association for Computational Linguistics
Pages: 86–101
URL: https://aclanthology.org/2021.dash-1.14
DOI: 10.18653/v1/2021.dash-1.14
Cite (ACL): Rebecca Iglesias-Flores, Megha Mishra, Ajay Patel, Akanksha Malhotra, Reno Kriz, Martha Palmer, and Chris Callison-Burch. 2021. TopGuNN: Fast NLP Training Data Augmentation using Large Corpora. In Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances, pages 86–101, Online. Association for Computational Linguistics.
Cite (Informal): TopGuNN: Fast NLP Training Data Augmentation using Large Corpora (Iglesias-Flores et al., DaSH 2021)
PDF: https://preview.aclanthology.org/naacl-24-ws-corrections/2021.dash-1.14.pdf
Code: penn-topgunn/topgunn
