Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering

J. Edward Hu; Abhinav Singh; Nils Holzenberger; Matt Post; Benjamin Van Durme

doi:10.18653/v1/K19-1005

Abstract

Producing diverse paraphrases of a sentence is a challenging task. Natural paraphrase corpora are scarce and limited, while existing large-scale resources are automatically generated via back-translation and rely on beam search, which tends to lack diversity. We describe ParaBank 2, a new resource that contains multiple diverse sentential paraphrases, produced from a bilingual corpus using negative constraints, inference sampling, and clustering. We show that ParaBank 2 significantly surpasses prior work in both lexical and syntactic diversity while being meaning-preserving, as measured by human judgments and standardized metrics. Further, we illustrate how such paraphrastic resources may be used to refine contextualized encoders, leading to improvements in downstream tasks.

Anthology ID: K19-1005
Volume: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
Month: November
Year: 2019
Address: Hong Kong, China
Venue: CoNLL
SIG: SIGNLL
Publisher: Association for Computational Linguistics
Pages: 44–54
URL: https://aclanthology.org/K19-1005
DOI: 10.18653/v1/K19-1005
Cite (ACL): J. Edward Hu, Abhinav Singh, Nils Holzenberger, Matt Post, and Benjamin Van Durme. 2019. Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 44–54, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal): Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering (Hu et al., CoNLL 2019)
PDF: https://preview.aclanthology.org/ingestion-script-update/K19-1005.pdf
