Aiming beyond the Obvious: Identifying Non-Obvious Cases in Semantic Similarity Datasets

Nicole Peinelt; Maria Liakata; Dong Nguyen

doi:10.18653/v1/P19-1268

Abstract

Existing datasets for scoring text pairs in terms of semantic similarity contain instances whose resolution differs according to the degree of difficulty. This paper proposes to distinguish obvious from non-obvious text pairs based on superficial lexical overlap and ground-truth labels. We characterise existing datasets in terms of containing difficult cases and find that recently proposed models struggle to capture the non-obvious cases of semantic similarity. We describe metrics that emphasise cases of similarity which require more complex inference and propose that these are used for evaluating systems for semantic similarity.
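The split described in the abstract can be illustrated with a minimal sketch. It assumes binary labels (1 = similar, 0 = not similar), uses Jaccard token overlap as a stand-in for the lexical overlap measure, and takes the median overlap as the default cut-off. None of these choices are confirmed by the abstract, and the function names `jaccard_overlap` and `split_obvious` are hypothetical.

```python
from statistics import median


def jaccard_overlap(a: str, b: str) -> float:
    """Token-level Jaccard overlap; a stand-in for a lexical overlap measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


def split_obvious(pairs, labels, threshold=None):
    """Tag each text pair as 'obvious' or 'non-obvious'.

    Assumption: a pair is obvious when surface overlap agrees with the label
    (high overlap and similar, or low overlap and dissimilar), and non-obvious
    otherwise. The median default threshold is also an assumption.
    """
    overlaps = [jaccard_overlap(a, b) for a, b in pairs]
    if threshold is None:
        threshold = median(overlaps)
    categories = []
    for overlap, label in zip(overlaps, labels):
        high = overlap >= threshold
        agrees = (high and label == 1) or (not high and label == 0)
        categories.append("obvious" if agrees else "non-obvious")
    return categories


pairs = [
    ("how do i learn python", "how do i learn python fast"),
    ("what is the capital of france", "which city is the french capital"),
    ("how do i cook rice", "how do i cook pasta"),
    ("best laptop for students", "weather in paris today"),
]
labels = [1, 1, 0, 0]
for pair, category in zip(pairs, split_obvious(pairs, labels, threshold=0.5)):
    print(category, pair)
```

Under these assumptions, an evaluation could report a metric such as F1 separately on each subset, so that performance on the non-obvious pairs is not masked by the obvious ones.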

Anthology ID: P19-1268
Volume: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month: July
Year: 2019
Address: Florence, Italy
Editors: Anna Korhonen, David Traum, Lluís Màrquez
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 2792–2798
URL: https://preview.aclanthology.org/fix-sig-urls/P19-1268/
DOI: 10.18653/v1/P19-1268
Cite (ACL): Nicole Peinelt, Maria Liakata, and Dong Nguyen. 2019. Aiming beyond the Obvious: Identifying Non-Obvious Cases in Semantic Similarity Datasets. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2792–2798, Florence, Italy. Association for Computational Linguistics.
Cite (Informal): Aiming beyond the Obvious: Identifying Non-Obvious Cases in Semantic Similarity Datasets (Peinelt et al., ACL 2019)
PDF: https://preview.aclanthology.org/fix-sig-urls/P19-1268.pdf
Supplementary: P19-1268.Supplementary.pdf
Video: https://preview.aclanthology.org/fix-sig-urls/P19-1268.mp4
Code: wuningxi/LexSim
