Unsupervised Partial Sentence Matching for Cited Text Identification
Kathryn Ricci, Haw-Shiuan Chang, Purujit Goyal, Andrew McCallum
Abstract
Given a citation in the body of a research paper, cited text identification aims to find the sentences in the cited paper that are most relevant to the citing sentence. The task is fundamentally one of sentence matching, where affinity is often assessed by a cosine similarity between sentence embeddings. However, (a) sentences may not be well-represented by a single embedding because they contain multiple distinct semantic aspects, and (b) good matches may not require a strong match in all aspects. To overcome these limitations, we propose a simple and efficient unsupervised method for cited text identification that adapts an asymmetric similarity measure to allow partial matches of multiple aspects in both sentences. On the CL-SciSumm dataset we find that our method outperforms a baseline symmetric approach, and, surprisingly, also outperforms all supervised and unsupervised systems submitted to past editions of CL-SciSumm Shared Task 1a.
- Anthology ID:
- 2022.sdp-1.11
- Volume:
- Proceedings of the Third Workshop on Scholarly Document Processing
- Month:
- October
- Year:
- 2022
- Address:
- Gyeongju, Republic of Korea
- Venue:
- sdp
- Publisher:
- Association for Computational Linguistics
- Pages:
- 95–104
- URL:
- https://aclanthology.org/2022.sdp-1.11
- Cite (ACL):
- Kathryn Ricci, Haw-Shiuan Chang, Purujit Goyal, and Andrew McCallum. 2022. Unsupervised Partial Sentence Matching for Cited Text Identification. In Proceedings of the Third Workshop on Scholarly Document Processing, pages 95–104, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Cite (Informal):
- Unsupervised Partial Sentence Matching for Cited Text Identification (Ricci et al., sdp 2022)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2022.sdp-1.11.pdf
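
The abstract contrasts a symmetric cosine-similarity baseline with an asymmetric measure that tolerates partial matches. The sketch below illustrates that contrast on toy vectors. It is not the authors' method: the function names, the toy embeddings, and the "average of best per-aspect cosine" score are all assumptions chosen only to show how a score can be asymmetric and still reward a match on a subset of aspects.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(citing_vec, cited_vecs):
    """Symmetric baseline: one embedding per sentence.

    Returns indices of the cited sentences, best match first.
    """
    scores = [cosine(citing_vec, v) for v in cited_vecs]
    return sorted(range(len(cited_vecs)), key=lambda i: scores[i], reverse=True)


def asymmetric_partial_score(query_aspects, target_aspects):
    """Hypothetical asymmetric score (an assumption, not the paper's measure).

    Each sentence is a list of aspect vectors. For every query aspect, take
    its best cosine match among the target aspects, then average. Target
    aspects that match nothing in the query are not penalized.
    """
    best = [max(cosine(q, t) for t in target_aspects) for q in query_aspects]
    return sum(best) / len(best)


# Toy single-embedding example (made-up 2-d vectors).
citing_vec = [1.0, 0.0]
cited_vecs = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
ranking = rank_by_cosine(citing_vec, cited_vecs)

# Toy multi-aspect example: the citing sentence covers one aspect,
# the cited sentence covers that aspect plus an unrelated one.
citing_aspects = [[1.0, 0.0]]
cited_aspects = [[1.0, 0.0], [0.0, 1.0]]
forward = asymmetric_partial_score(citing_aspects, cited_aspects)
backward = asymmetric_partial_score(cited_aspects, citing_aspects)

print(ranking, forward, backward)
```

In this toy setup the score from the citing sentence to the cited sentence is higher than the score in the reverse direction, because the cited sentence's extra aspect has no counterpart in the citing sentence. A symmetric cosine over single embeddings cannot express that difference.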