Purujit Goyal

2022

Unsupervised Partial Sentence Matching for Cited Text Identification
Kathryn Ricci | Haw-Shiuan Chang | Purujit Goyal | Andrew McCallum
Proceedings of the Third Workshop on Scholarly Document Processing

Given a citation in the body of a research paper, cited text identification aims to find the sentences in the cited paper that are most relevant to the citing sentence. The task is fundamentally one of sentence matching, where affinity is often assessed by a cosine similarity between sentence embeddings. However, (a) sentences may not be well-represented by a single embedding because they contain multiple distinct semantic aspects, and (b) good matches may not require a strong match in all aspects. To overcome these limitations, we propose a simple and efficient unsupervised method for cited text identification that adapts an asymmetric similarity measure to allow partial matches of multiple aspects in both sentences. On the CL-SciSumm dataset we find that our method outperforms a baseline symmetric approach, and, surprisingly, also outperforms all supervised and unsupervised systems submitted to past editions of CL-SciSumm Shared Task 1a.
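The contrast the abstract draws between a symmetric, single-embedding score and an asymmetric partial match can be illustrated with a small sketch. This is not the paper's method. It assumes, purely for illustration, that each sentence is already represented as a list of "aspect" vectors, and it uses a hypothetical scoring rule in which every aspect of the citing sentence is matched to its best counterpart in the candidate sentence. All function names are made up for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_vector(vectors):
    """Collapse several aspect vectors into one embedding by averaging."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def symmetric_score(query_aspects, candidate_aspects):
    """Baseline: cosine similarity between the two averaged embeddings."""
    return cosine(mean_vector(query_aspects), mean_vector(candidate_aspects))

def asymmetric_partial_score(query_aspects, candidate_aspects):
    """Illustrative asymmetric score: each query aspect only needs one good
    match in the candidate; unmatched candidate aspects are not penalized."""
    best = [max(cosine(q, c) for c in candidate_aspects) for q in query_aspects]
    return sum(best) / len(best)

# Toy data: the citing sentence has one aspect, the cited sentence has three.
citing = [[1.0, 0.0, 0.0]]
cited = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(symmetric_score(citing, cited))           # diluted by the two unrelated aspects
print(asymmetric_partial_score(citing, cited))  # the single citing aspect is fully matched
print(asymmetric_partial_score(cited, citing))  # reversing the direction changes the score
```

In this toy case the averaged embedding of the cited sentence is pulled away from the citing sentence by its two unrelated aspects, while the asymmetric score still credits the one aspect that matches exactly.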

2021

Box Embeddings: An open-source library for representation learning using geometric structures
Tejas Chheda | Purujit Goyal | Trang Tran | Dhruvesh Patel | Michael Boratko | Shib Sankar Dasgupta | Andrew McCallum
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

A fundamental component of the success of modern representation learning is the ease of performing various vector operations. Recently, objects with more geometric structure (e.g., distributions, complex or hyperbolic vectors, or regions such as cones, disks, or boxes) have been explored for their alternative inductive biases and additional representational capacity. In this work, we introduce Box Embeddings, a Python library that enables researchers to easily apply and extend probabilistic box embeddings. Fundamental geometric operations on boxes are implemented in a numerically stable way, as are modern approaches to training boxes that mitigate gradient sparsity. The library is fully open source, and compatible with both PyTorch and TensorFlow, which allows existing neural network layers to be replaced with or transformed into boxes easily. In this work, we present the implementation details of the fundamental components of the library, and the concepts required to use box representations alongside existing neural network architectures.
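The geometric operations the abstract mentions can be sketched in plain Python. The code below does not use the Box Embeddings library or reproduce its API. The representation of a box as a pair of corner lists, and the names `intersect`, `hard_volume`, and `soft_volume`, are assumptions made for this example. The softplus-smoothed volume is shown as one common way to keep a nonzero training signal when boxes do not overlap, which may differ from what the library implements.

```python
import math

def intersect(box_a, box_b):
    """Intersection of two axis-aligned boxes, each given as (min_corner, max_corner).
    The result may be 'inverted' (min > max) in some dimension if the boxes are disjoint."""
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    lo = [max(x, y) for x, y in zip(a_min, b_min)]
    hi = [min(x, y) for x, y in zip(a_max, b_max)]
    return lo, hi

def hard_volume(box):
    """Exact volume: product of side lengths, clamped at zero.
    Disjoint boxes give exactly 0, so a gradient through this value would vanish."""
    lo, hi = box
    return math.prod(max(h - l, 0.0) for l, h in zip(lo, hi))

def softplus(x):
    """Numerically stable softplus: log(1 + exp(x)) without overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def soft_volume(box, temperature=1.0):
    """Smoothed volume: each side length passes through a softplus,
    so the value stays strictly positive even for inverted boxes."""
    lo, hi = box
    return math.prod(temperature * softplus((h - l) / temperature) for l, h in zip(lo, hi))

a = ([0.0, 0.0], [2.0, 2.0])
b = ([1.0, 1.0], [3.0, 3.0])
c = ([5.0, 5.0], [6.0, 6.0])

# Overlap-based score: volume of the intersection relative to the volume of b.
print(hard_volume(intersect(a, b)) / hard_volume(b))
# a and c are disjoint: the hard volume is zero, the soft volume is small but positive.
print(hard_volume(intersect(a, c)), soft_volume(intersect(a, c)))
```

The ratio of intersection volume to box volume is the quantity usually read as a conditional probability in probabilistic box models. The disjoint pair shows why a smoothed volume is useful during training.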

Co-authors

Dhruvesh Patel 1

Michael Boratko 1

Shib Sankar Dasgupta 1

Venues

sdp 1
emnlp 1