Harish Karnick

2021

SumPubMed: Summarization Dataset of PubMed Scientific Articles
Vivek Gupta | Prerna Bharti | Pegah Nokhiz | Harish Karnick
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

Most earlier work on text summarization has been carried out on news article datasets. The summary in these datasets is naturally located at the beginning of the text. Hence, a model can spuriously exploit this correlation for summary generation instead of truly learning to summarize. To address this issue, we constructed a new dataset, SumPubMed, using scientific articles from the PubMed archive. We conducted a human analysis of summary coverage, redundancy, readability, coherence, and informativeness on SumPubMed. SumPubMed is challenging because (a) the summary is distributed throughout the text (not localized at the top), and (b) it contains rare domain-specific scientific terms. We observe that seq2seq models that adequately summarize news articles struggle to summarize SumPubMed. Thus, SumPubMed opens new avenues for the future improvement of models as well as the development of new evaluation metrics.

2018

Unsupervised Semantic Abstractive Summarization
Shibhansh Dohare | Vivek Gupta | Harish Karnick
Proceedings of ACL 2018, Student Research Workshop

Automatic abstractive summary generation remains a significant open problem for natural language processing. In this work, we develop a novel pipeline for Semantic Abstractive Summarization (SAS). SAS, as introduced by Liu et al. (2015), first generates an AMR graph of an input story, from which it extracts a summary graph, and finally creates summary sentences from this summary graph. Compared to earlier approaches, we develop a more comprehensive method to generate the story AMR graph using state-of-the-art co-reference resolution and Meta Nodes. We then use this graph in a novel unsupervised algorithm, based on how humans summarize a piece of text, to extract the summary sub-graph. Our algorithm outperforms the state-of-the-art SAS method by 1.7% F1 score in node prediction.

2017

SCDV: Sparse Composite Document Vectors using soft clustering over distributional representations
Dheeraj Mekala | Vivek Gupta | Bhargavi Paranjape | Harish Karnick
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We present a feature vector formation technique for documents, Sparse Composite Document Vector (SCDV), which overcomes several shortcomings of the current distributional paragraph vector representations that are widely used for text representation. In SCDV, word embeddings are clustered to capture the multiple semantic contexts in which words occur. They are then chained together to form document topic-vectors that can express complex, multi-topic documents. Through extensive experiments on multi-class and multi-label classification tasks, we outperform the previous state-of-the-art method, NTSG. We also show that SCDV embeddings perform well on heterogeneous tasks like topic coherence, context-sensitive learning, and information retrieval. Moreover, we achieve a significant reduction in training and prediction times compared to other representation methods. SCDV achieves the best of both worlds: better performance with lower time and space complexity.
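The chaining step described in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the idf weighting, the L2 normalization, and the fixed sparsity threshold below are assumptions, as are all names (`scdv_document_vector`, `WORD_VECS`, `CLUSTER_PROBS`, `IDF`) and the toy values. Soft cluster memberships are taken as given; in practice they might come from a mixture model fitted over the word embeddings.

```python
import math

def scdv_document_vector(doc_words, word_vecs, cluster_probs, idf, threshold=0.0):
    """Build one document vector from soft-clustered word embeddings.

    word_vecs:     word -> embedding (list of floats), hypothetical toy input
    cluster_probs: word -> soft membership in each of K clusters
    idf:           word -> weight (assumed weighting, not stated in the abstract)
    threshold:     entries with smaller magnitude are zeroed (assumed sparsity step)
    """
    dim = len(next(iter(word_vecs.values())))
    n_clusters = len(next(iter(cluster_probs.values())))
    doc_vec = [0.0] * (dim * n_clusters)
    for word in doc_words:
        if word not in word_vecs:
            continue  # skip out-of-vocabulary words
        vec = word_vecs[word]
        weight = idf.get(word, 1.0)
        # Chain K membership-weighted copies of the embedding end to end
        # and accumulate them into the document vector.
        offset = 0
        for p in cluster_probs[word]:
            for x in vec:
                doc_vec[offset] += weight * p * x
                offset += 1
    norm = math.sqrt(sum(x * x for x in doc_vec))
    if norm > 0:
        doc_vec = [x / norm for x in doc_vec]
    # Zero out small entries so the final vector is sparse.
    return [x if abs(x) >= threshold else 0.0 for x in doc_vec]

# Toy inputs, purely illustrative.
WORD_VECS = {"goal": [1.0, 0.0], "vote": [0.0, 1.0]}
CLUSTER_PROBS = {"goal": [0.9, 0.1], "vote": [0.2, 0.8]}
IDF = {"goal": 1.0, "vote": 1.0}

vector = scdv_document_vector(["goal"], WORD_VECS, CLUSTER_PROBS, IDF, threshold=0.2)
```

With two-dimensional embeddings and two clusters, the result has four entries, and the threshold removes the weak contribution of "goal" to the second cluster.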

2016

Automatic tagging and retrieval of E-Commerce products based on visual features
Vasu Sharma | Harish Karnick
Proceedings of the NAACL Student Research Workshop

Product Classification in E-Commerce using Distributional Semantics
Vivek Gupta | Harish Karnick | Ashendra Bansal | Pradhuman Jhala
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Product classification is the task of automatically predicting a taxonomy path for a product in a predefined taxonomy hierarchy, given a textual product description or title. For efficient product classification we require a suitable feature vector representation for a document (the textual description of a product) and efficient and fast algorithms for prediction. To address the above challenges, we propose a new distributional semantics representation for document vector formation. We also develop a new two-level ensemble approach utilising (with respect to the taxonomy tree) path-wise, node-wise and depth-wise classifiers to reduce error in the final product classification task. Our experiments on data sets from a leading e-commerce platform show the effectiveness of the distributional representation and the ensemble approach, with improved results on various evaluation metrics compared to earlier approaches.

A parser based on the Cocke-Younger-Kasami algorithm is described here; it uses immediate dominance and linear precedence rules together with various feature inheritance conventions. The meta-rules in the grammar are not applied beforehand but only when needed. This ensures that the rule set is kept to a minimum. At the same time, determining what rule to expand by applying which meta-rule is done in an efficient manner using the meta-rule reference table. Since this table is generated during the “compilation” stage, its generation does not add to parsing time.
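For orientation, the chart-filling core of the Cocke-Younger-Kasami algorithm can be sketched as a plain recognizer over a grammar in Chomsky normal form. This is a simplification and not the parser described above: it omits immediate dominance and linear precedence rules, feature inheritance, and on-demand meta-rule expansion. The function name and the toy grammar are hypothetical.

```python
from itertools import product

def cyk_recognize(tokens, lexical, binary, start="S"):
    """Return True if `tokens` can be derived from `start`.

    lexical: word -> set of nonterminals A with a rule A -> word
    binary:  (B, C) -> set of nonterminals A with a rule A -> B C
    """
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that span tokens[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(tokens):
        table[i][i] = set(lexical.get(word, ()))
    # Fill the chart bottom-up, from short spans to long ones.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two halves
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= binary.get((left, right), set())
    return start in table[0][n - 1]

# Toy grammar, purely illustrative.
LEXICAL = {"she": {"NP"}, "eats": {"V"}, "fish": {"NP"}}
BINARY = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}

accepted = cyk_recognize(["she", "eats", "fish"], LEXICAL, BINARY)
```

A parser that expands meta-rules lazily, as the abstract describes, would consult its reference table where this sketch does a direct lookup in `binary`.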