Vinayshekhar Bannihatti Kumar

Also published as: Vinayshekhar Bannihatti Kumar


A Tale of Two Regulatory Regimes: Creation and Analysis of a Bilingual Privacy Policy Corpus
Siddhant Arora | Henry Hosseini | Christine Utz | Vinayshekhar Bannihatti Kumar | Tristan Dhellemmes | Abhilasha Ravichander | Peter Story | Jasmine Mangat | Rex Chen | Martin Degeling | Thomas Norton | Thomas Hupperich | Shomir Wilson | Norman Sadeh
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Over the past decade, researchers have started to explore the use of NLP to develop tools aimed at helping the public, vendors, and regulators analyze disclosures made in privacy policies. With the introduction of new privacy regulations, the language of privacy policies is also evolving, and disclosures made by the same organization are not always the same in different languages, especially when used to communicate with users who fall under different jurisdictions. This work explores the use of language technologies to capture and analyze these differences at scale. We introduce an annotation scheme designed to capture the nuances of two new landmark privacy regulations, namely the EU’s GDPR and California’s CCPA/CPRA. We then introduce the first bilingual corpus of mobile app privacy policies consisting of 64 privacy policies in English (292K words) and 91 privacy policies in German (478K words), respectively with manual annotations for 8K and 19K fine-grained data practices. The annotations are used to develop computational methods that can automatically extract “disclosures” from privacy policies. Analysis of a subset of 59 “semi-parallel” policies reveals differences that can be attributed to different regulatory regimes, suggesting that systematic analysis of policies using automated language technologies is indeed a worthwhile endeavor.

Are Abstractive Summarization Models truly ‘Abstractive’? An Empirical Study to Compare the two Forms of Summarization
Vinayshekhar Bannihatti Kumar | Rashmi Gangadharaiah
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Automatic Text Summarization has seen a large paradigm shift from extractive methods to abstractive (or generation-based) methods in the last few years. This can be attributed to the availability of large autoregressive language models that have been shown to outperform extractive methods. In this work, we revisit extractive methods and study their performance against state of the art(SOTA) abstractive models. Through extensive studies, we notice that abstractive methods are not yet completely abstractive in their generated summaries. In addition to this finding, we propose an evaluation metric that could benefit the summarization research community to measure the degree of abstractiveness of a summary in comparison to their extractive counterparts. To confirm the generalizability of our findings, we conduct experiments on two summarization datasets using five powerful techniques in extractive and abstractive summarization and study their levels of abstraction.

Towards Cross-Domain Transferability of Text Generation Models for Legal Text
Vinayshekhar Bannihatti Kumar | Kasturi Bhattacharjee | Rashmi Gangadharaiah
Proceedings of the Natural Legal Language Processing Workshop 2022

Legalese can often be filled with verbose domain-specific jargon which can make it challenging to understand and use for non-experts. Creating succinct summaries of legal documents often makes it easier for user comprehension. However, obtaining labeled data for every domain of legal text is challenging, which makes cross-domain transferability of text generation models for legal text, an important area of research. In this paper, we explore the ability of existing state-of-the-art T5 & BART-based summarization models to transfer across legal domains. We leverage publicly available datasets across four domains for this task, one of which is a new resource for summarizing privacy policies, that we curate and release for academic research. Our experiments demonstrate the low cross-domain transferability of these models, while also highlighting the benefits of combining different domains. Further, we compare the effectiveness of standard metrics for this task and illustrate the vast differences in their performance.


SupportNet: Neural Networks for Summary Generation and Key Segment Extraction from Technical Support Tickets
Vinayshekhar Bannihatti Kumar | Mohan Yarramsetty | Sharon Sun | Anukul Goel
Proceedings of the 4th Workshop on e-Commerce and NLP

We improve customer experience and gain their trust when their issues are resolved rapidly with less friction. Existing work has focused on reducing the overall case resolution time by binning a case into predefined categories and routing it to the desired support engineer. However, the actions taken by the engineer during case analysis and resolution are altogether ignored, even though it forms the bulk of the case resolution time. In this work, we propose two systems that enable support engineers to resolve cases faster. The first, a guidance extraction model, mines historical cases and provides technical guidance phrases to the support engineers. The phrases can then be used to educate the customer or to obtain critical information needed to resolve the case and thus minimize the number of correspondences between the engineer and customer. The second, a summarization model, creates an abstractive summary of the case to provide better context to the support engineer. Through quantitative evaluation we obtain an F1 score of 0.64 on the guidance extraction model and a BertScore (F1) of 0.55 on the summarization model.


WriterForcing: Generating more interesting story endings
Prakhar Gupta | Vinayshekhar Bannihatti Kumar | Mukul Bhutani | Alan W Black
Proceedings of the Second Workshop on Storytelling

We study the problem of generating interesting endings for stories. Neural generative models have shown promising results for various text generation problems. Sequence to Sequence (Seq2Seq) models are typically trained to generate a single output sequence for a given input sequence. However, in the context of a story, multiple endings are possible. Seq2Seq models tend to ignore the context and generate generic and dull responses. Very few works have studied generating diverse and interesting story endings for the same story context. In this paper, we propose models which generate more diverse and interesting outputs by 1) training models to focus attention on important keyphrases of the story, and 2) promoting generating nongeneric words. We show that the combination of the two leads to more interesting endings.

Dr.Quad at MEDIQA 2019: Towards Textual Inference and Question Entailment using contextualized representations
Vinayshekhar Bannihatti Kumar | Ashwin Srinivasan | Aditi Chaudhary | James Route | Teruko Mitamura | Eric Nyberg
Proceedings of the 18th BioNLP Workshop and Shared Task

This paper presents the submissions by TeamDr.Quad to the ACL-BioNLP 2019 shared task on Textual Inference and Question Entailment in the Medical Domain. Our system is based on the prior work Liu et al. (2019) which uses a multi-task objective function for textual entailment. In this work, we explore different strategies for generalizing state-of-the-art language understanding models to the specialized medical domain. Our results on the shared task demonstrate that incorporating domain knowledge through data augmentation is a powerful strategy for addressing challenges posed specialized domains such as medicine.