2024
Why Generate When You Can Discriminate? A Novel Technique for Text Classification using Language Models
Sachin Pawar | Nitin Ramrakhiyani | Anubhav Sinha | Manoj Apte | Girish Palshikar
Findings of the Association for Computational Linguistics: EACL 2024
In this paper, we propose a novel two-step technique for text classification using autoregressive Language Models (LM). In the first step, a set of perplexity and log-likelihood based numeric features is elicited from an LM for a text instance to be classified. Then, in the second step, a classifier based on these features is trained to predict the final label. The classifier used is usually a simple machine learning classifier like Support Vector Machine (SVM) or Logistic Regression (LR) and it is trained using a small set of training examples. We believe our technique presents a whole new way of exploiting the available training instances, in addition to the existing ways like fine-tuning LMs or in-context learning. Our approach stands out by eliminating the need for parameter updates in LMs, as required in fine-tuning, and does not impose the limitations on the number of training examples that are faced while building prompts for in-context learning. We evaluate our technique across 5 different datasets and compare with multiple competent baselines.
2023
Legal Argument Extraction from Court Judgements using Integer Linear Programming
Basit Ali | Sachin Pawar | Girish Palshikar | Anindita Sinha Banerjee | Dhirendra Singh
Proceedings of the 10th Workshop on Argument Mining
Legal arguments are one of the key aspects of legal knowledge which are expressed in various ways in the unstructured text of court judgements. A large database of past legal arguments can be created by extracting arguments from court judgements, categorizing them, and storing them in a structured format. Such a database would be useful for suggesting suitable arguments for any new case. In this paper, we focus on extracting arguments from Indian Supreme Court judgements using minimal supervision. We first identify a set of sentence-level argument markers which are useful for argument extraction, such as whether a sentence contains a claim or not, whether a sentence is argumentative in nature, whether two sentences are part of the same argument, etc. We then model the legal argument extraction problem as a text segmentation problem where we combine multiple pieces of weak evidence in the form of argument markers using Integer Linear Programming (ILP), finally arriving at a global document-level solution giving the optimal legal arguments. We demonstrate the effectiveness of our technique by comparing it against several competent baselines.
Zero-shot Probing of Pretrained Language Models for Geography Knowledge
Nitin Ramrakhiyani | Vasudeva Varma | Girish Palshikar | Sachin Pawar
Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems
Gauging the knowledge of Pretrained Language Models (PLMs) about facts in niche domains is an important step towards making them better in those domains. In this paper, we aim at evaluating multiple PLMs for their knowledge about world Geography. We (i) contribute a sufficiently sized dataset of masked Geography sentences to probe PLMs on masked token prediction and generation tasks, and (ii) benchmark the performance of multiple PLMs on the dataset. We also provide a detailed analysis of the performance of the PLMs on different Geography facts.
Audit Report Coverage Assessment using Sentence Classification
Sushodhan Vaishampayan | Nitin Ramrakhiyani | Sachin Pawar | Aditi Pawde | Manoj Apte | Girish Palshikar
Proceedings of the Sixth Workshop on Financial Technology and Natural Language Processing
Audit reports are a window to the financial health of a company and hence gauging coverage of various audit aspects in them is important. In this paper, we aim at determining an audit report’s coverage through classification of its sentences into multiple domain specific classes. In a weakly supervised setting, we employ a rule-based approach to automatically create training data for a BERT-based multi-label classifier. We then devise an ensemble to combine both the rule-based and classifier approaches. Further, we employ two novel ways to improve the ensemble’s generalization: (i) through an active learning based approach and (ii) through an LLM-based review. We demonstrate that our proposed approaches outperform several baselines. We show the utility of the proposed approaches to measure audit coverage on a large dataset of 2.8K audit reports.
Evaluation Metrics for Depth and Flow of Knowledge in Non-fiction Narrative Texts
Sachin Pawar | Girish Palshikar | Ankita Jain | Mahesh Singh | Mahesh Rangarajan | Aman Agarwal | Vishal Kumar | Karan Singh
Proceedings of the 5th Workshop on Narrative Understanding
In this paper, we describe the problem of automatically evaluating quality of knowledge expressed in a non-fiction narrative text. We focus on a specific type of documents where each document describes a certain technical problem and its solution. The goal is not only to evaluate the quality of knowledge in such a document, but also to automatically suggest possible improvements to the writer so that a better knowledge-rich document is produced. We propose new evaluation metrics to evaluate quality of knowledge contents as well as flow of different types of sentences. The suggestions for improvement are generated based on these metrics. The proposed metrics are completely unsupervised in nature and they are derived from a set of simple corpus statistics. We demonstrate the effectiveness of the proposed metrics as compared to other existing baseline metrics in our experiments.
2022
Constructing A Dataset of Support and Attack Relations in Legal Arguments in Court Judgements using Linguistic Rules
Basit Ali | Sachin Pawar | Girish Palshikar | Rituraj Singh
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Argumentation mining is a growing area of research and has several interesting practical applications of mining legal arguments. Support and Attack relations are the backbone of any legal argument. However, there is no publicly available dataset of these relations in the context of legal arguments expressed in court judgements. In this paper, we focus on automatically constructing such a dataset of Support and Attack relations between sentences in a court judgement with reasonable accuracy. We propose three sets of rules based on linguistic knowledge and distant supervision to identify such relations from Indian Supreme Court judgements. The first rule set is based on multiple discourse connectors, the second rule set is based on common semantic structures between argumentative sentences in a close neighbourhood, and the third rule set uses the information about the source of the argument. We also explore a BERT-based sentence pair classification model which is trained on this dataset. We release the dataset of 20506 sentence pairs - 10746 Support (precision 77.3%) and 9760 Attack (precision 65.8%). We believe that this dataset and the ideas explored in designing the linguistic rules will boost the argumentation mining research for legal arguments.
Weakly Supervised Context-based Interview Question Generation
Samiran Pal | Kaamraan Khan | Avinash Kumar Singh | Subhasish Ghosh | Tapas Nayak | Girish Palshikar | Indrajit Bhattacharya
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
We explore the task of automated generation of technical interview questions from a given textbook. Such questions are different from those for reading comprehension studied in question generation literature. We curate a context based interview questions data set for Machine Learning and Deep Learning from two popular textbooks. We first explore the possibility of using a large generative language model (GPT-3) for this task in a zero shot setting. We then evaluate the performance of smaller generative models such as BART fine-tuned on weakly supervised data obtained using GPT-3 and hand-crafted templates. We deploy an automatic question importance assignment technique to figure out suitability of a question in a technical interview. It improves the evaluation results in many dimensions. We dissect the performance of these models for this task and also scrutinize the suitability of questions generated by them for use in technical interviews.
2021
Generating An Optimal Interview Question Plan Using A Knowledge Graph And Integer Linear Programming
Soham Datta | Prabir Mallick | Sangameshwar Patil | Indrajit Bhattacharya | Girish Palshikar
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Given the diversity of the candidates and complexity of job requirements, and since interviewing is an inherently subjective process, it is an important task to ensure consistent, uniform, efficient and objective interviews that result in high quality recruitment. We propose an interview assistant system to automatically, and in an objective manner, select an optimal set of technical questions (from question banks) personalized for a candidate. This set can help a human interviewer to plan for an upcoming interview of that candidate. We formalize the problem of selecting a set of questions as an integer linear programming problem and use standard solvers to get a solution. We use a knowledge graph as background knowledge in this formulation, and derive our objective functions and constraints from it. We use the candidate’s resume to personalize the selection of questions. We propose an intrinsic evaluation to compare a set of suggested questions with actually asked questions. We also use expert interviewers to comparatively evaluate our approach with a set of reasonable baselines.
Temporal Question Generation from History Text
Harsimran Bedi | Sangameshwar Patil | Girish Palshikar
Proceedings of the 18th International Conference on Natural Language Processing (ICON)
Temporal analysis of history text has always held special significance to students, historians and the Social Sciences community in general. We observe from experimental data that the existing deep learning (DL) models ProphetNet and UniLM for the question generation (QG) task do not perform satisfactorily when used directly for temporal QG from history text. We propose linguistically motivated templates for generating temporal questions that probe different aspects of history text and show that finetuning the DL models using the temporal questions significantly improves their performance on the temporal QG task. Using automated metrics as well as human expert evaluation, we show that performance of the DL models finetuned with the template-based questions is better than finetuning done with temporal questions from SQuAD.
Weakly Supervised Extraction of Tasks from Text
Sachin Pawar | Girish Palshikar | Anindita Sinha Banerjee
Proceedings of the 18th International Conference on Natural Language Processing (ICON)
In this paper, we propose a novel problem of automatic extraction of tasks from text. A task is a well-defined knowledge-based volitional action. We describe various characteristics of tasks as well as compare and contrast them with events. We propose two techniques for task extraction – i) using linguistic patterns and ii) using a BERT-based weakly supervised neural model. We evaluate our techniques with other competent baselines on 4 datasets from different domains. Overall, the BERT-based weakly supervised neural model generalizes better across multiple domains as compared to the purely linguistic patterns based approach.
Extracting Events from Industrial Incident Reports
Nitin Ramrakhiyani | Swapnil Hingmire | Sangameshwar Patil | Alok Kumar | Girish Palshikar
Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021)
Incidents in industries have huge social and political impact and minimizing the consequent damage has been a high priority. However, automated analysis of repositories of incident reports has remained a challenge. In this paper, we focus on automatically extracting events from incident reports. Due to the absence of event-annotated datasets for industrial incidents, we employ a transfer learning based approach which is shown to outperform several baselines. We further provide a detailed analysis of the effect of an increase in pre-training data and explain why pre-training improves the performance.
FrameNet-assisted Noun Compound Interpretation
Girishkumar Ponkiya | Diptesh Kanojia | Pushpak Bhattacharyya | Girish Palshikar
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
2020
Weak Supervision using Linguistic Knowledge for Information Extraction
Sachin Pawar | Girish Palshikar | Ankita Jain | Jyoti Bhat | Simi Johnson
Proceedings of the 17th International Conference on Natural Language Processing (ICON)
In this paper, we propose to use linguistic knowledge to automatically augment a small manually annotated corpus to obtain a large annotated corpus for training Information Extraction models. We propose a powerful patterns specification language for specifying linguistic rules for entity extraction. We define an Enriched Text Format (ETF) to represent rich linguistic information about a text in the form of XML-like tags. The patterns in our patterns specification language are then matched on the ETF text rather than raw text to extract various entity mentions. We demonstrate how an entity extraction system can be quickly built for a domain-specific entity type for which there are no readily available annotated datasets.
Looking inside Noun Compounds: Unsupervised Prepositional and Free Paraphrasing
Girishkumar Ponkiya | Rudra Murthy | Pushpak Bhattacharyya | Girish Palshikar
Findings of the Association for Computational Linguistics: EMNLP 2020
A noun compound is a sequence of contiguous nouns that acts as a single noun, although the predicate denoting the semantic relation between its components is dropped. Noun Compound Interpretation is the task of uncovering the relation, in the form of a preposition or a free paraphrase. Prepositional paraphrasing refers to the use of preposition to explain the semantic relation, whereas free paraphrasing refers to invoking an appropriate predicate denoting the semantic relation. In this paper, we propose an unsupervised methodology for these two types of paraphrasing. We use pre-trained contextualized language models to uncover the ‘missing’ words (preposition or predicate). These language models are usually trained to uncover the missing word/words in a given input sentence. Our approach uses templates to prepare the input sequence for the language model. The template uses a special token to indicate the missing predicate. As the model has already been pre-trained to uncover a missing word (or a sequence of words), we exploit it to predict missing words for the input sequence. Our experiments using four datasets show that our unsupervised approach (a) performs comparably to supervised approaches for prepositional paraphrasing, and (b) outperforms supervised approaches for free paraphrasing. Paraphrasing (prepositional or free) using our unsupervised approach is potentially helpful for NLP tasks like machine translation and information extraction.
Extracting Message Sequence Charts from Hindi Narrative Text
Swapnil Hingmire | Nitin Ramrakhiyani | Avinash Kumar Singh | Sangameshwar Patil | Girish Palshikar | Pushpak Bhattacharyya | Vasudeva Varma
Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events
In this paper, we propose the use of Message Sequence Charts (MSC) as a representation for visualizing narrative text in Hindi. An MSC is a formal representation allowing the depiction of actors and interactions among these actors in a scenario, apart from supporting a rich framework for formal inference. We propose an approach to extract MSC actors and interactions from a Hindi narrative. As a part of the approach, we enrich an existing event annotation scheme where we provide guidelines for annotation of the mood of events (realis vs irrealis) and guidelines for annotation of event arguments. We report performance on multiple evaluation criteria by experimenting with Hindi narratives from Indian History. Though Hindi is the fourth most-spoken first language in the world, from the NLP perspective it has comparatively fewer resources than English. Moreover, there is relatively less work in the context of event processing in Hindi. Hence, we believe that this work is among the initial works for Hindi event processing.
2019
Extraction of Message Sequence Charts from Narrative History Text
Girish Palshikar | Sachin Pawar | Sangameshwar Patil | Swapnil Hingmire | Nitin Ramrakhiyani | Harsimran Bedi | Pushpak Bhattacharyya | Vasudeva Varma
Proceedings of the First Workshop on Narrative Understanding
In this paper, we advocate the use of Message Sequence Chart (MSC) as a knowledge representation to capture and visualize multi-actor interactions and their temporal ordering. We propose algorithms to automatically extract an MSC from a history narrative. For a given narrative, we first identify verbs which indicate interactions and then use dependency parsing and Semantic Role Labelling based approaches to identify senders (initiating actors) and receivers (other actors involved) for these interaction verbs. As a final step in MSC extraction, we employ a state-of-the-art algorithm to temporally re-order these interactions. Our evaluation on multiple publicly available narratives shows improvements over four baselines.
Towards Disambiguating Contracts for their Successful Execution - A Case from Finance Domain
Preethu Rose Anish | Abhishek Sainani | Nitin Ramrakhiyani | Sachin Pawar | Girish K Palshikar | Smita Ghaisas
Proceedings of the First Workshop on Financial Technology and Natural Language Processing
Extraction of Message Sequence Charts from Software Use-Case Descriptions
Girish Palshikar | Nitin Ramrakhiyani | Sangameshwar Patil | Sachin Pawar | Swapnil Hingmire | Vasudeva Varma | Pushpak Bhattacharyya
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers)
Software Requirement Specification documents provide natural language descriptions of the core functional requirements as a set of use-cases. Essentially, each use-case contains a set of actors and sequences of steps describing the interactions among them. Goals of use-case reviews and analyses include their correctness, completeness, detection of ambiguities, prototyping, verification, test case generation and traceability. Message Sequence Charts (MSCs) have been proposed as an expressive, rigorous yet intuitive visual representation of use-cases. In this paper, we describe a linguistic knowledge-based approach to extract MSCs from use-cases. Compared to existing techniques, we extract richer constructs of the MSC notation such as timers, conditions and alt-boxes. We apply this tool to extract MSCs from several real-life software use-case descriptions and show that it performs better than the existing techniques. We also discuss the benefits and limitations of the extracted MSCs to meet the above goals.
2018
Resolving Actor Coreferences in Hindi Narrative Text
Nitin Ramrakhiyani | Swapnil Hingmire | Sachin Pawar | Sangameshwar Patil | Girish K. Palshikar | Pushpak Bhattacharyya | Vasudeva Verma
Proceedings of the 15th International Conference on Natural Language Processing
Identification of Alias Links among Participants in Narratives
Sangameshwar Patil | Sachin Pawar | Swapnil Hingmire | Girish Palshikar | Vasudeva Varma | Pushpak Bhattacharyya
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Identification of distinct and independent participants (entities of interest) in a narrative is an important task for many NLP applications. This task becomes challenging because these participants are often referred to using multiple aliases. In this paper, we propose an approach based on linguistic knowledge for identification of aliases mentioned using proper nouns, pronouns or noun phrases with common noun headword. We use Markov Logic Network (MLN) to encode the linguistic knowledge for identification of aliases. We evaluate on four diverse history narratives of varying complexity. Our approach performs better than the state-of-the-art approach as well as a combination of standard named entity recognition and coreference resolution techniques.
Treat us like the sequences we are: Prepositional Paraphrasing of Noun Compounds using LSTM
Girishkumar Ponkiya | Kevin Patel | Pushpak Bhattacharyya | Girish Palshikar
Proceedings of the 27th International Conference on Computational Linguistics
Interpreting noun compounds is a challenging task. It involves uncovering the underlying predicate which is dropped in the formation of the compound. In most cases, this predicate is of the form VERB+PREP. It has been observed that uncovering the preposition is a significant step towards uncovering the predicate. In this paper, we attempt to paraphrase noun compounds using prepositions. We consider noun compounds and their corresponding prepositional paraphrases as parallelly aligned sequences of words. This enables us to adapt different architectures from cross-lingual embedding literature. We choose the architecture where we create representations of both noun compound (source sequence) and its corresponding prepositional paraphrase (target sequence), such that their similarity is high. We use LSTMs to learn these representations. We use these representations to decide the correct preposition. Our experiments show that this approach performs considerably well on different datasets of noun compounds that are manually annotated with prepositions.
Towards a Standardized Dataset for Noun Compound Interpretation
Girishkumar Ponkiya | Kevin Patel | Pushpak Bhattacharyya | Girish K Palshikar
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2017
End-to-end Relation Extraction using Neural Networks and Markov Logic Networks
Sachin Pawar | Pushpak Bhattacharyya | Girish Palshikar
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers
End-to-end relation extraction refers to identifying boundaries of entity mentions, entity types of these mentions and appropriate semantic relation for each pair of mentions. Traditionally, separate predictive models were trained for each of these tasks and were used in a “pipeline” fashion where output of one model is fed as input to another. But it was observed that addressing some of these tasks jointly results in better performance. We propose a single, joint neural network based model to carry out all the three tasks of boundary identification, entity type classification and relation type classification. This model is referred to as “All Word Pairs” model (AWP-NN) as it assigns an appropriate label to each word pair in a given sentence for performing end-to-end relation extraction. We also propose to refine output of the AWP-NN model by using inference in Markov Logic Networks (MLN) so that additional domain knowledge can be effectively incorporated. We demonstrate effectiveness of our approach by achieving better end-to-end relation extraction performance than all 4 previous joint modelling approaches, on the standard dataset of ACE 2004.
Measuring Topic Coherence through Optimal Word Buckets
Nitin Ramrakhiyani | Sachin Pawar | Swapnil Hingmire | Girish Palshikar
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers
Measuring topic quality is essential for scoring the learned topics and their subsequent use in Information Retrieval and Text classification. To measure quality of Latent Dirichlet Allocation (LDA) based topics learned from text, we propose a novel approach based on grouping of topic words into buckets (TBuckets). A single large bucket signifies a single coherent theme, in turn indicating high topic coherence. TBuckets uses word embeddings of topic words and employs singular value decomposition (SVD) and Integer Linear Programming based optimization to create coherent word buckets. TBuckets outperforms the state-of-the-art techniques when evaluated using 3 publicly available datasets and on another one proposed in this paper.
Event Timeline Generation from History Textbooks
Harsimran Bedi | Sangameshwar Patil | Swapnil Hingmire | Girish Palshikar
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)
An event timeline serves as the basic structure of history, and it is used as a disposition of key phenomena in studying history as a subject in secondary school. In order to enable a student to understand a historical phenomenon as a series of connected events, we present a system for automatic event timeline generation from history textbooks. Additionally, we propose Message Sequence Chart (MSC) and time-map based visualization techniques to visualize an event timeline. We also identify key computational challenges in developing natural language processing based applications for history textbooks.
Experiments with Domain Dependent Dialogue Act Classification using Open-Domain Dialogue Corpora
Swapnil Hingmire | Apoorv Shrivastava | Girish Palshikar | Saurabh Srivastava
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)
2016
Learning to Identify Subjective Sentences
Girish K. Palshikar | Manoj Apte | Deepak Pandita | Vikram Singh
Proceedings of the 13th International Conference on Natural Language Processing
On Why Coarse Class Classification is Bottleneck in Noun Compound Interpretation
Girishkumar Ponkiya | Pushpak Bhattacharyya | Girish K. Palshikar
Proceedings of the 13th International Conference on Natural Language Processing
2015
Noun Phrase Chunking for Marathi using Distant Supervision
Sachin Pawar | Nitin Ramrakhiyani | Girish K. Palshikar | Pushpak Bhattacharyya | Swapnil Hingmire
Proceedings of the 12th International Conference on Natural Language Processing
2014
LMSim : Computing Domain-specific Semantic Word Similarities Using a Language Modeling Approach
Sachin Pawar | Swapnil Hingmire | Girish K. Palshikar
Proceedings of the 11th International Conference on Natural Language Processing
2013
Named Entity Extraction using Information Distance
Sangameshwar Patil | Sachin Pawar | Girish Palshikar
Proceedings of the Sixth International Joint Conference on Natural Language Processing