Ashish Anand


2026

Event trigger detection and classification aim to identify and categorize events within unstructured text. While prior research has primarily focused on news or biomedical corpora, the literary domain, especially short stories, remains largely underexplored. This gap is particularly pronounced for low-resource languages such as Assamese, where limited annotated data and complex narrative structures hinder progress. To address this challenge, we introduce Vrittanta-AS, a manually curated Assamese event trigger detection and classification dataset comprising 13,171 annotated events extracted from short stories. The dataset is designed to advance research in information extraction and narrative understanding for low-resource Indian languages. We conduct a comprehensive evaluation using classical machine learning methods, neural sequential architectures, pre-trained transformer models, and large language models (LLMs) on the proposed dataset. Experimental results demonstrate that IndicBERT v2 achieves the highest performance for both event trigger detection (85.86% micro-F1) and classification (65.21% macro-F1). Vrittanta-AS serves as an important step toward developing benchmark resources for event trigger detection and classification in Assamese literary text.
Event trigger detection and classification involve identifying meaningful occurrences in narrative text and categorizing them into predefined event types. Despite extensive research on English event extraction in factual domains such as news and biomedical text, narrative prose, such as short stories, has received comparatively little attention. To bridge this gap, we introduce Vrittanta-EN, a manually annotated English corpus comprising 11,272 event instances extracted from diverse short stories. The dataset captures a wide range of communicative, cognitive, and physical actions typical of narrative discourse. A comprehensive evaluation is conducted across a wide range of models, including classical machine learning baselines (SVM, Naive Bayes), neural sequential models (LSTM, BiLSTM, BiLSTM-CRF), encoder-only transformers (BERT, RoBERTa, ALBERT, DistilBERT, DeBERTa, ELECTRA), and encoder-decoder models (T5, BART), along with large language models (GPT-4.1, DeepSeek-V3.2-Exp, Claude Sonnet 4) under both zero-shot and five-shot settings. Experimental results show that ELECTRA achieves the highest overall performance for event trigger detection with an F1-score of 90.61%, while RoBERTa performs best for event classification with a macro F1 of 74.71%. These findings highlight the robustness of contextual transformer-based architectures for modeling narrative event structures in English short stories. The dataset, code, and annotation guidelines will be publicly released upon paper acceptance.
Named entity recognition (NER), particularly fine-grained NER (FgNER), extracts domain-specific entity information for Natural Language Processing (NLP) applications such as knowledge base construction and relation extraction. While manual annotation for creating the relevant data is expensive, distant supervision often produces noisy data. Moreover, resources for coarse-grained and fine-grained NER in Indian languages, particularly in the vulnerable languages of India’s North Eastern Region, remain scarce. This work creates such a resource for three vulnerable languages: <i>Bodo/Boro (brx)</i>, <i>Manipuri/Meitei (mni)</i>, and <i>Mizo/Lushai (lus)</i>, which are official languages in three Indian states and are spoken by more than six million people across five countries in South and Southeast Asia. We use annotation projection from high-resource FgNER datasets, relying on source-to-target parallel corpora and a projection tool built on a multilingual encoder. The dataset comprises over 198k sentences, 282k entities, and 2.8M tokens in each low-resource language. Our thorough analyses validate the dataset’s high quality. We further explore zero-shot and cross-lingual settings, examining the impact of script similarity and multilingualism on cross-lingual FgNER performance. The dataset, expert detector models, the agentic tool, and the interactive web application are available as open-source resources at: <url>https://hf.co/collections/prachuryyaIITG/finerviner</url>.
We present APTFiNER, a novel fine-grained named entity recognition (FgNER) dataset covering six low-resource Indian languages spoken by over 400 million people across several nations. While creating FgNER resources through manual annotation is typically expensive and labor-intensive, distant supervision has emerged as a workable alternative. Yet such FgNER datasets are often noisy, as each entity mention is frequently assigned multiple entity types, which necessitates computationally demanding noise-aware models. Furthermore, resources for both coarse-grained and fine-grained NER remain scarce for low-resource languages. To overcome this scarcity, we leveraged the strong reasoning and translation capabilities of Gemini through our proposed annotation-preserving translation method and created a large-scale FgNER dataset comprising over 411 thousand sentences, 697 thousand entity mentions, and 5.8 million tokens in total. We translated the MultiCoNER2 English FgNER dataset into the target languages: <i>Assamese (as)</i>, <i>Marathi (mr)</i>, <i>Nepali (ne)</i>, <i>Tamil (ta)</i>, <i>Telugu (te)</i>, and a vulnerable language, <i>Bodo (brx)</i>. Rigorous analyses and human evaluations confirm the effectiveness of our method and the high quality of the resulting dataset, with F1 score improvements of 8% in both Tamil and Telugu and 25% in Marathi over the current state-of-the-art. The dataset, expert detector models, the agentic tool, and the interactive web application are available as open-source resources at: <url>https://hf.co/collections/prachuryyaIITG/aptfiner</url>.
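Although the annotation-preserving translation method itself is not detailed in this abstract, one common way to carry entity annotations through a machine translator is to wrap tagged spans in inline markers before translation and recover them afterwards. The sketch below illustrates that idea; the marker syntax and BIO handling are illustrative assumptions, not the paper's exact scheme.

```python
import re

def wrap_entities(tokens, tags):
    """Wrap BIO-tagged spans in inline markers so entity boundaries and
    types can survive machine translation (marker syntax is illustrative)."""
    out, i = [], 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            label = tags[i][2:]
            span = [tokens[i]]
            i += 1
            while i < len(tokens) and tags[i] == "I-" + label:
                span.append(tokens[i])
                i += 1
            out.append(f"<{label}>{' '.join(span)}</{label}>")
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

def unwrap_entities(text):
    """Recover tokens and BIO tags from a marker-wrapped (translated) sentence."""
    tokens, tags = [], []
    for m in re.finditer(r"<(\w+)>(.*?)</\1>|(\S+)", text):
        if m.group(3):                       # plain token outside any marker
            tokens.append(m.group(3))
            tags.append("O")
        else:                                # marked span -> B-/I- sequence
            words = m.group(2).split()
            tokens.extend(words)
            tags.extend(["B-" + m.group(1)] + ["I-" + m.group(1)] * (len(words) - 1))
    return tokens, tags

wrapped = wrap_entities(["Sachin", "Tendulkar", "plays", "cricket"],
                        ["B-PER", "I-PER", "O", "O"])
print(wrapped)   # <PER>Sachin Tendulkar</PER> plays cricket
print(unwrap_entities(wrapped))
```

If the translator preserves the markers verbatim, the round trip yields aligned tags in the target language; in practice an LLM translator would be instructed to keep the markers intact.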
Relational Argument Mining (RAM) is a key task in computational argumentation that aims to classify relationships, such as Support or Attack, between pairs of argument components (ACs). Traditional approaches primarily rely on graph-based modelling with external knowledge sources, which adds considerable complexity. These approaches also struggle on RAM datasets with imbalanced relation classes, as they are not designed for class-imbalanced scenarios. In this work, we propose the CIARAM framework, which reformulates RAM as a text-to-text generation problem that produces relational labels in a flattened text format. To address the class imbalance, we employ a data augmentation strategy using a decoder-only Large Language Model (LLM) to balance the underrepresented relation classes. Across five standard RAM benchmarks, CIARAM produces strong results, particularly with the billion-parameter model, yielding a substantial performance gain over the latest baseline and demonstrating the strong potential of our approach.

2025

We introduce CLASSER, a cross-lingual annotation projection framework enhanced through script similarity, to create fine-grained named entity recognition (FgNER) datasets for low-resource languages. Manual annotation for named entity recognition (NER) is expensive, while distantly supervised data are often noisy and largely limited to high-resource languages. CLASSER employs a two-stage process: it first projects annotations from high-resource NER datasets to the target language using source-to-target parallel corpora and a projection tool built on a multilingual encoder, and then refines them by leveraging datasets in script-similar languages. We apply this to five low-resource Indian languages: Assamese, Marathi, Nepali, Sanskrit, and Bodo, a vulnerable language. The resulting dataset comprises 1.8M sentences, 2.6M entity mentions, and 24.7M tokens. Through rigorous analyses, the effectiveness of our method and the high quality of the resulting dataset are ascertained, with F1 score improvements of 26% in Marathi and 46% in Sanskrit over the current state-of-the-art. We further extend our analyses to zero-shot and cross-lingual settings, systematically investigating the impact of script similarity and multilingualism on cross-lingual FgNER performance. The dataset is publicly available at huggingface.co/datasets/prachuryyaIITG/CLASSER.
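As a rough illustration of the first projection stage, the sketch below maps BIO tags from a source sentence onto a target sentence via word alignments. The alignment format and tag scheme are assumptions for illustration; CLASSER's actual tool derives alignments from a multilingual encoder.

```python
def project_tags(src_tags, alignments, tgt_len):
    """Project BIO tags from source to target tokens via word alignments.

    alignments: list of (src_idx, tgt_idx) pairs, e.g. produced by a
    multilingual-encoder word aligner (hypothetical input format)."""
    tgt_tags = ["O"] * tgt_len
    for src_idx, tgt_idx in sorted(alignments, key=lambda p: p[1]):
        tag = src_tags[src_idx]
        if tag == "O":
            continue
        label = tag.split("-", 1)[1]
        # Continue an entity span if the previous target token has the same label
        if tgt_idx > 0 and tgt_tags[tgt_idx - 1].endswith("-" + label):
            tgt_tags[tgt_idx] = "I-" + label
        else:
            tgt_tags[tgt_idx] = "B-" + label
    return tgt_tags

# Source "Barack Obama visited Delhi" aligned to a 4-token target sentence
# in which the last two words are reordered.
src_tags = ["B-PER", "I-PER", "O", "B-LOC"]
alignments = [(0, 0), (1, 1), (2, 3), (3, 2)]
print(project_tags(src_tags, alignments, 4))  # ['B-PER', 'I-PER', 'B-LOC', 'O']
```

The second refinement stage would then correct residual errors in these projected tags, e.g. by consulting models trained on script-similar languages.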
Argument mining (AM) focuses on analyzing argumentative structures such as Argument Components (ACs) and Argumentative Relations (ARs). Modeling dependencies between ACs and ARs is challenging due to the complex interactions between ACs. Existing approaches often overlook crucial conceptual links, such as key phrases that connect two related ACs, and tend to rely on Cartesian product methods to model these dependencies, which can result in class imbalances. To extract key phrases from the AM benchmarks, we employ a prompt-based strategy utilizing an open-source Large Language Model (LLM). Building on this, we propose a unified text-to-text generation framework that leverages Augmented Natural Language (ANL) formatting and embeds the extracted key phrases within the ANL itself to efficiently solve multiple AM tasks in a joint formulation. Our method sets a new State-of-the-Art (SoTA) on three structurally distinct standard AM benchmarks, surpassing baselines by up to 9.5% F1 score, demonstrating its strong potential.
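A minimal sketch of the flattened target string such a text-to-text formulation might generate is shown below. The bracket markup and field names are illustrative assumptions, not the paper's exact ANL scheme.

```python
def to_anl(components, relations, key_phrases):
    """Linearize argument components (ACs), their key phrases, and
    argumentative relations into one flattened target string that a
    text-to-text model can be trained to generate."""
    parts = []
    for idx, (text, ac_type) in enumerate(components):
        phrase = key_phrases.get(idx, "")
        parts.append(f"[AC{idx} | {ac_type} | key: {phrase} | {text}]")
    for head, tail, rel in relations:
        parts.append(f"[AC{head} {rel} AC{tail}]")
    return " ".join(parts)

components = [("Homework improves retention", "Claim"),
              ("Studies show practice aids memory", "Premise")]
relations = [(1, 0, "supports")]
key_phrases = {0: "improves retention", 1: "practice aids memory"}
print(to_anl(components, relations, key_phrases))
```

Because components, key phrases, and relations share one output string, a single generation pass can address AC classification and AR prediction jointly.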

2023

This paper presents annotated corpora of Assamese and English short stories for event trigger detection. This marks a pioneering endeavor for the short-story genre, contributing resources especially for the low-resource Assamese language. In the process, 200 short stories were manually annotated in both Assamese and English. The dataset was evaluated, and several models were compared for predicting events that actually happen, i.e., realis events. However, developing manually annotated language resources is expensive, especially when the text requires specialist knowledge to interpret. In this regard, TagIT, an automated event annotation tool, is introduced. TagIT is designed to facilitate our objective of expanding the dataset from 200 to 1,000 stories. The best-performing model was employed in TagIT to automate the event annotation process. Extensive experiments were conducted to evaluate the quality of the expanded dataset. This study further illustrates how combining an automatic annotation tool with human-in-the-loop participation significantly reduces the time needed to generate a high-quality dataset.

2020

We introduce a generic, human-out-of-the-loop pipeline, ERLKG, to perform rapid association analysis of any biomedical entity with other existing entities from a corpus of the same domain. Our pipeline consists of a Knowledge Graph (KG) created from the open-source CORD-19 dataset by fully automating the information extraction procedure using SciBERT. The best latent entity representations are then found by benchmarking different KG embedding techniques on the task of link prediction using a Graph Convolution Network Auto-Encoder (GCN-AE). We demonstrate the utility of ERLKG with respect to COVID-19 through multiple qualitative evaluations. Due to the lack of a gold standard, we propose a relatively large intrinsic evaluation dataset for COVID-19 and use it to validate the top two performing KG embedding techniques. We find TransD to be the best-performing KG embedding technique, with Pearson and Spearman correlation scores of 0.4348 and 0.4570, respectively. We demonstrate that a considerable number of ERLKG’s top protein, chemical, and disease predictions are currently under consideration in COVID-19 related research.

2017

The task of relation classification in the biomedical domain is complex due to the presence of samples obtained from heterogeneous sources such as research articles, discharge summaries, and electronic health records. This heterogeneity also constrains classifiers that rely on manual feature engineering. In this paper, we propose a convolutional recurrent neural network (CRNN) architecture that combines RNNs and CNNs in sequence to solve this problem. The rationale behind our approach is that CNNs can effectively identify coarse-grained local features in a sentence, while RNNs are better suited for long-term dependencies. We compare our CRNN model with several baselines on two biomedical datasets, namely the i2b2-2010 clinical relation extraction challenge dataset and the SemEval-2013 DDI extraction dataset. We also evaluate an attentive pooling technique and report its performance in comparison with conventional max pooling. Our results indicate that the proposed model achieves state-of-the-art performance on both datasets.
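To make the CNN/RNN sequencing concrete, the toy sketch below runs a single convolutional filter over token embeddings and then a one-unit recurrent pass over the resulting features. All weights, dimensions, and the particular ordering (convolution first) are illustrative assumptions; the actual CRNN is far larger and trained end-to-end.

```python
import math

def conv1d(embeds, kernel, width=3):
    """Slide one width-3 filter over the token embeddings to extract
    coarse-grained local features (a real model uses many filters)."""
    feats = []
    for i in range(len(embeds) - width + 1):
        window = [v for tok in embeds[i:i + width] for v in tok]
        feats.append(math.tanh(sum(w * x for w, x in zip(kernel, window))))
    return feats

def rnn_last_hidden(seq, w_in, w_rec):
    """Run a single tanh recurrent unit over the conv features; the final
    hidden state summarizes longer-range dependencies across the sentence."""
    h = 0.0
    for x in seq:
        h = math.tanh(w_in * x + w_rec * h)
    return h

# Toy sentence of 5 tokens with 2-dim embeddings (all values illustrative).
embeds = [[0.1, 0.2], [0.0, 0.3], [0.5, 0.1], [0.2, 0.2], [0.4, 0.0]]
kernel = [0.3, -0.1, 0.2, 0.4, 0.1, -0.2]  # width 3 * dim 2 = 6 weights
conv_feats = conv1d(embeds, kernel)        # 5 - 3 + 1 = 3 local features
sentence_repr = rnn_last_hidden(conv_feats, w_in=0.8, w_rec=0.5)
print(len(conv_feats), sentence_repr)
```

A classifier layer over such sentence representations would then predict the relation label; attentive pooling, as evaluated in the paper, replaces taking only the last hidden state.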
Fine-grained entity type classification (FETC) is the task of classifying an entity mention into a broad set of types. The distant supervision paradigm is extensively used to generate training data for this task. However, the generated training data assign the same set of labels to every mention of an entity without considering its local context. Existing FETC systems have two major drawbacks: they assume the training data to be noise-free, and they use hand-crafted features. Our work overcomes both drawbacks. We propose a neural network model that jointly learns representations of entity mentions and their context, eliminating the use of hand-crafted features. Our model treats the training data as noisy and uses a non-parametric variant of the hinge loss function. Experiments show that the proposed model outperforms previous state-of-the-art methods on two publicly available datasets, namely FIGER (GOLD) and BBN, with an average relative improvement of 2.69% in micro-F1 score. Knowledge learnt by our model on one dataset can be transferred to other datasets, using either the same model or other FETC systems; this knowledge transfer further improves the performance of the respective models.
Biomedical events describe complex interactions between various biomedical entities. An event trigger is a word or phrase that typically signifies the occurrence of an event. Event trigger identification is an important first step in all event extraction methods. However, many current approaches either rely on complex hand-crafted features or consider features only within a fixed window. In this paper, we propose a method that takes advantage of a recurrent neural network (RNN) to extract higher-level features present across the sentence. The hidden-state representations of the RNN, together with word and entity-type embeddings as features, thus avoid reliance on complex hand-crafted features generated using various NLP toolkits. Our experiments achieve a state-of-the-art F1-score on the Multi Level Event Extraction (MLEE) corpus. We also perform a category-wise analysis of the results and discuss the importance of various features in the trigger identification task.
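The input representation described above can be sketched as a per-token concatenation of a word embedding with an entity-type embedding; the lookup tables, dimensions, and example values below are illustrative assumptions.

```python
def token_features(word_emb, type_emb, tokens, entity_types):
    """Build per-token feature vectors by concatenating a word embedding
    with an entity-type embedding; unknown tokens or types fall back to
    zero vectors. The RNN then consumes these vectors to produce
    sentence-wide hidden-state features."""
    zero_w = [0.0] * 3   # word-embedding dimension (illustrative)
    zero_t = [0.0] * 2   # entity-type-embedding dimension (illustrative)
    feats = []
    for tok, etype in zip(tokens, entity_types):
        feats.append(word_emb.get(tok, zero_w) + type_emb.get(etype, zero_t))
    return feats

word_emb = {"MAPK": [0.0, 0.3, 0.5], "phosphorylates": [0.4, 0.1, -0.2]}
type_emb = {"Protein": [1.0, 0.0], "None": [0.0, 1.0]}
feats = token_features(word_emb, type_emb,
                       ["MAPK", "phosphorylates"], ["Protein", "None"])
print([len(f) for f in feats])  # [5, 5]
```

Feeding such vectors to the RNN lets the model see both lexical identity and entity-type context at every position without any toolkit-derived features.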

2016

2015