Dhrubajyoti Pathak

2026

Vrittanta-AS: Dataset Development and Benchmarking for Event Trigger Detection and Classification in Assamese
Chaitanya Kirti | Dhrubajyoti Pathak | Ashish Anand | Prithwijit Guha
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Event trigger detection and classification aim to identify and categorize events within unstructured text. While prior research has primarily focused on news or biomedical corpora, the literary domain, especially short stories, remains largely underexplored. This gap is particularly pronounced for low-resource languages such as Assamese, where limited annotated data and complex narrative structures hinder progress. To address this challenge, we introduce Vrittanta-AS, a manually curated Assamese event trigger detection and classification dataset comprising 13,171 annotated events extracted from short stories. The dataset is designed to advance research in information extraction and narrative understanding for low-resource Indian languages. We conduct a comprehensive evaluation using classical machine learning methods, neural sequential architectures, pre-trained transformer models, and large language models (LLMs) on the proposed dataset. Experimental results demonstrate that IndicBERT v2 achieves the highest performance for both event trigger detection (85.86% micro-F1) and classification (65.21% macro-F1). Vrittanta-AS serves as an important step toward developing benchmark resources for event trigger detection and classification in Assamese literary text.

2025

AsRED: Development and Evaluation of an Assamese Reduplication Dataset
Pankaj Choudhury | Chaitanya Kirti | Dhrubajyoti Pathak | Sukumar Nandi
Proceedings of the 39th Pacific Asia Conference on Language, Information and Computation

2024

Evaluating Performance of Pre-trained Word Embeddings on Assamese, a Low-resource Language
Dhrubajyoti Pathak | Sukumar Nandi | Priyankoo Sarmah
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Word embeddings and language models are the building blocks of modern deep neural network-based natural language processing. They have been extensively explored in high-resource languages and provide state-of-the-art (SOTA) performance for a wide range of downstream tasks. Nevertheless, these word embeddings remain unexplored in languages such as Assamese, where resources are limited. Furthermore, few studies have evaluated the performance of these word embeddings on downstream tasks for low-resource languages. In this research, we explore the current state of Assamese pre-trained word embeddings. We evaluate these embeddings’ performance on sequence labeling tasks such as part-of-speech tagging and named entity recognition. To assess the efficiency of the embeddings, experiments are performed using both ensemble and individual word embedding approaches. The ensembling approach that uses three word embeddings outperforms the others. The paper describes the outcomes of these investigations. The results of this comparative performance evaluation may assist researchers in choosing an Assamese pre-trained word embedding for subsequent tasks.

2022

AsNER - Annotated Dataset and Baseline for Assamese Named Entity recognition
Dhrubajyoti Pathak | Sukumar Nandi | Priyankoo Sarmah
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present AsNER, a named entity annotation dataset for the low-resource Assamese language, together with a baseline Assamese NER model. The dataset contains about 99k tokens of text drawn from the speech of the Prime Minister of India and an Assamese play. It also contains person names, location names and addresses. The proposed NER dataset is likely to be a significant resource for deep neural network-based Assamese language processing. We benchmark the dataset by training NER models and evaluating them using state-of-the-art architectures for supervised named entity recognition (NER) such as Fasttext, BERT, XLM-R, FLAIR, and MuRIL, among others. We implement several baseline approaches with the state-of-the-art sequence tagging Bi-LSTM-CRF architecture. The best baseline achieves an F1-score of 80.69% when using MuRIL as the word embedding method. The annotated dataset and the top-performing model are made publicly available.
