Arman Kazmi

2022

Recently, fine-tuned transformer-based models (e.g., PubMedBERT, BioBERT) have shown the state-of-the-art performance of a number of BioNLP tasks, such as Named Entity Recognition (NER). However, transformer-based models are complex and have millions of parameters, and, consequently, are relatively slow during inference. In this paper, we address the time complexity limitations of the BioNLP transformer models. In particular, we propose a Multi-Task Learning based framework for jointly learning three different biomedical NER tasks. Our experiments show a reduction in inference time by a factor of three without any reduction in prediction accuracy.

pdf abs
Linguistically Motivated Features for Classifying Shorter Text into Fiction and Non-Fiction Genre
Arman Kazmi | Sidharth Ranjan | Arpit Sharma | Rajakrishnan Rajkumar
Proceedings of the 29th International Conference on Computational Linguistics

This work deploys linguistically motivated features to classify paragraph-level text into fiction and non-fiction genre using a logistic regression model and infers lexical and syntactic properties that distinguish the two genres. Previous works have focused on classifying document-level text into fiction and non-fiction genres, while in this work, we deal with shorter texts which are closer to real-world applications like sentiment analysis of tweets. Going beyond simple POS tag ratios proposed in Qureshi et al.(2019) for document-level classification, we extracted multiple linguistically motivated features belonging to four categories: Lexical features, POS ratio features, Syntactic features and Raw features. For the task of short-text classification, a model containing 28 best-features (selected via Recursive feature elimination with cross-validation; RFECV) confers an accuracy jump of 15.56 % over a baseline model consisting of 2 POS-ratio features found effective in previous work (cited above). The efficacy of the above model containing a linguistically motivated feature set also transfers over to another dataset viz, Baby BNC corpus. We also compared the classification accuracy of the logistic regression model with two deep-learning models. A 1D CNN model gives an increase of 2% accuracy over the logistic Regression classifier on both corpora. And the BERT-base-uncased model gives the best classification accuracy of 97% on Brown corpus and 98% on Baby BNC corpus. Although both the deep learning models give better results in terms of classification accuracy, the problem of interpreting these models remains unsolved. In contrast, regression model coefficients revealed that fiction texts tend to have more character-level diversity and have lower lexical density (quantified using content-function word ratios) compared to non-fiction texts. Moreover, subtle differences in word order exist between the two genres, i.e., in fiction texts Verbs precede Adverbs (inter-alia).

Co-authors

Ashutosh Modi 1

Sidharth Ranjan 1

Arpit Sharma 1

Rajakrishnan Rajkumar 1

Venues

icon1
coling1