Arpita Roy


Incorporating medical knowledge in BERT for clinical relation extraction
Arpita Roy | Shimei Pan
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

In recent years, pre-trained language models (PLMs) such as BERT have proven to be very effective in diverse NLP tasks such as Information Extraction, Sentiment Analysis and Question Answering. Trained with massive general-domain text, these pre-trained language models capture rich syntactic, semantic and discourse information in the text. However, due to the differences between general and specific domain text (e.g., Wikipedia versus clinical notes), these models may not be ideal for domain-specific tasks (e.g., extracting clinical relations). Furthermore, additional medical knowledge may be required to understand clinical text properly. To address these issues, we conduct a comprehensive examination of different techniques for adding medical knowledge to a pre-trained BERT model for clinical relation extraction. Our best model outperforms the state-of-the-art systems on the benchmark i2b2/VA 2010 clinical relation extraction dataset.


Supervising Unsupervised Open Information Extraction Models
Arpita Roy | Youngja Park | Taesung Lee | Shimei Pan
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We propose a novel supervised open information extraction (Open IE) framework that leverages an ensemble of unsupervised Open IE systems and a small amount of labeled data to improve system performance. It takes as input features the outputs of multiple unsupervised Open IE systems plus a diverse set of lexical and syntactic information, such as word embeddings, part-of-speech embeddings, syntactic role embeddings and dependency structure, and produces a sequence of word labels indicating whether each word belongs to a relation, belongs to an argument of the relation, or is irrelevant. Compared with existing supervised Open IE systems, our approach leverages the knowledge in existing unsupervised Open IE systems to overcome the problem of insufficient training data. By employing multiple unsupervised Open IE systems, our system learns to combine the strengths and avoid the weaknesses of each individual Open IE system. We have conducted experiments on multiple labeled benchmark data sets. Our evaluation results demonstrate that the proposed method outperforms existing supervised and unsupervised models by a significant margin.

Predicting Malware Attributes from Cybersecurity Texts
Arpita Roy | Youngja Park | Shimei Pan
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Text analytics is a useful tool for studying malware behavior and tracking emerging threats. The task of automated malware attribute identification based on cybersecurity texts is very challenging due to the large number of malware attribute labels and the small number of training instances. In this paper, we propose a novel feature learning method that leverages diverse knowledge sources such as a small amount of human annotations, unlabeled text and specifications of malware attribute labels. Our evaluation demonstrates the effectiveness of our method over the state-of-the-art malware attribute prediction systems.


UMBC at SemEval-2018 Task 8: Understanding Text about Malware
Ankur Padia | Arpita Roy | Taneeya Satyapanich | Francis Ferraro | Shimei Pan | Youngja Park | Anupam Joshi | Tim Finin
Proceedings of the 12th International Workshop on Semantic Evaluation

We describe the systems developed by the UMBC team for 2018 SemEval Task 8, SecureNLP (Semantic Extraction from CybersecUrity REports using Natural Language Processing). We participated in three of the sub-tasks: (1) classifying sentences as being relevant or irrelevant to malware, (2) predicting token labels for sentences, and (4) predicting attribute labels from the Malware Attribute Enumeration and Characterization vocabulary for defining malware characteristics. We achieve F1 scores of 50.34/18.0 (dev/test), 22.23 (test-data), and 31.98 (test-data) for Task 1, Task 2 and Task 4, respectively. We also make our cybersecurity embeddings publicly available at