Hong-Jie Dai

2020

In this paper, we described our systems for the first and second subtasks of Social Media Mining for Health Applications (SMM4H) shared task in 2020. The two subtasks are automatic classi-fication of medication mentions and adverse effect in tweets. Our systems for both subtasks are based on Robustly optimized BERT approach (RoBERTa) and our previous work at SMM4H’19. The best F1-scores achieved by our systems for subtask 1 and 2 were 0.7974 and 0.64 respec-tively, which outperformed the average F1-scores among all teams’ best runs by at least 0.13.

A cancer registry is a critical and massive database for which various types of domain knowledge are needed and whose maintenance requires labor-intensive data curation. In order to facilitate the curation process for building a high-quality and integrated cancer registry database, we compiled a cross-hospital corpus and applied neural network methods to develop a natural language processing system for extracting cancer registry variables buried in unstructured pathology reports. The performance of the developed networks was compared with various baselines using standard micro-precision, recall and F-measure. Furthermore, we conducted experiments to study the feasibility of applying transfer learning to rapidly develop a well-performing system for processing reports from different sources that might be presented in different writing styles and formats. The results demonstrate that the transfer learning method enables us to develop a satisfactory system for a new hospital with only a few annotations and suggest more opportunities to reduce the burden of cancer registry curation.

2019

pdf abs
BIGODM System in the Social Media Mining for Health Applications Shared Task 2019
Chen-Kai Wang | Hong-Jie Dai | Bo-Hung Wang
Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task

In this study, we describe our methods to automatically classify Twitter posts conveying events of adverse drug reaction (ADR). Based on our previous experience in tackling the ADR classification task, we empirically applied the vote-based under-sampling ensemble approach along with linear support vector machine (SVM) to develop our classifiers as part of our participation in ACL 2019 Social Media Mining for Health Applications (SMM4H) shared task 1. The best-performed model on the test sets were trained on a merged corpus consisting of the datasets released by SMM4H 2017 and 2019. By using VUE, the corpus was randomly under-sampled with 2:1 ratio between the negative and positive classes to create an ensemble using the linear kernel trained with features including bag-of-word, domain knowledge, negation and word embedding. The best performing model achieved an F-measure of 0.551 which is about 5% higher than the average F-scores of 16 teams.

2017

pdf bib
Proceedings of the International Workshop on Digital Disease Detection using Social Media 2017 (DDDSM-2017)
Jitendra Jonnagaddala | Hong-Jie Dai | Yung-Chun Chang
Proceedings of the International Workshop on Digital Disease Detection using Social Media 2017 (DDDSM-2017)

The increasing popularity of social media lead users to share enormous information on the internet. This information has various application like, it can be used to develop models to understand or predict user behavior on social media platforms. For example, few online retailers have studied the shopping patterns to predict shopper’s pregnancy stage. Another interesting application is to use the social media platforms to analyze users’ health-related information. In this study, we developed a tree kernel-based model to classify tweets conveying pregnancy related information using this corpus. The developed pregnancy classification model achieved an accuracy of 0.847 and an F-score of 0.565. A new corpus from popular social media platform Twitter was developed for the purpose of this study. In future, we would like to improve this corpus by reducing noise such as retweets.

pdf abs
Using a Recurrent Neural Network Model for Classification of Tweets Conveyed Influenza-related Information
Chen-Kai Wang | Onkar Singh | Zhao-Li Tang | Hong-Jie Dai
Proceedings of the International Workshop on Digital Disease Detection using Social Media 2017 (DDDSM-2017)

Traditional disease surveillance systems depend on outpatient reporting and virological test results released by hospitals. These data have valid and accurate information about emerging outbreaks but it’s often not timely. In recent years the exponential growth of users getting connected to social media provides immense knowledge about epidemics by sharing related information. Social media can now flag more immediate concerns related to out-breaks in real time. In this paper we apply the long short-term memory recurrent neural net-work (RNN) architecture to classify tweets conveyed influenza-related information and compare its performance with baseline algorithms including support vector machine (SVM), decision tree, naive Bayes, simple logistics, and naive Bayes multinomial. The developed RNN model achieved an F-score of 0.845 on the MedWeb task test set, which outperforms the F-score of SVM without applying the synthetic minority oversampling technique by 0.08. The F-score of the RNN model is within 1% of the highest score achieved by SVM with oversampling technique.

Co-authors

Venues

acl1

Hong-Jie Dai

2020

2019

2017

2016

2015

2014

2011

2010

2006

Co-authors

Venues