Yanru Zhang

2022

pdf abs
zydhjh4593@SMM4H’22: A Generic Pre-trained BERT-based Framework for Social Media Health Text Classification
Chenghao Huang | Xiaolu Chen | Yuxi Chen | Yutong Wu | Weimin Yuan | Yan Wang | Yanru Zhang
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task

This paper describes our proposed framework for the 10 text classification tasks of Task 1a, 2a, 2b, 3a, 4, 5, 6, 7, 8, and 9, in the Social Media Mining for Health (SMM4H) 2022. According to the pre-trained BERT-based models, various techniques, including regularized dropout, focal loss, exponential moving average, 5-fold cross-validation, ensemble prediction, and pseudo-labeling, are applied for further formulating and improving the generalization performance of our framework. In the evaluation, the proposed framework achieves the 1st place in Task 3a with a 7% higher F1-score than the median, and obtains a 4% higher averaged F1-score than the median in all participating tasks except Task 1a.

pdf abs
yiriyou@SMM4H’22: Stance and Premise Classification in Domain Specific Tweets with Dual-View Attention Neural Networks
Huabin Yang | Zhongjian Zhang | Yanru Zhang
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task

The paper introduces the methodology proposed for the shared Task 2 of the Social Media Mining for Health Application (SMM4H) in 2022. Task 2 consists of two subtasks: Stance Detection and Premise Classification, named Subtask 2a and Subtask 2b, respectively. Our proposed system is based on dual-view attention neural networks and achieves an F1 score of 0.618 for Subtask 2a (0.068 more than the median) and an F1 score of 0.630 for Subtask 2b (0.017 less than the median). Further experiments show that the domain-specific pre-trained model, cross-validation, and pseudo-label techniques contribute to the improvement of system performance.

pdf abs
uestcc@SMM4H’22: RoBERTa based Adverse Drug Events Classification on Tweets
Chunchen Wei | Ran Bi | Yanru Zhang
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task

This is a description of our participation in the ADE Mining in English Tweets shared task, organized by the Social Media Mining for Health SMM4H 2022 workshop. We participate in the subtask a of shared Task 1, and the paper introduces the system we developed for solving the task. The task requires classifying the given tweets by whether they mention the Adverse Drug Effects. We utilize RoBERTa model and apply several methods during training and finetuning period. We also try to improve the performance of our system by preprocessing the dataset but improve the precision only. The results of our system on test set are 0.601 in F1- score, 0.705 in precision, and 0.524 in recall.

pdf abs
Zhegu@SMM4H-2022: The Pre-training Tweet & Claim Matching Makes Your Prediction Better
Pan He | Chen YuZe | Yanru Zhang
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task

SMM4H-2022 (CITATION) Task 2 is to detect whether containing premise in the tweets of users about COVID-19 on the social medias or their stances for the claims. In this paper, we propose Tweet Claim Matching (TCM), which is a new pre-training task constructed by the tweets and claims similarly to Next Sentence Prediction (NSP). We first continue to pre-train the standard pre-trained language models on the labelled dataset and then fine-tune them for obtaining better performance. Compared with the solid baseline (CITATION), we achieve the absolute improvement of 7.9% in Task 2a and obtain the SOTA results.

pdf abs
Yet@SMM4H’22: Improved BERT-based classification models with Rdrop and PolyLoss
Yan Zhuang | Yanru Zhang
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task

This paper describes our approach for 11 classification tasks (Task1a, Task2a, Task2b, Task3a, Task3b, Task4, Task5, Task6, Task7, Task8 and Task9) from Social Media Mining for Health (SMM4H) 2022 Shared Tasks. We developed a classification model that incorporated Rdrop to augment data and avoid overfitting, Poly Loss and Focal Loss to alleviate sample imbalance, and pseudo labels to improve model performance. The results of our submissions are over or equal to the median scores in almost all tasks. In addition, our model achieved the highest score in Task4, with a higher 7.8% and 5.3% F1-score than the median scores in Task2b and Task3a respectively.

2020

pdf abs
Ferryman at SemEval-2020 Task 3: Bert with TFIDF-Weighting for Predicting the Effect of Context in Word Similarity
Weilong Chen | Xin Yuan | Sai Zhang | Jiehui Wu | Yanru Zhang | Yan Wang
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Word similarity is widely used in machine learning applications like searching engine and recommendation. Measuring the changing meaning of the same word between two different sentences is not only a way to handle complex features in word usage (such as sentence syntax and semantics), but also an important method for different word polysemy modeling. In this paper, we present the methodology proposed by team Ferryman. Our system is based on the Bidirectional Encoder Representations from Transformers (BERT) model combined with term frequency-inverse document frequency (TF-IDF), applying the method on the provided datasets called CoSimLex, which covers four different languages including English, Croatian, Slovene, and Finnish. Our team Ferryman wins the the first position for English task and the second position for Finnish in the subtask 1.

The main purpose of this article is to state the effect of using different methods and models for counterfactual determination and detection of causal knowledge. Nowadays, counterfactual reasoning has been widely used in various fields. In the realm of natural language process(NLP), counterfactual reasoning has huge potential to improve the correctness of a sentence. In the shared Task 5 of detecting counterfactual in SemEval 2020, we pre-process the officially given dataset according to case conversion, extract stem and abbreviation replacement. We use last-5 bidirectional encoder representation from bidirectional encoder representation from transformer (BERT)and term frequency–inverse document frequency (TF-IDF) vectorizer for counterfactual detection. Meanwhile, multi-sample dropout and cross validation are used to improve versatility and prevent problems such as poor generosity caused by overfitting. Finally, our team Ferryman ranked the 8th place in the sub-task 1 of this competition.

Natural language processing (NLP) has been applied to various fields including text classification and sentiment analysis. In the shared task of assessing the funniness of edited news headlines, which is a part of the SemEval 2020 competition, we preprocess datasets by replacing abbreviation, stemming words, then merge three models including Light Gradient Boosting Machine (LightGBM), Long Short-Term Memory (LSTM), and Bidirectional Encoder Representation from Transformer (BERT) by taking the average to perform the best. Our team Ferryman wins the 9th place in Sub-task 1 of Task 7 - Regression.

pdf abs
Ferryman at SemEval-2020 Task 12: BERT-Based Model with Advanced Improvement Methods for Multilingual Offensive Language Identification
Weilong Chen | Peng Wang | Jipeng Li | Yuanshuai Zheng | Yan Wang | Yanru Zhang
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Indiscriminately posting offensive remarks on social media may promote the occurrence of negative events such as violence, crime, and hatred. This paper examines different approaches and models for solving offensive tweet classification, which is a part of the OffensEval 2020 competition. The dataset is Offensive Language Identification Dataset (OLID), which draws 14,200 annotated English Tweet comments. The main challenge of data preprocessing is the unbalanced class distribution, abbreviation, and emoji. To overcome these issues, methods such as hashtag segmentation, abbreviation replacement, and emoji replacement have been adopted for data preprocessing approaches. The main task can be divided into three sub-tasks, and are solved by Term Frequency–Inverse Document Frequency(TF-IDF), Bidirectional Encoder Representation from Transformer (BERT), and Multi-dropout respectively. Meanwhile, we applied different learning rates for different languages and tasks based on BERT and non-BERTmodels in order to obtain better results. Our team Ferryman ranked the 18th, 8th, and 21st with F1-score of 0.91152 on the English Sub-task A, Sub-task B, and Sub-task C, respectively. Furthermore, our team also ranked in the top 20 on the Sub-task A of other languages.