Ziyu Yang


2022

LaoPLM: Pre-trained Language Models for Lao
Nankai Lin | Yingwen Fu | Chuwei Chen | Ziyu Yang | Shengyi Jiang
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Trained on large corpora, pre-trained language models (PLMs) can capture different levels of concepts in context and hence generate universal language representations, which benefit multiple downstream natural language processing (NLP) tasks. Although PLMs have been widely used in most NLP applications, especially for high-resource languages such as English, they remain under-explored in Lao NLP research. Previous work on Lao has been hampered by the lack of annotated datasets and the sparsity of language resources. In this work, we construct a text classification dataset to alleviate the resource-scarce situation of the Lao language. In addition, we present the first transformer-based PLMs for Lao in four versions: BERT-Small, BERT-Base, ELECTRA-Small, and ELECTRA-Base. Furthermore, we evaluate them on two downstream tasks: part-of-speech (POS) tagging and text classification. Experiments demonstrate the effectiveness of our Lao models. We release our models and datasets to the community, hoping to facilitate the future development of Lao NLP applications.
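
As a brief illustration of how such released models could be applied to the downstream text classification task, here is a minimal sketch using the Hugging Face transformers API; the model identifier and label count below are hypothetical placeholders, not confirmed release names.

# Minimal sketch: fine-tuning-style loading of a Lao PLM for text classification.
# NOTE: "laoplm/bert-base-lao" is a hypothetical identifier, not a confirmed hub name.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "laoplm/bert-base-lao"  # hypothetical placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=12)  # label count is illustrative

inputs = tokenizer("ຂ່າວກິລາມື້ນີ້", return_tensors="pt", truncation=True)
predicted_class = model(**inputs).logits.argmax(dim=-1).item()
print(predicted_class)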

BERT 4EVER@EvaHan 2022: Ancient Chinese Word Segmentation and Part-of-Speech Tagging Based on Adversarial Learning and Continual Pre-training
Hailin Zhang | Ziyu Yang | Yingwen Fu | Ruoyao Ding
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

With the development of artificial intelligence (AI) and digital humanities, ancient Chinese resources and language technology have also grown, becoming an increasingly important part of the study of historiography and traditional Chinese culture. To promote research on automatic analysis technology for ancient Chinese, we conduct various experiments on the ancient Chinese word segmentation and part-of-speech (POS) tagging tasks of the EvaHan 2022 shared task. We model word segmentation and POS tagging jointly as a single sequence tagging problem. In addition, we apply a series of training strategies on top of the provided ancient Chinese pre-trained model to enhance performance. Concretely, we employ several augmentation strategies, including continual pre-training, adversarial training, and ensemble learning, to alleviate the limited amount of training data and the imbalance between POS labels. Extensive experiments demonstrate that our proposed models achieve considerable performance on ancient Chinese word segmentation and POS tagging.

Keywords: ancient Chinese, word segmentation, part-of-speech tagging, adversarial learning, continual pre-training
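
To make the joint formulation concrete, here is a minimal sketch (under assumed details, not the authors' exact tag scheme) that collapses word segmentation and POS tagging into character-level tags fusing a {B, I} boundary marker with the word's POS label.

# Minimal sketch: joint word segmentation + POS tagging as one
# character-level sequence tagging task with fused B/I-POS tags.
def to_joint_tags(words_with_pos):
    """Convert (word, POS) pairs into per-character joint tags, e.g. B-n / I-n."""
    chars, tags = [], []
    for word, pos in words_with_pos:
        for i, ch in enumerate(word):
            chars.append(ch)
            tags.append(("B-" if i == 0 else "I-") + pos)
    return chars, tags

chars, tags = to_joint_tags([("天下", "n"), ("大", "d"), ("亂", "v")])
# chars: ['天', '下', '大', '亂']   tags: ['B-n', 'I-n', 'B-d', 'B-v']

A single sequence tagger trained on these fused tags recovers both word boundaries and POS labels at decoding time.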

BERT 4EVER@LT-EDI-ACL2022 - Detecting signs of Depression from Social Media: Detecting Depression in Social Media using Prompt-Learning and Word-Emotion Cluster
Xiaotian Lin | Yingwen Fu | Ziyu Yang | Nankai Lin | Shengyi Jiang
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

In this paper, we report the solution of team BERT 4EVER for the LT-EDI-ACL2022 shared task Detecting Signs of Depression from Social Media, which aims to classify YouTube comments into one of three categories: no, moderate, or severe depression. We model the problem both as a text classification task and as a text generation task, and propose a different model for each. To combine the knowledge learned by these two models, we softly fuse their predicted probabilities and select the label with the highest fused probability as the final output. In addition, multiple augmentation strategies, such as back translation and adversarial training, are leveraged to improve the model's generalization capability. Experimental results demonstrate the effectiveness of the proposed models and the two augmentation strategies.
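
The fusion step can be illustrated with a small sketch (assumed details: equal weights and probabilities aligned over the same three labels).

# Minimal sketch of soft fusion of two models' predicted probabilities.
import numpy as np

LABELS = ["no depression", "moderate depression", "severe depression"]

def fuse(p_classifier, p_generator, w=0.5):
    """Softly fuse two predicted distributions and pick the most likely label."""
    p = w * np.asarray(p_classifier) + (1 - w) * np.asarray(p_generator)
    return LABELS[int(p.argmax())]

print(fuse([0.2, 0.5, 0.3], [0.1, 0.3, 0.6]))  # -> "severe depression"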

2021

A Visualization Approach for Rapid Labeling of Clinical Notes for Smoking Status Extraction
Saman Enayati | Ziyu Yang | Benjamin Lu | Slobodan Vucetic
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances

Labeling is typically the most human-intensive step in the development of supervised learning models. In this paper, we propose a simple and easy-to-implement visualization approach that reduces cognitive load and increases the speed of text labeling. The approach is tailored to the task of extracting patient smoking status from clinical notes. It consists of ordering the sentences that mention smoking, centering them at the smoking tokens, and annotating the text to highlight its informative parts. Our experiments on clinical notes from the MIMIC-III clinical database demonstrate that our visualization approach enables human annotators to label sentences up to 3 times faster than with a baseline approach.
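
To make the ordering-and-centering idea concrete, here is a minimal sketch (the layout details are assumptions, not the paper's exact implementation) that keeps sentences mentioning smoking, aligns them on the smoking token, and orders them so similar mentions appear together.

# Minimal sketch: align sentences on a keyword column for faster scanning.
def center_at_keyword(sentences, keyword="smok"):
    """Keep sentences mentioning the keyword, left-pad them so the keyword
    starts in the same column, and order them by the text from the keyword
    onward so that similar mentions end up adjacent."""
    hits = [(s, s.lower().index(keyword)) for s in sentences if keyword in s.lower()]
    if not hits:
        return []
    col = max(off for _, off in hits)
    hits.sort(key=lambda pair: pair[0].lower()[pair[1]:])
    return [" " * (col - off) + s for s, off in hits]

for line in center_at_keyword([
    "Patient denies smoking.",
    "He quit smoking in 2005.",
    "Smoking status: current smoker.",
]):
    print(line)

Because every keyword occurrence lands in the same column, an annotator can scan straight down one visual axis instead of re-reading each sentence from the start.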