Yingwen Fu


2022

LaoPLM: Pre-trained Language Models for Lao
Nankai Lin | Yingwen Fu | Chuwei Chen | Ziyu Yang | Shengyi Jiang
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Trained on large corpora, pre-trained language models (PLMs) can capture different levels of concepts in context and hence generate universal language representations, which benefit multiple downstream natural language processing (NLP) tasks. Although PLMs have been widely used in most NLP applications, especially for high-resource languages such as English, they remain under-represented in Lao NLP research. Previous work on Lao has been hampered by the lack of annotated datasets and the sparsity of language resources. In this work, we construct a text classification dataset to alleviate the resource-scarce situation of the Lao language. In addition, we present the first transformer-based PLMs for Lao in four versions: BERT-Small, BERT-Base, ELECTRA-Small, and ELECTRA-Base. Furthermore, we evaluate them on two downstream tasks: part-of-speech (POS) tagging and text classification. Experiments demonstrate the effectiveness of our Lao models. We release our models and datasets to the community, hoping to facilitate the future development of Lao NLP applications.

Effective Unsupervised Constrained Text Generation based on Perturbed Masking
Yingwen Fu | Wenjie Ou | Zhou Yu | Yue Lin
Findings of the Association for Computational Linguistics: ACL 2022

Unsupervised constrained text generation aims to generate text under a given set of constraints without any supervised data. Current state-of-the-art methods stochastically sample edit positions and actions, which may cause unnecessary search steps. In this paper, we propose PMCTG, which improves effectiveness by searching for the best edit position and action at each step. Specifically, PMCTG extends the perturbed masking technique to efficiently locate the most incongruent token to edit, and then introduces four multi-aspect scoring functions to select the edit action, further reducing search difficulty. Since PMCTG requires no supervised data, it can be applied to different generation tasks. We show that, under the unsupervised setting, PMCTG achieves new state-of-the-art results on two representative tasks, namely keywords-to-sentence generation and paraphrasing.
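
One plausible reading of the edit-position search is the following minimal sketch: mask each token in turn and measure how far its contextual representation moves, treating a large shift as a sign the token fits its context poorly. This is a hedged illustration assuming a HuggingFace BERT model; the function name and the exact distance measure are assumptions, not the authors' released code.

```python
# Sketch of a perturbed-masking incongruence score (illustrative, not the
# paper's exact formulation). Assumes the `transformers` and `torch` packages.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def incongruence_scores(sentence: str) -> list[tuple[str, float]]:
    """Score each token by how much masking it shifts its own contextual
    representation; the highest-scoring token is a candidate edit position."""
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"]
    with torch.no_grad():
        base = model(**enc).last_hidden_state[0]  # (seq_len, hidden)
    scores = []
    for i in range(1, input_ids.size(1) - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            pert = model(input_ids=masked,
                         attention_mask=enc["attention_mask"]).last_hidden_state[0]
        # Distance between the original and perturbed representations of token i
        dist = torch.dist(base[i], pert[i]).item()
        scores.append((tokenizer.convert_ids_to_tokens(int(input_ids[0, i])), dist))
    return scores
```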

BERT 4EVER@EvaHan 2022: Ancient Chinese Word Segmentation and Part-of-Speech Tagging Based on Adversarial Learning and Continual Pre-training
Hailin Zhang | Ziyu Yang | Yingwen Fu | Ruoyao Ding
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

With the development of artificial intelligence (AI) and the digital humanities, ancient Chinese language resources and technology have also grown, becoming an increasingly important part of the study of historiography and traditional Chinese culture. To promote research on the automatic analysis of ancient Chinese, we conduct various experiments on the ancient Chinese word segmentation and part-of-speech (POS) tagging tasks of the EvaHan 2022 shared task. We model word segmentation and POS tagging jointly as a single sequence tagging problem. In addition, we apply a series of training strategies to the provided ancient Chinese pre-trained model to enhance its performance. Concretely, we employ several augmentation strategies, including continual pre-training, adversarial training, and ensemble learning, to alleviate the limited amount of training data and the imbalance between POS labels. Extensive experiments demonstrate that our proposed models achieve considerable performance on the ancient Chinese word segmentation and POS tagging tasks.

Keywords: ancient Chinese, word segmentation, part-of-speech tagging, adversarial learning, continual pre-training
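
Adversarial training for sequence tagging is commonly realized as FGM-style perturbation of the embedding layer. Below is a minimal sketch under that assumption; the class, the embedding-name filter, and the epsilon value are illustrative, and the paper's exact recipe may differ.

```python
# FGM-style adversarial training for a HuggingFace-style token-classification
# model (illustrative sketch). Assumes the word embeddings are reachable via a
# parameter whose name contains "word_embeddings".
import torch

class FGM:
    def __init__(self, model, epsilon: float = 1.0):
        self.model = model
        self.epsilon = epsilon
        self.backup = {}

    def attack(self, emb_name: str = "word_embeddings"):
        # Perturb embedding weights along the gradient direction.
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0:
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self, emb_name: str = "word_embeddings"):
        # Undo the perturbation after the adversarial backward pass.
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# Typical training step:
#   loss = model(**batch).loss; loss.backward()      # clean gradients
#   fgm.attack(); model(**batch).loss.backward()     # adversarial gradients
#   fgm.restore(); optimizer.step(); optimizer.zero_grad()
```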

BERT 4EVER@LT-EDI-ACL2022 - Detecting Signs of Depression from Social Media: Detecting Depression in Social Media using Prompt-Learning and Word-Emotion Cluster
Xiaotian Lin | Yingwen Fu | Ziyu Yang | Nankai Lin | Shengyi Jiang
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

In this paper, we report the solution of team BERT 4EVER for the LT-EDI-ACL2022 shared task on detecting signs of depression from social media, which aims to classify YouTube comments into one of three categories: no, moderate, or severe depression. We model the problem both as a text classification task and as a text generation task, and propose a separate model for each. To combine the knowledge learned by these two models, we softly fuse their predicted probabilities and select the label with the highest probability as the final output. In addition, we leverage multiple augmentation strategies, such as back translation and adversarial training, to improve the models' generalization capability. Experimental results demonstrate the effectiveness of the proposed models and the two augmentation strategies.
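
The soft fusion step amounts to a weighted average of the two models' class-probability vectors followed by an argmax. A minimal sketch follows; the fusion weight and label names are assumptions for illustration, not the authors' reported configuration.

```python
# Soft fusion of two models' predicted probabilities (illustrative sketch).
import numpy as np

LABELS = ["not depression", "moderate", "severe"]

def soft_fuse(p_cls: np.ndarray, p_gen: np.ndarray, w: float = 0.5) -> str:
    """p_cls, p_gen: probability vectors over the three labels from the
    classification and generation models; w weights the classification model."""
    fused = w * p_cls + (1.0 - w) * p_gen
    return LABELS[int(np.argmax(fused))]

# Example: the classifier leans "moderate", the generator leans "not depression";
# the fused distribution [0.35, 0.50, 0.15] yields "moderate".
print(soft_fuse(np.array([0.2, 0.6, 0.2]), np.array([0.5, 0.4, 0.1])))
```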