Xiaolong Hou

2023

This paper describes our system used in SemEval-2023 Task 12: Sentiment Analysis for Low-resource African Languages using Twitter Dataset (Muhammad et al., 2023c). The AfriSenti-SemEval Shared Task 12 is based on a collection of Twitter datasets in 14 African languages for sentiment classification. It consists of three sub-tasks. Task A is monolingual sentiment classification, which covered 12 African languages. Task B is multilingual sentiment classification, which combined the training data from Task A (12 African languages). Task C is zero-shot sentiment classification. We utilized various strategies, including monolingual training, multilingual mixed training, and translation technology, and proposed a weighted voting method that combined the results of the different strategies. In the monolingual subtask, our system achieved Top-1 in two languages (Yoruba and Twi) and Top-2 in four languages (Nigerian Pidgin, Algerian Arabic, Swahili, and Multilingual). In the multilingual subtask, our system achieved Top-2 on the published leaderboard.
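
The weighted voting idea mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not say how the weights were chosen or how ties were broken, so the strategy names, the weight values, and the alphabetical tie-break below are assumptions made purely for illustration, not the authors' implementation.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-strategy labels into one label by summing strategy weights.

    predictions: dict mapping a strategy name to its predicted label.
    weights: dict mapping the same strategy names to non-negative floats
             (hypothetical values, e.g. derived from validation scores).
    """
    scores = defaultdict(float)
    for strategy, label in predictions.items():
        scores[label] += weights[strategy]
    # Highest total weight wins; ties are broken alphabetically (an assumption)
    # so that the result is deterministic.
    return min(scores, key=lambda label: (-scores[label], label))


# Hypothetical example: one strong strategy outvotes two weaker ones.
example_predictions = {
    "monolingual": "positive",
    "multilingual": "negative",
    "translation": "negative",
}
example_weights = {"monolingual": 0.6, "multilingual": 0.25, "translation": 0.25}
result = weighted_vote(example_predictions, example_weights)
```

With these assumed weights, the single "positive" vote (0.6) outweighs the two "negative" votes (0.5 in total), which is the behaviour that distinguishes weighted voting from a simple majority vote.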

2022

PINGAN_AI at SemEval-2022 Task 9: Recipe knowledge enhanced model applied in Competence-based Multimodal Question Answering
Zhihao Ruan | Xiaolong Hou | Lianxin Jiang
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper describes our system used in SemEval-2022 Task 9: R2VQ - Competence-based Multimodal Question Answering. We propose a knowledge-enhanced model for predicting answers in the QA task; the model uses BERT as its backbone. We adopt two knowledge-enhancement methods in this model: the knowledge auxiliary text method and the knowledge embedding method. We also design an answer extraction pipeline, which contains an extraction-based model, an automatic keyword labeling module, and an answer generation module. Our system ranked 3rd in Task 9 and achieved an exact match score of 78.21 and a word-level F1 score of 82.62.

Pre-trained language models have been widely applied to standard benchmarks. Due to the flexibility of natural language, the resources available in a given domain can be too limited to support learning precise representations. To address this issue, we propose a novel Transformer-based language model named VarMAE for domain-adaptive language understanding. Under the masked autoencoding objective, we design a context uncertainty learning module to encode the token’s context into a smooth latent distribution. The module can produce diverse and well-formed contextual representations. Experiments on science- and finance-domain NLU tasks demonstrate that VarMAE can be efficiently adapted to new domains with limited resources.

2021

RG PA at SemEval-2021 Task 1: A Contextual Attention-based Model with RoBERTa for Lexical Complexity Prediction
Gang Rao | Maochang Li | Xiaolong Hou | Lianxin Jiang | Yang Mo | Jianping Shen
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

In this paper we propose a contextual attention-based model with two-stage fine-tuning using RoBERTa. First, we perform first-stage fine-tuning of RoBERTa on the corpus, so that the model can learn some prior domain knowledge. Then we obtain the contextual embedding of the context words from the token-level embeddings of the fine-tuned model. We use K-fold cross-validation to obtain K models and ensemble them to get the final result. Finally, we attain 2nd place in the final evaluation phase of sub-task 2 with a Pearson correlation of 0.8575.
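
The K-fold ensembling step described in the abstract above can be sketched as follows. The abstract does not state how the K models were combined, so averaging their predictions is an assumption, and `fit`, `fit_mean`, and the toy data are hypothetical stand-ins for the RoBERTa fine-tuning routine, which is not reproduced here.

```python
from statistics import mean


def kfold_indices(n, k):
    """Yield (train_indices, held_out_indices) for each of k interleaved folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for held_out in range(k):
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, folds[held_out]


def kfold_ensemble(xs, ys, k, fit):
    """Train one model per fold and return a predictor that averages them.

    fit: hypothetical callable (train_xs, train_ys) -> predictor function.
    The held-out fold is not used here; in practice it would typically
    serve for validation or early stopping.
    """
    models = []
    for train_idx, _held_out in kfold_indices(len(xs), k):
        models.append(fit([xs[i] for i in train_idx], [ys[i] for i in train_idx]))

    def predict(x):
        # Assumed combination rule: simple average of the K model outputs.
        return mean(model(x) for model in models)

    return predict


def fit_mean(train_xs, train_ys):
    """Toy stand-in model: always predicts the mean training target."""
    target = mean(train_ys)
    return lambda x: target


# Toy regression data standing in for lexical complexity scores.
toy_xs = list(range(6))
toy_ys = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ensemble = kfold_ensemble(toy_xs, toy_ys, k=3, fit=fit_mean)
ensemble_prediction = ensemble(0)
```

Each of the K models sees a different K-1 folds of the data, so averaging them tends to reduce the variance of any single fine-tuning run.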

The objective of subtask 2 of SemEval-2021 Task 6 is to identify the techniques used together with the span(s) of text covered by each technique. This paper describes the system and model we developed for the task. We first propose a pipeline system that identifies spans and then classifies the technique in the input sequence. However, it struggles with overlapping and nested spans. We then propose to formulate the task as a question answering task within an MRC framework, which achieves a better result than the pipeline method. Moreover, data augmentation and loss design techniques are also explored to alleviate the problems of data sparsity and imbalance. Finally, we attain 3rd place in the final evaluation phase.

2020

This paper describes the model we apply in SemEval-2020 Task 10. We formalize the task of emphasis selection as a simplified query-based machine reading comprehension (MRC) task, i.e. answering the fixed question “Find candidates for emphasis”. We propose a subword puzzle encoding mechanism and a subword fusion layer to align and fuse subwords. By introducing the semantic prior knowledge of the informative query and some other techniques, we attain 7th place during the evaluation phase and first place during the training phase.

Co-authors

Dou Hu 1

Venues

semeval 5
findings 1