Alejandro Mosquera


2022

This paper describes the winning approach in the first automated German text complexity assessment shared task as part of KONVENS 2022. To solve this difficult problem, the evaluated system relies on an ensemble of regression models that successfully combines both traditional feature engineering and pre-trained resources. Moreover, the use of adversarial validation is proposed as a method for countering the data drift identified during the development phase, thus helping to select relevant models and features and avoid leaderboard overfitting. The best submission reached 0.43 mapped RMSE on the test set during the final phase of the competition.
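As a rough illustration of the adversarial validation idea mentioned above: the goal is to measure how easily training rows can be told apart from test rows, and to treat features that separate the two sets well as drifting. The abstract does not give the paper's actual procedure, so the sketch below is a simplified, univariate stand-in that scores each feature by its train-versus-test AUC instead of training a full classifier. The feature names and values are hypothetical.

```python
from typing import Dict, List, Sequence


def auc(scores_neg: Sequence[float], scores_pos: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def drift_by_feature(
    train: Dict[str, List[float]], test: Dict[str, List[float]]
) -> Dict[str, float]:
    """Per-feature separability of train vs. test rows.

    0.5 means the feature cannot tell the two sets apart;
    1.0 means it separates them perfectly (strong drift).
    """
    result = {}
    for name in train:
        a = auc(train[name], test[name])
        # Drift in either direction counts, so fold the score around 0.5.
        result[name] = max(a, 1.0 - a)
    return result


# Hypothetical feature values, for illustration only.
train = {"sentence_length": [10, 12, 11, 13], "word_freq": [0.2, 0.5, 0.4, 0.3]}
test = {"sentence_length": [20, 22, 21, 23], "word_freq": [0.3, 0.4, 0.2, 0.5]}

drift = drift_by_feature(train, test)
for name, score in drift.items():
    print(f"{name}: {score:.2f}")
```

In this toy data, `sentence_length` separates the two sets completely while `word_freq` does not separate them at all, so the first feature would be a candidate for removal or closer inspection. A fuller version would usually train a classifier on all features jointly and inspect its cross-validated AUC.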
This paper describes the use of AutoNLP techniques applied to the detection of patronizing and condescending language (PCL) in a binary classification scenario. The proposed approach combines meta-learning, in order to identify the best performing combination of deep learning architectures, with the synthesis of adversarial training examples, thus boosting robustness and model generalization. A submission from this system was evaluated as part of the first sub-task of SemEval 2022 - Task 4 and achieved an F1 score of 0.57, which is 16 percentage points higher than the RoBERTa baseline provided by the organizers.

2021

This paper revisits feature engineering approaches for predicting the complexity level of English words in a particular context using regression techniques. Our best submission to the Lexical Complexity Prediction (LCP) shared task was ranked 3rd out of 48 systems for sub-task 1 and achieved Pearson correlation coefficients of 0.779 and 0.809 for single words and multi-word expressions respectively. The conclusion is that a combination of lexical, contextual and semantic features can still produce strong baselines when compared against human judgement.

2020

This paper describes a method and system to solve the problem of detecting offensive language in social media using anti-adversarial features. Our submission to the SemEval-2020 Task 12 challenge was generated by a stacked ensemble of neural networks fine-tuned on the OLID dataset and additional external sources. For Task-A (English), text normalisation filters were applied at both graphical and lexical levels. The normalisation step effectively mitigates not only the natural presence of lexical variants but also intentional attempts to bypass moderation by introducing out-of-vocabulary words. Our approach provides strong F1 scores for both the 2020 (0.9134) and 2019 (0.8258) challenges.
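To make the normalisation step above more concrete, here is a minimal sketch of a character-level filter that undoes two common obfuscations: symbol-for-letter substitutions and elongated character runs. The abstract does not describe the actual filters, so the substitution table and the `normalise` function are hypothetical and only illustrate the general idea.

```python
import re

# Hypothetical substitution table: visually similar symbols mapped to letters.
SUBSTITUTIONS = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)


def normalise(text: str) -> str:
    """Lowercase, undo symbol substitutions, and shorten elongated runs."""
    text = text.lower().translate(SUBSTITUTIONS)
    # Collapse any character repeated three or more times down to two,
    # so "stuuuupid" and "stuuuuuuupid" map to the same token.
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


print(normalise("1d10t"))
print(normalise("Stuuuupid"))
print(normalise("h@t3"))
```

A filter this blunt also rewrites genuine numbers ("2015" becomes "2ois"), so a practical version would need a vocabulary lookup or context check before applying a substitution.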

2014

2013

2011