2021
MIN_PT: An European Portuguese Lexicon for Minorities Related Terms
Paula Fortuna | Vanessa Cortez | Miguel Sozinho Ramalho | Laura Pérez-Mayos
Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)
Hate speech-related lexicons have proved useful for many tasks, such as data collection and classification. However, existing Portuguese lexicons do not distinguish between European and Brazilian Portuguese, and do not include neutral terms that are potentially useful for detecting a broader spectrum of content referring to minorities. In this work, we present MIN_PT, a new European Portuguese Lexicon for Minorities-Related Terms specifically designed to tackle the limitations of existing resources. We describe the data collection and annotation process, discuss the limitations and ethical concerns, and demonstrate the utility of the resource by applying it to a use case for the Portuguese 2021 presidential elections.
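One way to put such a lexicon to work, e.g. for collecting minority-related posts, is whole-word matching after diacritic-insensitive normalisation. Below is a minimal Python sketch of that idea; the file name "min_pt.csv", the two-column (term, category) layout and the example text are hypothetical stand-ins, not the released resource format or the paper's pipeline.

```python
# Minimal, illustrative sketch of lexicon-based matching over Portuguese text.
# The CSV layout (term, category) and the file name are hypothetical.
import csv
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase and strip diacritics so matching is accent-insensitive."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def load_lexicon(path: str) -> dict:
    """Map each normalized lexicon term to its category label."""
    with open(path, newline="", encoding="utf-8") as f:
        return {normalize(row["term"]): row["category"] for row in csv.DictReader(f)}

def match_terms(text: str, lexicon: dict) -> list:
    """Return (term, category) pairs whose whole-word form occurs in the text."""
    normalized = normalize(text)
    return [(term, cat) for term, cat in lexicon.items()
            if re.search(rf"\b{re.escape(term)}\b", normalized)]

lexicon = load_lexicon("min_pt.csv")  # hypothetical file name
print(match_terms("Exemplo de texto sobre a comunidade imigrante", lexicon))
```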
Cartography of Natural Language Processing for Social Good (NLP4SG): Searching for Definitions, Statistics and White Spots
Paula Fortuna | Laura Pérez-Mayos | Ahmed AbuRa’ed | Juan Soler-Company | Leo Wanner
Proceedings of the 1st Workshop on NLP for Positive Impact
The range of work that can be considered as developing NLP for social good (NLP4SG) is enormous. While much of it targets the identification of hate speech or fake news, other work addresses, e.g., text simplification to alleviate the consequences of dyslexia, or coaching strategies to fight depression. However, so far there is no clear picture of which areas are targeted by NLP4SG, who the actors are, what the main scenarios are, and which topics have been left aside. In order to obtain a clearer view in this respect, we first propose a working definition of NLP4SG and identify some primary aspects that are crucial for NLP4SG, including, e.g., areas, ethics, privacy and bias. Then, we draw upon a corpus of around 50,000 articles downloaded from the ACL Anthology. Based on a list of keywords retrieved from the literature and revised in view of the task, we select from this corpus the articles that can be considered to be on NLP4SG according to our definition and analyze them in terms of trends over time, among other aspects. The result is a map of current NLP4SG research and insights concerning the white spots on this map.
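The keyword-based selection step described in the abstract can be approximated in a few lines of Python. The sketch below assumes a hypothetical JSONL dump of ACL Anthology entries with "title" and "abstract" fields and a tiny illustrative keyword list; it does not reproduce the corpus, keyword list or selection criteria actually used in the paper.

```python
# Minimal sketch of keyword-based selection of candidate NLP4SG papers.
# The file "acl_anthology.jsonl", its field names and the keyword list are
# hypothetical placeholders.
import json

KEYWORDS = {"hate speech", "fake news", "text simplification", "mental health"}

def is_nlp4sg_candidate(entry: dict, keywords: set) -> bool:
    """Flag an article if any keyword occurs in its title or abstract."""
    text = f"{entry.get('title', '')} {entry.get('abstract', '')}".lower()
    return any(keyword in text for keyword in keywords)

candidates = []
with open("acl_anthology.jsonl", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        if is_nlp4sg_candidate(entry, KEYWORDS):
            candidates.append(entry)

print(len(candidates), "candidate NLP4SG papers")
```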
On the evolution of syntactic information encoded by BERT’s contextualized representations
Laura Pérez-Mayos | Roberto Carlini | Miguel Ballesteros | Leo Wanner
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
The adaptation of pretrained language models to solve supervised tasks has become a baseline in NLP, and many recent works have focused on studying how linguistic information is encoded in the pretrained sentence representations. Among other information, it has been shown that entire syntax trees are implicitly embedded in the geometry of such models. As these models are often fine-tuned, it becomes increasingly important to understand how the encoded knowledge evolves during fine-tuning. In this paper, we analyze the evolution of the embedded syntax trees during the fine-tuning of BERT for six different tasks, covering all levels of the linguistic structure. Experimental results show that the encoded syntactic information is forgotten (PoS tagging), reinforced (dependency and constituency parsing) or preserved (semantics-related tasks) in different ways during fine-tuning, depending on the task.
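As a starting point for this kind of analysis, contextualized token vectors can be extracted from BERT checkpoints saved at different points of fine-tuning and then fed to a structural probe. The sketch below shows only the representation-extraction step with the Hugging Face transformers library; the fine-tuned checkpoint path is a hypothetical placeholder, and the probe trained on top of these vectors is not shown.

```python
# Minimal sketch: extract last-layer token vectors from BERT checkpoints
# saved along fine-tuning. "./finetuned-pos-checkpoint" is a hypothetical
# local path; the structural probe itself is omitted.
import torch
from transformers import AutoModel, AutoTokenizer

def token_vectors(checkpoint: str, sentence: str) -> torch.Tensor:
    """Return the last-layer contextualized vectors for one sentence."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    model.eval()
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # (num_tokens, hidden_size)

sentence = "The chef who ran to the store was out of food."
for checkpoint in ["bert-base-cased", "./finetuned-pos-checkpoint"]:
    vectors = token_vectors(checkpoint, sentence)
    print(checkpoint, tuple(vectors.shape))
```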
Assessing the Syntactic Capabilities of Transformer-based Multilingual Language Models
Laura Pérez-Mayos | Alba Táboas García | Simon Mille | Leo Wanner
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
How much pretraining data do language models need to learn syntax?
Laura Pérez-Mayos | Miguel Ballesteros | Leo Wanner
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Transformer-based pretrained language models achieve outstanding results in many well-known NLU benchmarks. However, while pretraining methods are very convenient, they are expensive in terms of time and resources. This calls for a study of the impact of pretraining data size on the knowledge of the models. We explore this impact on the syntactic capabilities of RoBERTa, using models trained on incremental sizes of raw text data. First, we use syntactic structural probes to determine whether models pretrained on more data encode a higher amount of syntactic information. Second, we perform a targeted syntactic evaluation to analyze the impact of pretraining data size on the syntactic generalization performance of the models. Third, we compare the performance of the different models on three downstream applications: part-of-speech tagging, dependency parsing and paraphrase identification. We complement our study with an analysis of the cost-benefit trade-off of training such models. Our experiments show that while models pretrained on more data encode more syntactic knowledge and perform better on downstream applications, they do not always perform better across the different syntactic phenomena, and they come at a higher financial and environmental cost.
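A targeted syntactic evaluation of this kind typically contrasts minimal pairs of grammatical and ungrammatical sentences. The sketch below scores one such pair with the public roberta-base checkpoint via a pseudo-log-likelihood (masking one token at a time); it illustrates the evaluation style only, and does not use the paper's incrementally pretrained models or its test suites.

```python
# Minimal sketch of minimal-pair scoring with a masked LM: the grammatical
# sentence should receive the higher pseudo-log-likelihood. Uses the public
# roberta-base checkpoint, not the paper's incrementally pretrained models.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum the log-probability of each token when it is masked out in turn."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip the <s> and </s> special tokens
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

grammatical = "The authors near the senator are famous."
ungrammatical = "The authors near the senator is famous."
print(pseudo_log_likelihood(grammatical) > pseudo_log_likelihood(ungrammatical))
```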
Transformers-based pretrained language models achieve outstanding results in many well-known NLU benchmarks. However, while pretraining methods are very convenient, they are expensive in terms of time and resources. This calls for a study of the impact of pretraining data size on the knowledge of the models. We explore this impact on the syntactic capabilities of RoBERTa, using models trained on incremental sizes of raw text data. First, we use syntactic structural probes to determine whether models pretrained on more data encode a higher amount of syntactic information. Second, we perform a targeted syntactic evaluation to analyze the impact of pretraining data size on the syntactic generalization performance of the models. Third, we compare the performance of the different models on three downstream applications: part-of-speech tagging, dependency parsing and paraphrase identification. We complement our study with an analysis of the cost-benefit trade-off of training such models. Our experiments show that while models pretrained on more data encode more syntactic knowledge and perform better on downstream applications, they do not always offer a better performance across the different syntactic phenomena and come at a higher financial and environmental cost.