Laura Alonso Alemany

Also published as: Laura Alonso, Laura Alonso i Alemany

2024

Warning: This paper contains explicit statements of offensive stereotypes which may be upsetting The study of bias, fairness and social impact in Natural Language Processing (NLP) lacks resources in languages other than English. Our objective is to support the evaluation of bias in language models in a multilingual setting. We use stereotypes across nine types of biases to build a corpus containing contrasting sentence pairs, one sentence that presents a stereotype concerning an underadvantaged group and another minimally changed sentence, concerning a matching advantaged group. We build on the French CrowS-Pairs corpus and guidelines to provide translations of the existing material into seven additional languages. In total, we produce 11,139 new sentence pairs that cover stereotypes dealing with nine types of biases in seven cultural contexts. We use the final resource for the evaluation of relevant monolingual and multilingual masked language models. We find that language models in all languages favor sentences that express stereotypes in most bias categories. The process of creating a resource that covers a wide range of language types and cultural settings highlights the difficulty of bias evaluation, in particular comparability across languages and contexts.

2023

Approaches to bias assessment usually require such technical skills that, by design, they leave discrimination experts out. In this paper we present EDIA, a tool that facilitates that experts in discrimination explore social biases in word embeddings and masked language models. Experts can then characterize those biases so that their presence can be assessed more systematically, and actions can be planned to address them. They can work interactively to assess the effects of different characterizations of bias in a given word embedding or language model, which helps to specify informal intuitions in concrete resources for systematic testing.

pdf abs
Which Argumentative Aspects of Hate Speech in Social Media can be reliably identified?
Damián Ariel Furman | Pablo Torres | José A. Rodríguez | Laura Alonso Alemany | Diego Letzen | Vanina Martínez
Proceedings of the Fourth International Workshop on Designing Meaning Representations

The expansion of Large Language Models (LLMs) into more serious areas of application, involving decision-making and the forming of public opinion, calls for a more thoughtful treatment of texts. Augmenting them with explicit and understandable argumentative analysis could foster a more reasoned usage of chatbots, text completion mechanisms or other applications. However, it is unclear which aspects of argumentation can be reliably identified and integrated by them. In this paper we propose an adaptation of Wagemans (2016)’s Periodic Table of Arguments to identify different argumentative aspects of texts, with a special focus on hate speech in social media. We have empirically assessed the reliability with which each of these aspects can be automatically identified. We analyze the implications of these results, and how to adapt the proposal to obtain reliable representations of those that cannot be successfully identified.

2022

pdf abs
RoBERTuito: a pre-trained language model for social media text in Spanish
Juan Manuel Pérez | Damián Ariel Furman | Laura Alonso Alemany | Franco M. Luque
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Since BERT appeared, Transformer language models and transfer learning have become state-of-the-art for natural language processing tasks. Recently, some works geared towards pre-training specially-crafted models for particular domains, such as scientific papers, medical documents, user-generated texts, among others. These domain-specific models have been shown to improve performance significantly in most tasks; however, for languages other than English, such models are not widely available. In this work, we present RoBERTuito, a pre-trained language model for user-generated text in Spanish, trained on over 500 million tweets. Experiments on a benchmark of tasks involving user-generated text showed that RoBERTuito outperformed other pre-trained language models in Spanish. In addition to this, our model has some cross-lingual abilities, achieving top results for English-Spanish tasks of the Linguistic Code-Switching Evaluation benchmark (LinCE) and also competitive performance against monolingual models in English Twitter tasks. To facilitate further research, we make RoBERTuito publicly available at the HuggingFace model hub together with the dataset used to pre-train it.

2018

pdf
Increasing Argument Annotation Reproducibility by Using Inter-annotator Agreement to Improve Guidelines
Milagro Teruel | Cristian Cardellino | Fernando Cardellino | Laura Alonso Alemany | Serena Villata
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf abs
Legal NERC with ontologies, Wikipedia and curriculum learning
Cristian Cardellino | Milagro Teruel | Laura Alonso Alemany | Serena Villata
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

In this paper, we present a Wikipedia-based approach to develop resources for the legal domain. We establish a mapping between a legal domain ontology, LKIF (Hoekstra et al. 2007), and a Wikipedia-based ontology, YAGO (Suchanek et al. 2007), and through that we populate LKIF. Moreover, we use the mentions of those entities in Wikipedia text to train a specific Named Entity Recognizer and Classifier. We find that this classifier works well in the Wikipedia, but, as could be expected, performance decreases in a corpus of judgments of the European Court of Human Rights. However, this tool will be used as a preprocess for human annotation. We resort to a technique called “curriculum learning” aimed to overcome problems of overfitting by learning increasingly more complex concepts. However, we find that in this particular setting, the method works best by learning from most specific to most general concepts, not the other way round.

pdf
As bases de dados verbais ADESSE e ViPEr: uma análise constrastiva das construções locativas em espanhol e em português (The verbal databases ADESSE and ViPEr: a contrastive analysis of locative constructs in Spanish and Portuguese)[In Portuguese]
Roana Rodrigues | Oto Vale | Laura Alonso Alemany
Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology

2010

pdf bib
Data-driven computational linguistics at FaMAF-UNC, Argentina
Laura Alonso Alemany | Gabriel Infante-Lopez
Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas

pdf
IRASubcat, a highly parametrizable, language independent tool for the acquisition of verbal subcategorization information from corpus
Ivana Romina Altamirano | Laura Alonso Alemany
Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas

2006

pdf abs
The Sensem Corpus: a Corpus Annotated at the Syntactic and Semantic Level
Irene Castellón | Ana Fernández-Montraveta | Gloria Vázquez | Laura Alonso Alemany | Joan Antoni Capilla
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The primary aim of the project SENSEM (Sentence Semantics, BFF2003-06456) is the construction of a Lexical Data Base illustrating the syntactic and semantic behavior of each of the senses of the 250 most frequent verbs of Spanish. With this objective in mind, we are currently building an annotated corpus consisting of sentences extracted from the electronic version of the newspaper El Periódico de Catalunya, totalling approximately 1 million words, with 100 examples of each verb. By the time of the conference, we will be about to complete the annotation of 25,000 sentences, which means roughly a corpus of 800,000 words. Approximately 400,000 of them will have been revised. We expect to make the corpus publicly available by the end of 2006.

2004

pdf bib
A Framework for Feature based Description of Low level Discourse
Laura Alonso Alemany | Ezequiel Andujar Hinojosa | Robert Sola Salvatierra
Proceedings of the Workshop on Discourse Annotation

pdf
Knowledge intensive e-mail summarization in CARPANTA
Laura Alonso | Irene Castellón | Bernardino Casas | Lluís Padró
Proceedings of the ACL Interactive Poster and Demonstration Sessions

pdf abs
Multiple Sequence Alignment for Characterizing the Lineal Structure of Revision
Laura Alonso | Irene Castellón | Jordi Escribano | Xavier Messeguer | Lluís Padró
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

We present a first approach to the application of a data mining technique, Multiple Sequence Alignment, to the systematization of a polemic aspect of discourse, namely, the expression of contrast, concession, counterargument and semantically similar discursive relations. The representation of the phenomena under study is carried out by very simple techniques, mostly pattern-matching, but the results allow to drive insightful conclusions on the organization of this aspect of discourse: equivalence classes of discourse markers are established, and systematic patterns are discovered, which will be applied in enhancing a discursive parser.

pdf
Re-using High-quality Resources for Continued Evaluation of Automated Summarization Systems
Laura Alonso | Maria Fuentes | Marc Massot | Horacio Rodríguez
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf
Semantic Categorization of Spanish Se-constructions
Glòria Vázquez | Ana Fernández Montraveta | Irene Castellón | Laura Alonso
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)