Gerardo Sierra

Also published as: Gerardo Sierra-Martínez

2025

pdf bib abs
Disagreement in Metaphor Annotation of Mexican Spanish Science Tweets
Alec Sánchez-Montero | Gemma Bel-Enguix | Sergio-Luis Ojeda-Trueba | Gerardo Sierra
Proceedings of Context and Meaning: Navigating Disagreements in NLP Annotation

Traditional linguistic annotation methods often strive for a gold standard with hard labels as input for natural language processing models, assuming an underlying objective truth for all tasks. However, disagreement among annotators is a common scenario, even for seemingly objective linguistic tasks, and is particularly prominent in figurative language annotation, since multiple valid interpretations can sometimes coexist. This study presents the annotation process for identifying metaphorical tweets within a corpus of 3733 Public Communication of Science texts written in Mexican Spanish, emphasizing inter-annotator disagreement. Using Fleiss’ and Cohen’s Kappa alongside agreement percentages, we evaluated metaphorical language detection through binary classification in three situations: two subsets of the corpus labeled by three different non-expert annotators each, and a subset of disagreement tweets, identified in the non-expert annotation phase, re-labeled by three expert annotators. Our results suggest that expert annotation may improve agreement levels, but does not exclude disagreement, likely due to factors such as the relatively novelty of the genre, the presence of multiple scientific topics, and the blending of specialized and non-specialized discourse. Going further, we propose adopting a learning-from-disagreement approach for capturing diverse annotation perspectives to enhance computational metaphor detection in Mexican Spanish.

2024

pdf bib abs
PCIC at SMM4H 2024: Enhancing Reddit Post Classification on Social Anxiety Using Transformer Models and Advanced Loss Functions
Leon Hecht | Victor Pozos | Helena Gomez Adorno | Gibran Fuentes-Pineda | Gerardo Sierra | Gemma Bel-Enguix
Proceedings of the 9th Social Media Mining for Health Research and Applications (SMM4H 2024) Workshop and Shared Tasks

We present our approach to solving the task of identifying the effect of outdoor activities on social anxiety based on reddit posts. We employed state-of-the-art transformer models enhanced with a combination of advanced loss functions. Data augmentation techniques were also used to address class imbalance within the training set. Our method achieved a macro-averaged F1-score of 0.655 on the test data, surpassing the workshop’s mean F1-Score of 0.519. These findings suggest that integrating weighted loss functions improves the performance of transformer models in classifying unbalanced text data, while data augmentation can improve the model’s ability to generalize.

pdf bib abs
PCICUNAM at WASSA 2024: Cross-lingual Emotion Detection Task with Hierarchical Classification and Weighted Loss Functions
Jesús Vázquez-Osorio | Gerardo Sierra | Helena Gómez-Adorno | Gemma Bel-Enguix
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

This paper addresses the shared task of multi-lingual emotion detection in tweets, presented at the Workshop on Computational Approaches to Subjectivity, Sentiment, and Social Media Analysis (WASSA) co-located with the ACL 2024 conference. The task involves predicting emotions from six classes in tweets from five different languages using only English for model training. Our approach focuses on addressing class imbalance through data augmentation, hierarchical classification, and the application of focal loss and weighted cross-entropy loss functions. These methods enhance our transformer-based model’s ability to transfer emotion detection capabilities across languages, resulting in improved performance despite the constraints of limited computational resources.

2018

pdf bib abs
Challenges of language technologies for the indigenous languages of the Americas
Manuel Mager | Ximena Gutierrez-Vasques | Gerardo Sierra | Ivan Meza-Ruiz
Proceedings of the 27th International Conference on Computational Linguistics

Indigenous languages of the American continent are highly diverse. However, they have received little attention from the technological perspective. In this paper, we review the research, the digital resources and the available NLP systems that focus on these languages. We present the main challenges and research questions that arise when distant languages and low-resource scenarios are faced. We would like to encourage NLP research in linguistically rich and diverse areas like the Americas.

pdf bib abs
Sociolinguistic Corpus of WhatsApp Chats in Spanish among College Students
Alejandro Dorantes | Gerardo Sierra | Tlauhlia Yamín Donohue Pérez | Gemma Bel-Enguix | Mónica Jasso Rosales
Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media

This work presents the Sociolinguistic Corpus of WhatsApp Chats in Spanish among College Students, a corpus of raw data for general use. Its purpose is to offer data for the study of of language and interactions via Instant Messaging (IM) among bachelors. Our paper consists of an overview of both the corpus’s content and demographic metadata. Furthermore, it presents the current research being conducted with it —namely parenthetical expressions, orality traits, and code-switching. This work also includes a brief outline of similar corpora and recent studies in the field of IM.

2017

pdf bib
Applying the Rhetorical Structure Theory in Alzheimer patients’ speech
Anayeli Paulino | Gerardo Sierra
Proceedings of the 6th Workshop on Recent Advances in RST and Related Formalisms

2016

pdf bib abs
Axolotl: a Web Accessible Parallel Corpus for Spanish-Nahuatl
Ximena Gutierrez-Vasques | Gerardo Sierra | Isaac Hernandez Pompa
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper describes the project called Axolotl which comprises a Spanish-Nahuatl parallel corpus and its search interface. Spanish and Nahuatl are distant languages spoken in the same country. Due to the scarcity of digital resources, we describe the several problems that arose when compiling this corpus: most of our sources were non-digital books, we faced errors when digitizing the sources and there were difficulties in the sentence alignment process, just to mention some. The documents of the parallel corpus are not homogeneous, they were extracted from different sources, there is dialectal, diachronical, and orthographical variation. Additionally, we present a web search interface that allows to make queries through the whole parallel corpus, the system is capable to retrieve the parallel fragments that contain a word or phrase searched by a user in any of the languages. To our knowledge, this is the first Spanish-Nahuatl public available digital parallel corpus. We think that this resource can be useful to develop language technologies and linguistic studies for this language pair.

pdf bib
Detection of Alzheimer’s disease based on automatic analysis of common objects descriptions
Laura Hernández-Domínguez | Edgar García-Cano | Sylvie Ratté | Gerardo Sierra-Martínez
Proceedings of the 7th Workshop on Cognitive Aspects of Computational Language Learning

2012

pdf bib abs
Using Wikipedia to Validate the Terminology found in a Corpus of Basic Textbooks
Jorge Vivaldi | Luis Adrián Cabrera-Diego | Gerardo Sierra | María Pozzi
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

A scientific vocabulary is a set of terms that designate scientific concepts. This set of lexical units can be used in several applications ranging from the development of terminological dictionaries and machine translation systems to the development of lexical databases and beyond. Even though automatic term recognition systems exist since the 80s, this process is still mainly done by hand, since it generally yields more accurate results, although not in less time and at a higher cost. Some of the reasons for this are the fairly low precision and recall results obtained, the domain dependence of existing tools and the lack of available semantic knowledge needed to validate these results. In this paper we present a method that uses Wikipedia as a semantic knowledge resource, to validate term candidates from a set of scientific text books used in the last three years of high school for mathematics, health education and ecology. The proposed method may be applied to any domain or language (assuming there is a minimal coverage by Wikipedia).

Gerardo Sierra

2025

2024

2018

2017

2016

2012

2011

2010

2009

2008

2000

Co-authors

Venues