2022
SINAI@SMM4H’22: Transformers for biomedical social media text mining in Spanish
Mariia Chizhikova | Pilar López-Úbeda | Manuel C. Díaz-Galiano | L. Alfonso Ureña-López | M. Teresa Martín-Valdivia
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task
This paper covers the participation of the SINAI team in Tasks 5 and 10 of the Social Media Mining for Health (#SMM4H) workshop at COLING-2022. These tasks focus on leveraging Twitter posts written in Spanish for healthcare research. The objective of Task 5 was to classify tweets reporting COVID-19 symptoms, while Task 10 required identifying disease mentions in Twitter posts. The presented systems explore large RoBERTa language models pre-trained on Twitter data for the tweet classification task and on general-domain data for the disease recognition task. We also present a text pre-processing methodology implemented in both systems and describe an initial weakly-supervised fine-tuning phase along with a submission post-processing procedure designed for Task 10. The systems obtained an F1-score of 0.84 on Task 5 and 0.77 on Task 10.
SHARE: A Lexicon of Harmful Expressions by Spanish Speakers
Flor Miriam Plaza-del-Arco | Ana Belén Parras Portillo | Pilar López Úbeda | Beatriz Gil | María-Teresa Martín-Valdivia
Proceedings of the Thirteenth Language Resources and Evaluation Conference
In this paper we present SHARE, a new lexical resource with 10,125 offensive terms and expressions collected from Spanish speakers. We retrieve this vocabulary using Fiero, an existing chatbot developed to engage users in conversation and collect insults via Telegram. This vocabulary has been manually labeled by five annotators, obtaining a kappa agreement coefficient of 78.8%. In addition, we leverage the lexicon to release OffendES_spans, the first corpus in Spanish for offensive span identification research. Finally, we show the utility of our resource as an interpretability tool to explain why a comment may be considered offensive.
2021
Identifying professions & occupations in Health-related Social Media using Natural Language Processing
Alberto Mesa Murgado | Ana Parras Portillo | Pilar López Úbeda | Maite Martin | Alfonso Ureña-López
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task
This paper describes the entry of the SINAI research group in SMM4H’s ProfNER task on the identification of professions and occupations in health-related social media. Specifically, we participated in Task 7a: Tweet Binary Classification, which determines whether a tweet contains mentions of occupations, and in Task 7b: NER Offset Detection and Classification, which aims at predicting occupation mentions and classifying them by profession and working status.
SINAI at SemEval-2021 Task 5: Combining Embeddings in a BiLSTM-CRF model for Toxic Spans Detection
Flor Miriam Plaza-del-Arco | Pilar López-Úbeda | L. Alfonso Ureña-López | M. Teresa Martín-Valdivia
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)
This paper describes the participation of the SINAI team in Task 5: Toxic Spans Detection, which consists of identifying the spans that make a text toxic. Although several resources and systems have been developed so far in the context of offensive language, both annotation and tasks have mainly focused on classifying whether a text is offensive or not. However, detecting toxic spans is crucial to identifying why a text is toxic and can assist human moderators in locating this type of content on social media. To accomplish the task, we follow a deep learning-based approach using a bidirectional variant of a Long Short-Term Memory network along with a stacked Conditional Random Field decoding layer (BiLSTM-CRF). Specifically, we test the performance of combinations of different pre-trained word embeddings for recognizing toxic entities in text. The results show that combining word embeddings helps in detecting offensive content. Our team ranks 29th out of 91 participants.
2020
Transfer learning applied to text classification in Spanish radiological reports
Pilar López Úbeda | Manuel Carlos Díaz-Galiano | L. Alfonso Urena Lopez | Maite Martin | Teodoro Martín-Noguerol | Antonio Luna
Proceedings of the LREC 2020 Workshop on Multilingual Biomedical Text Processing (MultilingualBIO 2020)
Pre-trained text encoders have rapidly advanced the state of the art on many Natural Language Processing tasks. This paper presents the use of transfer learning methods applied to the automatic detection of codes in radiological reports in Spanish. Assigning codes to a clinical document is a popular task in NLP and in the biomedical domain. These codes can be of two types: standard classifications (e.g. ICD-10) or codes specific to each clinic or hospital. In this study we show a system using specific radiology clinic codes. The dataset is composed of 208,167 radiology reports labeled with 89 different codes. The corpus has been evaluated with three methods using the BERT model applied to Spanish: Multilingual BERT, BETO and XLM. The results are interesting, with a pre-trained multilingual model obtaining an F1-score of 70%.
2019
Detecting Anorexia in Spanish Tweets
Pilar López Úbeda | Flor Miriam Plaza del Arco | Manuel Carlos Díaz Galiano | L. Alfonso Urena Lopez | Maite Martin
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Mental health is one of the main concerns of today’s society. Early detection of symptoms can greatly help people with mental disorders. People are using social networks more and more to express emotions, sentiments and mental states. Thus, processing this information with NLP technologies can be applied to the automatic detection of mental problems such as eating disorders. However, the first step toward solving the problem should be to provide a corpus with which to evaluate our systems. In this paper, we specifically focus on detecting anorexia messages on Twitter. First, we have generated a new corpus of tweets extracted from different accounts, including anorexia and non-anorexia messages in Spanish. The corpus is called SAD: Spanish Anorexia Detection corpus. To validate the effectiveness of the SAD corpus, we also propose several machine learning approaches for automatically detecting anorexia symptoms in the corpus. The good results obtained show that applying textual classification methods is a promising option for developing this kind of system, demonstrating that these tools could be used by professionals to help in the early detection of mental problems.
Using Machine Learning and Deep Learning Methods to Find Mentions of Adverse Drug Reactions in Social Media
Pilar López Úbeda | Manuel Carlos Díaz Galiano | Maite Martin | L. Alfonso Urena Lopez
Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task
Over time, social networks have become very popular platforms for sharing health-related information. Social Media Mining for Health Applications (SMM4H) provides tasks such as those described in this document to help manage information in the health domain. This document presents the first participation of the SINAI group. We study approaches based on machine learning and deep learning to extract adverse drug reaction mentions from Twitter. The results obtained in the tasks are encouraging: we are close to the average of all participants and even above it in some cases.
Using Snomed to recognize and index chemical and drug mentions.
Pilar López Úbeda | Manuel Carlos Díaz Galiano | L. Alfonso Urena Lopez | Maite Martin
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks
In this paper we describe a new named entity extraction system. Our work proposes a system for the identification and annotation of drug names in Spanish biomedical texts based on machine learning and deep learning models. Subsequently, a standardized Snomed code is assigned to these drugs. For this purpose, Natural Language Processing tools and techniques have been used, and a dictionary has been built from different sources of information. The results are promising: we obtain an F1 score of 78% on the first sub-track, and in the second task we correctly map 72% of the found entities to Snomed.