Marta Villegas

2022

pdf abs
ParlamentParla: A Speech Corpus of Catalan Parliamentary Sessions
Baybars Kulebi | Carme Armentano-Oller | Carlos Rodriguez-Penagos | Marta Villegas
Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference

Recently, various end-to-end architectures of Automatic Speech Recognition (ASR) are being showcased as an important step towards providing language technologies to all languages instead of a select few such as English. However many languages are still suffering due to the “digital gap,” lacking thousands of hours of transcribed speech data openly accessible that is necessary to train modern ASR architectures. Although Catalan already has access to various open speech corpora, these corpora lack diversity and are limited in total volume. In order to address this lack of resources for Catalan language, in this work we present ParlamentParla, a corpus of more than 600 hours of speech from Catalan Parliament sessions. This corpus has already been used in training of state-of-the-art ASR systems, and proof-of-concept text-to-speech (TTS) models. In this work we explain in detail the pipeline that allows the information publicly available on the parliamentary website to be converted to a speech corpus compatible with training of ASR and possibly TTS models.

This work presents the first large-scale biomedical Spanish language models trained from scratch, using large biomedical corpora consisting of a total of 1.1B tokens and an EHR corpus of 95M tokens. We compared them against general-domain and other domain-specific models for Spanish on three clinical NER tasks. As main results, our models are superior across the NER tasks, rendering them more convenient for clinical NLP applications. Furthermore, our findings indicate that when enough data is available, pre-training from scratch is better than continual pre-training when tested on clinical tasks, raising an exciting research question about which approach is optimal. Our models and fine-tuning scripts are publicly available at HuggingFace and GitHub.

2021

pdf
Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan
Jordi Armengol-Estapé | Casimiro Pio Carrino | Carlos Rodriguez-Penagos | Ona de Gibert Bonet | Carme Armentano-Oller | Aitor Gonzalez-Agirre | Maite Melero | Marta Villegas
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2019

pdf abs
PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity Recognition track
Aitor Gonzalez-Agirre | Montserrat Marimon | Ander Intxaurrondo | Obdulia Rabal | Marta Villegas | Martin Krallinger
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks

One of the biomedical entity types of relevance for medicine or biosciences are chemical compounds and drugs. The correct detection these entities is critical for other text mining applications building on them, such as adverse drug-reaction detection, medication-related fake news or drug-target extraction. Although a significant effort was made to detect mentions of drugs/chemicals in English texts, so far only very limited attempts were made to recognize them in medical documents in other languages. Taking into account the growing amount of medical publications and clinical records written in Spanish, we have organized the first shared task on detecting drug and chemical entities in Spanish medical documents. Additionally, we included a clinical concept-indexing sub-track asking teams to return SNOMED-CT identifiers related to drugs/chemicals for a collection of documents. For this task, named PharmaCoNER, we generated annotation guidelines together with a corpus of 1,000 manually annotated clinical case studies. A total of 22 teams participated in the sub-track 1, (77 system runs), and 7 teams in the sub-track 2 (19 system runs). Top scoring teams used sophisticated deep learning approaches yielding very competitive results with F-measures above 0.91. These results indicate that there is a real interest in promoting biomedical text mining efforts beyond English. We foresee that the PharmaCoNER annotation guidelines, corpus and participant systems will foster the development of new resources for clinical and biomedical text mining systems of Spanish medical data.

pdf abs
Medical Word Embeddings for Spanish: Development and Evaluation
Felipe Soares | Marta Villegas | Aitor Gonzalez-Agirre | Martin Krallinger | Jordi Armengol-Estapé
Proceedings of the 2nd Clinical Natural Language Processing Workshop

Word embeddings are representations of words in a dense vector space. Although they are not recent phenomena in Natural Language Processing (NLP), they have gained momentum after the recent developments of neural methods and Word2Vec. Regarding their applications in medical and clinical NLP, they are invaluable resources when training in-domain named entity recognition systems, classifiers or taggers, for instance. Thus, the development of tailored word embeddings for medical NLP is of great interest. However, we identified a gap in the literature which we aim to fill in this paper: the availability of embeddings for medical NLP in Spanish, as well as a standardized form of intrinsic evaluation. Since most work has been done for English, some established datasets for intrinsic evaluation are already available. In this paper, we show the steps we employed to adapt such datasets for the first time to Spanish, of particular relevance due to the considerable volume of EHRs in this language, as well as the creation of in-domain medical word embeddings for the Spanish using the state-of-the-art FastText model. We performed intrinsic evaluation with our adapted datasets, as well as extrinsic evaluation with a named entity recognition systems using a baseline embedding of general-domain. Both experiments proved that our embeddings are suitable for use in medical NLP in the Spanish language, and are more accurate than general-domain ones.

2016

pdf abs
Leveraging RDF Graphs for Crossing Multiple Bilingual Dictionaries
Marta Villegas | Maite Melero | Núria Bel | Jorge Gracia
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The experiments presented here exploit the properties of the Apertium RDF Graph, principally cycle density and nodes’ degree, to automatically generate new translation relations between words, and therefore to enrich existing bilingual dictionaries with new entries. Currently, the Apertium RDF Graph includes data from 22 Apertium bilingual dictionaries and constitutes a large unified array of linked lexical entries and translations that are available and accessible on the Web (http://linguistic.linkeddata.es/apertium/). In particular, its graph structure allows for interesting exploitation opportunities, some of which are addressed in this paper. Two ‘massive’ experiments are reported: in the first one, the original EN-ES translation set was removed from the Apertium RDF Graph and a new EN-ES version was generated. The results were compared against the previously removed EN-ES data and against the Concise Oxford Spanish Dictionary. In the second experiment, a new non-existent EN-FR translation set was generated. In this case the results were compared against a converted wiktionary English-French file. The results we got are really good and perform well for the extreme case of correlated polysemy. This lead us to address the possibility to use cycles and nodes degree to identify potential oddities in the source data. If cycle density proves efficient when considering potential targets, we can assume that in dense graphs nodes with low degree may indicate potential errors.

2014

pdf abs
Metadata as Linked Open Data: mapping disparate XML metadata registries into one RDF/OWL registry.
Marta Villegas | Maite Melero | Núria Bel
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The proliferation of different metadata schemas and models pose serious problems of interoperability. Maintaining isolated repositories with overlapping data is costly in terms of time and effort. In this paper, we describe how we have achieved a Linked Open Data version of metadata descriptions coming from heterogeneous sources, originally encoded in XML. The resulting model is much simpler than the original XSD schema and avoids problems typical of XML syntax, such as semantic ambiguity and order constraint. Moreover, the open world assumption of RDF/OWL allows to naturally integrate objects from different schemas and to add further extensions, facilitating merging of different models as well as linking to external data. Apart from the advantages in terms of interoperability and maintainability, the merged repository enables end-users to query multiple sources using a unified schema and is able to present them with implicit knowledge derived from the linked data. The approach we present here is easily scalable to any number of sources and schemas.

2012

This paper describes on-going work for the construction of a new treebank for Spanish, The IULA Treebank. This new resource will contain about 60,000 richly annotated sentences as an extension of the already existing IULA Technical Corpus which is only PoS tagged. In this paper we have focused on describing the work done for defining the annotation process and the treebank design principles. We report on how the used framework, the DELPH-IN processing framework, has been crucial in the design principles and in the bootstrapping strategy followed, especially in what refers to the use of stochastic modules for reducing parsing overgeneration. We also report on the different evaluation experiments carried out to guarantee the quality of the already available results.

pdf abs
Using Language Resources in Humanities research
Marta Villegas | Nuria Bel | Carlos Gonzalo | Amparo Moreno | Nuria Simelio
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper we present two real cases, in the fields of discourse analysis of newspapers and communication research which demonstrate the impact of Language Resources (LR) and NLP in the humanities. We describe our collaboration with (i) the Feminario research group from the UAB which has been investigating androcentric practices in Spanish general press since the 80s and whose research suggests that Spanish general press has undergone a dehumanization process that excludes women and men and (ii) the Municipals'11 online project which investigates the Spanish local election campaign in the blogosphere. We will see how NLP tools and LRs make possible the so called e-Humanities research' as they provide Humanities with tools to perform intensive and automatic text analyses. Language technologies have evolved a lot and are mature enough to provide useful tools to researchers dealing with large amount of textual data. The language resources that have been developed within the field of NLP have proven to be useful for other disciplines that are unaware of their existence and nevertheless would greatly benefit from them as they provide (i) exhaustiveness -to guarantee that data coverage is wide and representative enough- and (ii) reliable and significant results -to guarantee that the reported results are statistically significant.

2010

pdf abs
A Case Study on Interoperability for Language Resources and Applications
Marta Villegas | Núria Bel | Santiago Bel | Víctor Rodríguez
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper reports our experience when integrating differ resources and services into a grid environment. The use case we address implies the deployment of several NLP applications as web services. The ultimate objective of this task was to create a scenario where researchers have access to a variety of services they can operate. These services should be easy to invoke and able to interoperate between one another. We essentially describe the interoperability problems we faced, which involve metadata interoperability, data interoperability and service interoperability. We devote special attention to service interoperability and explore the possibility to define common interfaces and semantic description of services. While the web services paradigm suits the integration of different services very well, this requires mutual understanding and the accommodation to common interfaces that not only provide technical solution but also ease the userâs work. Defining common interfaces benefits interoperability but requires the agreement about operations and the set of inputs/outputs. Semantic annotation allows defining some sort of taxonomy that organizes and collects the set of admissible operations and types input/output parameters.

2008

pdf abs
COLDIC, a Lexicographic Platform for LMF compliant lexica
Núria Bel | Sergio Espeja | Montserrat Marimon | Marta Villegas
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Despite of the importance of lexical resources for a number of NLP applications (Machine Translation, Information Extraction, Question Answering, among others), there has been a traditional lack of generic tools for the creation, maintenance and management of computational lexica. The most direct obstacle for the development of generic tools, independent of any particular application format, was the lack of standards for the description and encoding of lexical resources. The availability of the Lexical Markup Framework (LMF) has changed this scenario and has made it possible the development of generic lexical platforms. COLDIC is a generic platform for working with computational lexica. The system has been designed to let the user concentrate on lexicographical tasks, but still being autonomous in the management of the tools. The creation and maintenance of the database, which is the core of the tool, demand no specific training in databases. A LMF compliant schema implemented in a Document Type Definition (DTD) describing the lexical resources is taken by the system to automatically configure the platform. Besides, the most standard web services for interoperability are also generated automatically. Other components of the platform include build-in functions supporting the most common tasks of the lexicographic work.

The ISLE project is a continuation of the long standing EAGLES initiative, carried out under the Human Language Technology (HLT) programme in collaboration between American and European groups in the framework of the EU-US International Research Co-operation, supported by NSF and EC. In this paper we concentrate on the current position of the ISLE Computational Lexicon Working Group (CLWG), whose activities aim at defining a general schema for a multilingual lexical entry (MILE), as the basis for a standard framework for multilingual computational lexicons. The needs and features of existing Machine Translation systems provide the main reference points for the process of consensual definition of the MILE. The overall structure of the MILE will be illustrated with particular attention to some of the issues raised for multilingual lexicons by the need of expressing complex transfer conditions among translation equivalents

2000

Co-authors

Venues

clinicalnlp1