2024
Becoming a High-Resource Language in Speech: The Catalan Case in the Common Voice Corpus
Carme Armentano-Oller | Montserrat Marimon | Marta Villegas
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Collecting voice resources for speech recognition systems is a multifaceted challenge, involving legal, technical, and diversity considerations. However, it is crucial to ensure fair access to voice-driven technology across diverse linguistic backgrounds. We describe an ongoing effort to create an extensive, high-quality, publicly available voice dataset for future development of speech technologies in Catalan through the Mozilla Common Voice crowd-sourcing platform. We detail the specific approaches used to address the challenges faced in recruiting contributors and managing the collection, validation, and recording of sentences. This detailed overview can serve as a source of guidance for similar initiatives across other projects and linguistic contexts. The success of this project is evident in the latest corpus release, version 16.1, where Catalan ranks as the most prominent language in the corpus in both recorded and validated hours. This establishes Catalan as a language with substantial speech resources for language technology development and significantly raises its international visibility.
Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan
Aitor Gonzalez-Agirre | Montserrat Marimon | Carlos Rodriguez-Penagos | Javier Aula-Blasco | Irene Baucells | Carme Armentano-Oller | Jorge Palomar-Giner | Baybars Kulebi | Marta Villegas
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Current LLM-based applications are becoming steadily available to everyone with reliable access to technology and the internet. These applications offer benefits to their users that leave those without access to them at a serious disadvantage. Given the vast amount of data needed to train LLMs, the gap between languages with access to such quantities of data and those without it is currently larger than ever. To bridge this gap, the Aina Project was created to provide Catalan with the resources it needs to remain relevant in the context of AI/NLP applications based on LLMs. We thus present a set of strategies to consider when improving technology support for a mid- or low-resource language, especially addressing the sustainability of high-quality data acquisition and the challenges involved in the process. We also introduce a large amount of new annotated data for Catalan. Our hope is that those interested in replicating this work for another language can learn from what worked for us, the challenges that we faced, and the sometimes disheartening truth of working with mid- and low-resource languages.
2019
PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity Recognition track
Aitor Gonzalez-Agirre | Montserrat Marimon | Ander Intxaurrondo | Obdulia Rabal | Marta Villegas | Martin Krallinger
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks
One of the biomedical entity types of relevance for medicine and the biosciences is chemical compounds and drugs. The correct detection of these entities is critical for other text mining applications building on them, such as adverse drug-reaction detection, medication-related fake news, or drug-target extraction. Although a significant effort has been made to detect mentions of drugs/chemicals in English texts, so far only very limited attempts have been made to recognize them in medical documents in other languages. Taking into account the growing amount of medical publications and clinical records written in Spanish, we have organized the first shared task on detecting drug and chemical entities in Spanish medical documents. Additionally, we included a clinical concept-indexing sub-track asking teams to return SNOMED-CT identifiers related to drugs/chemicals for a collection of documents. For this task, named PharmaCoNER, we generated annotation guidelines together with a corpus of 1,000 manually annotated clinical case studies. A total of 22 teams participated in sub-track 1 (77 system runs), and 7 teams in sub-track 2 (19 system runs). Top-scoring teams used sophisticated deep learning approaches yielding very competitive results, with F-measures above 0.91. These results indicate that there is a real interest in promoting biomedical text mining efforts beyond English. We foresee that the PharmaCoNER annotation guidelines, corpus and participant systems will foster the development of new resources for clinical and biomedical text mining systems of Spanish medical data.
2018
Coreference Resolution in FreeLing 4.0
Montserrat Marimon | Lluís Padró | Jordi Turmo
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2017
Annotation of negation in the IULA Spanish Clinical Record Corpus
Montserrat Marimon | Jorge Vivaldi | Núria Bel
Proceedings of the Workshop Computational Semantics Beyond Events and Roles
This paper presents the IULA Spanish Clinical Record Corpus, a corpus of 3,194 sentences extracted from anonymized clinical records and manually annotated with negation markers and their scope. The corpus was conceived as a resource to support clinical text-mining systems, but it is also a useful resource for other Natural Language Processing systems handling clinical texts: automatic encoding of clinical records, diagnosis support, term extraction, among others, as well as for the study of clinical texts. The corpus is publicly available with a CC-BY-SA 3.0 license.
2014
Squibs: Automatic Selection of HPSG-Parsed Sentences for Treebank Construction
Montserrat Marimon | Núria Bel | Lluís Padró
Computational Linguistics, Volume 40, Issue 3 - September 2014
MultiVal - towards a multilingual valence lexicon
Lars Hellan | Dorothee Beermann | Tore Bruland | Mary Esther Kropp Dakubu | Montserrat Marimon
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
MultiVal is a valence lexicon derived from lexicons of computational HPSG grammars for Norwegian, Spanish and Ga (ISO 639-3, gaa), with altogether about 22,000 verb entries and on average more than 200 valence types defined for each language. These lexical resources are mapped onto a common set of discriminants with a common array of values, and stored in a relational database linked to a web demo and a wiki presentation. Search discriminants are syntactic argument structure (SAS), functional specification, situation type and aspect, for any subset of languages, as well as the verb type systems of the grammars. Search results are lexical entries satisfying the discriminants entered, exposing the specifications from the respective provenance grammars. The Ga grammar lexicon has in turn been converted from a Ga Toolbox lexicon. Aside from the creation of such a multilingual valence resource through converging or converting existing resources, the paper also addresses a tool for the creation of such a resource as part of corpus annotation for less resourced languages.
Boosting the creation of a treebank
Blanca Arias | Núria Bel | Mercè Lorente | Montserrat Marimón | Alba Milà | Jorge Vivaldi | Muntsa Padró | Marina Fomicheva | Imanol Larrea
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this paper we present the results of an ongoing experiment in bootstrapping a treebank for Catalan by using a dependency parser trained on Spanish sentences. In order to save time and cost, our approach was to exploit the typological similarities between Catalan and Spanish to create a first Catalan data set quickly by: (i) automatically annotating with a de-lexicalized Spanish parser, (ii) manually correcting the parses, and (iii) using the corrected Catalan sentences to train a Catalan parser. The results showed that about 1,000 parsed sentences are required to train a Catalan parser, and that these were produced in 4 months by 2 annotators.
The IULA Spanish LSP Treebank
Montserrat Marimon | Núria Bel | Beatriz Fisas | Blanca Arias | Silvia Vázquez | Jorge Vivaldi | Carlos Morell | Mercè Lorente
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper presents the IULA Spanish LSP Treebank, a dependency treebank of over 41,000 sentences from different domains (Law, Economy, Computing Science, Environment, and Medicine), developed in the framework of the European project METANET4U. Dependency annotations in the treebank were automatically derived from manually selected parses produced by an HPSG grammar. A deterministic conversion algorithm used the identifiers of grammar rules to identify the heads, the dependents, and some dependency types that were directly transferred onto the dependency structure (e.g., subject, specifier, and modifier), and the identifiers of the lexical entries to identify the argument-related dependency functions (e.g., direct object, indirect object, and oblique complement). The treebank is accessible with a browser that provides concordance-based search functions and delivers the results in two formats: (i) a column-based format, in the style of the CoNLL-2006 shared task, and (ii) a dependency graph, where dependency relations are shown by a directed arrow which goes from the dependent node to the head node. The IULA Spanish LSP Treebank is the first technical corpus of Spanish annotated at the surface syntactic level following dependency grammar theory. The treebank has been made publicly and freely available from the META-SHARE platform with a Creative Commons CC-by licence.
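The column-based output mentioned in the abstract above can be illustrated with a minimal sketch, assuming the standard ten-column CoNLL-X (CoNLL-2006) layout. The sample sentence, its tags, and its dependency labels are hypothetical and are not taken from the treebank; the function names are likewise illustrative. The sketch reads one sentence and lists each relation from the dependent to its head, mirroring the arrow direction described for the dependency graph.

```python
# Field names of the CoNLL-X (CoNLL-2006) format: one token per line, tab-separated.
CONLL_X_FIELDS = ["id", "form", "lemma", "cpostag", "postag",
                  "feats", "head", "deprel", "phead", "pdeprel"]

# Hypothetical sentence and labels, for illustration only (not treebank data).
SAMPLE = """\
1\tEl\tel\td\tda\t_\t2\tspec\t_\t_
2\ttribunal\ttribunal\tn\tnc\t_\t3\tsubj\t_\t_
3\tdictó\tdictar\tv\tvm\t_\t0\tROOT\t_\t_
4\tla\tel\td\tda\t_\t5\tspec\t_\t_
5\tsentencia\tsentencia\tn\tnc\t_\t3\tdo\t_\t_
"""


def parse_conll_x(block):
    """Parse one sentence in CoNLL-X format into a list of token dicts."""
    tokens = []
    for line in block.splitlines():
        if not line.strip():
            continue  # skip blank lines
        cols = line.split("\t")
        if len(cols) != len(CONLL_X_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        tok = dict(zip(CONLL_X_FIELDS, cols))
        tok["id"] = int(tok["id"])
        tok["head"] = int(tok["head"])  # 0 marks the root of the sentence
        tokens.append(tok)
    return tokens


def arcs(tokens):
    """Return (dependent, relation, head) triples, dependent first."""
    forms = {tok["id"]: tok["form"] for tok in tokens}
    forms[0] = "ROOT"
    return [(tok["form"], tok["deprel"], forms[tok["head"]]) for tok in tokens]


if __name__ == "__main__":
    for dependent, relation, head in arcs(parse_conll_x(SAMPLE)):
        print(f"{dependent} --{relation}--> {head}")
```

Storing the head as an integer index, with 0 reserved for the root, keeps the column format and the graph view interchangeable: either can be rebuilt from the other.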
2012
The IULA Treebank
Montserrat Marimon | Beatriz Fisas | Núria Bel | Marta Villegas | Jorge Vivaldi | Sergi Torner | Mercè Lorente | Silvia Vázquez
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper describes ongoing work on the construction of a new treebank for Spanish, the IULA Treebank. This new resource will contain about 60,000 richly annotated sentences as an extension of the existing IULA Technical Corpus, which is only PoS-tagged. In this paper we focus on describing the work done to define the annotation process and the treebank design principles. We report on how the framework used, the DELPH-IN processing framework, has been crucial to the design principles and to the bootstrapping strategy followed, especially regarding the use of stochastic modules for reducing parsing overgeneration. We also report on the different evaluation experiments carried out to guarantee the quality of the results already available.
2010
The Spanish Resource Grammar
Montserrat Marimon
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
This paper describes the Spanish Resource Grammar (SRG), an open-source, multi-purpose, broad-coverage precise grammar for Spanish. The grammar is implemented on the Linguistic Knowledge Builder (LKB) system, is grounded in the theoretical framework of Head-driven Phrase Structure Grammar (HPSG), and uses Minimal Recursion Semantics (MRS) for the semantic representation. We have developed a hybrid architecture which integrates shallow processing functionalities (morphological analysis, and Named Entity recognition and classification) into the parsing process. The SRG has a full-coverage lexicon of closed word classes and contains 50,852 lexical entries for open word classes. The grammar also has 64 lexical rules to perform valence-changing operations on lexical items, and 191 phrase structure rules that combine words and phrases into larger constituents and compositionally build up their semantic representation. The annotation of each parsed sentence in an LKB grammar simultaneously represents a traditional phrase structure tree and an MRS semantic representation. We provide evaluation results on sentences from newspaper texts and discuss future work.
2008
Automatic Acquisition for low frequency lexical items
Núria Bel | Sergio Espeja | Montserrat Marimon
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
This paper addresses a specific case of the task of lexical acquisition, understood as the induction of information about the linguistic characteristics of lexical items on the basis of information gathered from their occurrences in texts. Most recent work in the area of lexical acquisition has used methods that take as much textual data as possible as a source of evidence, but their performance decreases notably when only a few occurrences of a word are available. The importance of covering such low-frequency items lies in the fact that a large proportion of the words in any particular collection of texts occur only a few times, if not just once. Our work proposes to compensate for the lack of information by resorting to linguistic knowledge about the characteristics of lexical classes. This knowledge, obtained from a lexical typology, is formulated probabilistically for use in a Bayesian method that maximizes the information gathered from single occurrences so as to predict the full set of characteristics of the word. Our results show that our method achieves better results than others for the treatment of low-frequency items.
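As a rough illustration of the general idea in the abstract above, and not the authors' actual model, the following minimal sketch applies Bayes' rule to a single observed occurrence: class priors stand in for the typological knowledge, and each class assigns a likelihood to the context in which the word was seen. The class names, cue names, and all probabilities are hypothetical.

```python
def posterior(prior, likelihood, cue):
    """Bayes' rule for one observation: P(class | cue) is proportional to P(cue | class) * P(class)."""
    unnormalized = {cls: prior[cls] * likelihood[cls].get(cue, 0.0) for cls in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError(f"cue {cue!r} has zero probability under every class")
    return {cls: p / total for cls, p in unnormalized.items()}


# Hypothetical prior over lexical classes (assumed to come from a typology).
PRIOR = {"count_noun": 0.7, "mass_noun": 0.3}

# Hypothetical P(cue | class): how often each class appears in each context.
LIKELIHOOD = {
    "count_noun": {"indefinite_article": 0.5, "bare_singular": 0.1, "plural": 0.4},
    "mass_noun": {"indefinite_article": 0.1, "bare_singular": 0.8, "plural": 0.1},
}

if __name__ == "__main__":
    # A word seen exactly once, as a bare singular.
    post = posterior(PRIOR, LIKELIHOOD, "bare_singular")
    print(max(post, key=post.get), post)
```

With these made-up numbers, one bare-singular occurrence is enough to outweigh the prior in favour of the count class, which is the kind of leverage a single observation can provide when class knowledge is encoded probabilistically.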
COLDIC, a Lexicographic Platform for LMF compliant lexica
Núria Bel | Sergio Espeja | Montserrat Marimon | Marta Villegas
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Despite the importance of lexical resources for a number of NLP applications (Machine Translation, Information Extraction, Question Answering, among others), there has traditionally been a lack of generic tools for the creation, maintenance and management of computational lexica. The most direct obstacle to the development of generic tools, independent of any particular application format, was the lack of standards for the description and encoding of lexical resources. The availability of the Lexical Markup Framework (LMF) has changed this scenario and has made possible the development of generic lexical platforms. COLDIC is a generic platform for working with computational lexica. The system has been designed to let the user concentrate on lexicographical tasks while remaining autonomous in the management of the tools. The creation and maintenance of the database, which is the core of the tool, demand no specific training in databases. An LMF-compliant schema, implemented in a Document Type Definition (DTD) describing the lexical resources, is taken by the system to automatically configure the platform. In addition, the most standard web services for interoperability are also generated automatically. Other components of the platform include built-in functions supporting the most common tasks of lexicographic work.
2007
The Spanish Resource Grammar: Pre-processing Strategy and Lexical Acquisition
Montserrat Marimon | Núria Bel | Sergio Espeja | Natalia Seghezzi
ACL 2007 Workshop on Deep Linguistic Processing
Automatic Acquisition of Grammatical Types for Nouns
Núria Bel | Sergio Espeja | Montserrat Marimon
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers
2006
New tools for the encoding of lexical data extracted from corpus
Núria Bel | Sergio Espeja | Montserrat Marimon
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper describes the methodology and tools that are the basis of our platform AAILE. AAILE has been built to supply those working on the construction of lexicons for syntactic parsing with more efficient ways of visualizing and analyzing data extracted from corpora. The platform offers support using techniques such as similarity measures, clustering and pattern classification.
2004
Lexical Entry Templates for Robust Deep Parsing
Montserrat Marimon | Núria Bel
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
We report on the development and employment of lexical entry templates in a large-coverage unification-based grammar of Spanish. The aim of the work reported in this paper is to provide robust deep linguistic processing in order to make the grammar more adequate for industrial NLP applications.
2002
Design and Evaluation of a SLDS for E-Mail Access through the Telephone
Nuria Bel | Javier Caminero | Luis Hernández | Montserrat Marimón | José F. Morlesín | Josep M. Otero | José Relaño | M. Carmen Rodríguez | Pedro M. Ruz | Daniel Tapias
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
Integrating Shallow Linguistic Processing into a Unification-based Spanish Grammar
Montserrat Marimon
COLING 2002: The 19th International Conference on Computational Linguistics
2000
PoS Disambiguation and Partial Parsing Bidirectional Interaction
Montserrat Marimon Felipe | Jordi Porta Zamorano
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)