Itziar Aldabe


Overview of the ELE Project
Itziar Aldabe | Jane Dunne | Aritz Farwell | Owen Gallagher | Federico Gaspari | Maria Giagkou | Jan Hajic | Jens Peter Kückens | Teresa Lynn | Georg Rehm | German Rigau | Katrin Marheinecke | Stelios Piperidis | Natalia Resende | Tea Vojtěchová | Andy Way
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

This paper provides an overview of the ongoing European Language Equality (ELE) project, an 18-month action funded by the European Commission that involves 52 partners. The primary goal of ELE is to prepare the European Language Equality Programme, in the form of a strategic research, innovation and implementation agenda and a roadmap for achieving full digital language equality (DLE) in Europe by 2030.

Does Corpus Quality Really Matter for Low-Resource Languages?
Mikel Artetxe | Itziar Aldabe | Rodrigo Agerri | Olatz Perez-de-Viñaspre | Aitor Soroa
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. While prior work has identified major issues with the quality of these datasets (Kreutzer et al., 2021), it is not clear how this impacts downstream performance. Taking representation learning in Basque as a case study, we explore tailored crawling (manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl. Our new corpus, called EusCrawl, is similar in size to the Basque portion of popular multilingual corpora like CC100 and mC4, yet it has a much higher quality according to native annotators. For instance, 66% of documents are rated as high-quality for EusCrawl, in contrast with <33% for both mC4 and CC100. Nevertheless, we obtain similar results on downstream NLU tasks regardless of the corpus used for pre-training. Our work suggests that NLU performance in low-resource languages is not primarily constrained by the quality of the data, and that other factors like corpus size and domain coverage can play a more important role.

Proceedings of the Workshop Towards Digital Language Equality within the 13th Language Resources and Evaluation Conference
Itziar Aldabe | Begoña Altuna | Aritz Farwell | German Rigau
Proceedings of the Workshop Towards Digital Language Equality within the 13th Language Resources and Evaluation Conference


Linguistic Appropriateness and Pedagogic Usefulness of Reading Comprehension Questions
Andrea Horbach | Itziar Aldabe | Marie Bexte | Oier Lopez de Lacalle | Montse Maritxalar
Proceedings of the Twelfth Language Resources and Evaluation Conference

Automatic generation of reading comprehension questions is a topic receiving growing interest in the NLP community, but there is currently no consensus on evaluation metrics, and many approaches focus on linguistic quality only while ignoring the pedagogic value and appropriateness of questions. This paper addresses these weaknesses with a new evaluation scheme in which the questions in the questionnaire are structured hierarchically, so that human annotators are not confronted with evaluation measures that do not make sense for a certain question. We show through an annotation study that our scheme can be applied, but that annotators with some level of expertise are needed. We also created and evaluated two new evaluation data sets from the biology domain for Basque and German, composed of questions written by people with an educational background, which will be publicly released. Results show that manually generated questions are in general of higher linguistic and pedagogic quality and that, among the human-generated questions, teacher-generated ones tend to be most useful.

Domain Adapted Distant Supervision for Pedagogically Motivated Relation Extraction
Oscar Sainz | Oier Lopez de Lacalle | Itziar Aldabe | Montse Maritxalar
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper we present a relation extraction system that, given a text, extracts pedagogically motivated relation types, as a previous step to obtaining a semantic representation of the text which will make it possible to automatically generate questions for reading comprehension. The system maps pedagogically motivated relations to relations from ConceptNet and deploys distant supervision for relation extraction. We run a study on a subset of those relations in order to analyse the viability of our approach. For that, we build a domain-specific relation extraction system and explore two relation extraction models: a state-of-the-art model based on transfer learning and a discrete feature-based machine learning model. Experiments show that the neural model obtains better results in terms of F-score, and we obtain promising results on the subset of relations suitable for pedagogical purposes. We thus consider that distant supervision for relation extraction is a valid approach in our target domain, i.e. biology.


Building Named Entity Recognition Taggers via Parallel Corpora
Rodrigo Agerri | Yiling Chung | Itziar Aldabe | Nora Aranberri | Gorka Labaka | German Rigau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


A Multilingual Predicate Matrix
Maddalen Lopez de Lacalle | Egoitz Laparra | Itziar Aldabe | German Rigau
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents the Predicate Matrix 1.3, a lexical resource resulting from the integration of multiple sources of predicate information including FrameNet, VerbNet, PropBank and WordNet. This new version of the Predicate Matrix has been extended to cover nominal predicates by adding mappings to NomBank. Similarly, we have integrated resources in Spanish, Catalan and Basque. As a result, the Predicate Matrix 1.3 provides a multilingual lexicon to allow interoperable semantic analysis in multiple languages.


Semantic Interoperability for Cross-lingual and cross-document Event Detection
Piek Vossen | Egoitz Laparra | German Rigau | Itziar Aldabe
Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation

From TimeLines to StoryLines: A preliminary proposal for evaluating narratives
Egoitz Laparra | Itziar Aldabe | German Rigau
Proceedings of the First Workshop on Computing News Storylines

SemEval-2015 Task 4: TimeLine: Cross-Document Event Ordering
Anne-Lyse Minard | Manuela Speranza | Eneko Agirre | Itziar Aldabe | Marieke van Erp | Bernardo Magnini | German Rigau | Rubén Urizar
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

Document Level Time-anchoring for TimeLine Extraction
Egoitz Laparra | Itziar Aldabe | German Rigau
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)


EHU-ALM: Similarity-Feature Based Approach for Student Response Analysis
Itziar Aldabe | Montse Maritxalar | Oier Lopez de Lacalle
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)


The Basque lexical-sample task
Eneko Agirre | Itziar Aldabe | Mikel Lersundi | David Martínez | Eli Pociello | Larraitz Uria
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text