Arantza Díaz de Ilarraza

Also published as: Arantza Diaz de Ilarraza, A Diaz de Ilarraza, A. Diaz de Ilarraza Sanchez, A. Diaz de Ilarraza, A. Díaz de Ilarraza


Multilingual segmentation based on neural networks and pre-trained word embeddings
Mikel Iruskieta | Kepa Bengoetxea | Aitziber Atutxa Salazar | Arantza Diaz de Ilarraza
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019

The DISRPT 2019 workshop has organized a shared task aiming to identify cross-formalism and multilingual discourse segments. Elementary Discourse Units (EDUs) are quite similar across different theories. Segmentation is the very first stage on the way to rhetorical annotation. Still, each annotation project adopted several decisions with consequences not only on the annotation of the relational discourse structure but also at the segmentation stage. In this shared task, we have employed pre-trained word embeddings and neural networks (BiLSTM+CRF) to perform the segmentation. We report F1 results for 6 languages: Basque (0.853), English (0.919), French (0.907), German (0.913), Portuguese (0.926) and Spanish (0.868 and 0.769). Finally, we also pursued an error analysis based on clause typology for Basque and Spanish, in order to understand the performance of the segmenter.


Konbitzul: an MWE-specific database for Spanish-Basque
Uxoa Iñurrieta | Itziar Aduriz | Arantza Díaz de Ilarraza | Gorka Labaka | Kepa Sarasola
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Annotating Abstract Meaning Representations for Spanish
Noelia Migueles-Abraira | Rodrigo Agerri | Arantza Diaz de Ilarraza
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


Enriching Basque Coreference Resolution System using Semantic Knowledge sources
Ander Soraluze | Olatz Arregi | Xabier Arregi | Arantza Díaz de Ilarraza
Proceedings of the 2nd Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2017)

In this paper we present a Basque coreference resolution system enriched with semantic knowledge. An error analysis carried out revealed the deficiencies that the system had in resolving coreference cases in which semantic or world knowledge is needed. We attempt to improve the deficiencies using two semantic knowledge sources, specifically Wikipedia and WordNet.

Rule-Based Translation of Spanish Verb-Noun Combinations into Basque
Uxoa Iñurrieta | Itziar Aduriz | Arantza Díaz de Ilarraza | Gorka Labaka | Kepa Sarasola
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

This paper presents a method to improve the translation of Verb-Noun Combinations (VNCs) in a rule-based Machine Translation (MT) system for Spanish-Basque. Linguistic information about a set of VNCs is gathered from the public database Konbitzul and integrated into the MT system, leading to an improvement in BLEU, NIST and TER scores, as well as clearly better results according to human evaluators.

Framework for the Analysis of Simplified Texts Taking Discourse into Account: the Basque Causal Relations as Case Study
Itziar Gonzalez-Dios | Arantza Diaz de Ilarraza | Mikel Iruskieta
Proceedings of the 6th Workshop on Recent Advances in RST and Related Formalisms


Using Linguistic Data for English and Spanish Verb-Noun Combination Identification
Uxoa Iñurrieta | Arantza Díaz de Ilarraza | Gorka Labaka | Kepa Sarasola | Itziar Aduriz | John Carroll
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We present a linguistic analysis of a set of English and Spanish verb+noun combinations (VNCs), and a method to use this information to improve VNC identification. Firstly, a sample of frequent VNCs is analysed in depth and tagged along lexico-semantic and morphosyntactic dimensions, obtaining satisfactory inter-annotator agreement scores. Then, a VNC identification experiment is undertaken, where the analysed linguistic data is combined with chunking information and syntactic dependencies. A comparison between the results of the experiment and the results obtained by a basic detection method shows that VNC identification can be greatly improved by using linguistic information, as a large number of additional occurrences are detected with high precision.

Coreference Resolution for the Basque Language with BART
Ander Soraluze | Olatz Arregi | Xabier Arregi | Arantza Díaz de Ilarraza | Mijail Kabadjov | Massimo Poesio
Proceedings of the Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2016)

A Preliminary Study of Statistically Predictive Syntactic Complexity Features and Manual Simplifications in Basque
Itziar Gonzalez-Dios | María Jesús Aranzabe | Arantza Díaz de Ilarraza
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

In this paper, we present a comparative analysis of statistically predictive syntactic features of complexity and the treatment of these features by humans when simplifying texts. To that end, we have used a list of the five most statistically predictive features obtained automatically and the Corpus of Basque Simplified Texts (CBST) to analyse how the syntactic phenomena in these features have been manually simplified. Our aim is to go beyond the descriptions of operations found in the corpus and relate the multidisciplinary findings to understand text complexity from different points of view. We also present some issues that can be important when analysing linguistic complexity.

The impact of simple feature engineering in multilingual medical NER
Rebecka Weegar | Arantza Casillas | Arantza Diaz de Ilarraza | Maite Oronoz | Alicia Pérez | Koldo Gojenola
Proceedings of the Clinical Natural Language Processing Workshop (ClinicalNLP)

The goal of this paper is to examine the impact of simple feature engineering mechanisms before applying more sophisticated techniques to the task of medical NER. Sometimes papers using scientifically sound techniques present raw baselines that could be improved by adding simple and cheap features. This work focuses on entity recognition for the clinical domain for three languages: English, Swedish and Spanish. The task is tackled using simple features, starting from the window size, capitalization, prefixes, and moving to POS and semantic tags. This work demonstrates that a simple initial step of feature engineering can improve the baseline results significantly. Hence, the contributions of this paper are: first, a short list of guidelines well supported with experimental results on three languages and, second, a detailed description of the relevance of these features for medical NER.


Exploiting portability to build an RBMT prototype for a new source language
Nora Aranberri | Gorka Labaka | Arantza Díaz de Ilarraza | Kepa Sarasola
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Deep-syntax TectoMT for English-Spanish MT
Gorka Labaka | Oneka Jauregi | Arantza Díaz de Ilarraza | Michael Ustaszewski | Nora Aranberri | Eneko Agirre
Proceedings of the 1st Deep Machine Translation Workshop


Comparison of post-editing productivity between professional translators and lay users
Nora Aranberri | Gorka Labaka | Arantza Diaz de Ilarraza | Kepa Sarasola
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas

This work compares the post-editing productivity of professional translators and lay users. We integrate an English to Basque MT system within Bologna Translation Service, an end-to-end translation management platform, and perform a productivity experiment in a real working environment. Six translators and six lay users translate or post-edit two texts from English into Basque. Results suggest that overall, post-editing increases translation throughput for both translators and users, although the latter seem to benefit more from the MT output. We observe that translators and users perceive MT differently. Additionally, a preliminary analysis seems to suggest that familiarity with the domain, source text complexity and MT quality might affect potential productivity gain.

Simple or Complex? Assessing the readability of Basque Texts
Itziar Gonzalez-Dios | María Jesús Aranzabe | Arantza Díaz de Ilarraza | Haritz Salaberri
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

The annotation of the Central Unit in Rhetorical Structure Trees: A Key Step in Annotating Rhetorical Relations
Mikel Iruskieta | Arantza Díaz de Ilarraza | Mikel Lersundi
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Making Biographical Data in Wikipedia Readable: A Pattern-based Multilingual Approach
Itziar Gonzalez-Dios | María Jesús Aranzabe | Arantza Díaz de Ilarraza
Proceedings of the Workshop on Automatic Text Simplification - Methods and Applications in the Multilingual Society (ATS-MA 2014)


Combining Rule-Based and Statistical Syntactic Analyzers
Iakes Goenaga | Koldobika Gojenola | María Jesús Aranzabe | Arantza Díaz de Ilarraza | Kepa Bengoetxea
Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages

First Approaches on Spanish Medical Record Classification Using Diagnostic Term to Class Transduction
A. Casillas | A. Díaz de Ilarraza | K. Gojenola | M. Oronoz | Alicia Pérez
Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing


Hybrid Machine Translation Guided by a Rule–Based System
Cristina España-Bonet | Gorka Labaka | Arantza Díaz de Ilarraza | Lluís Màrquez
Proceedings of Machine Translation Summit XIII: Papers

Using Kybots for Extracting Events in Biomedical Texts
Arantza Casillas | Arantza Díaz de Ilarraza | Koldo Gojenola | Maite Oronoz | German Rigau
Proceedings of BioNLP Shared Task 2011 Workshop


Building the Basque PropBank
Izaskun Aldezabal | María Jesús Aranzabe | Arantza Díaz de Ilarraza | Ainara Estarrona
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper presents the work that has been carried out to annotate semantic roles in the Basque Dependency Treebank (BDT). We will describe the resources we have used and the way the annotation of 100 verbs has been done. We decided to follow the model proposed in the PropBank project that has been deployed in other languages, such as Chinese, Spanish, Catalan and Russian. The resources used are: an in-house database with syntactic/semantic subcategorization frames for Basque verbs, an English-Basque verb mapping based on Levin’s classification and the BDT itself. Detailed guidelines for human taggers have been established as a result of this annotation process. In addition, we have characterized the information associated with the semantic tag. Besides, and based on this study, we will define semi-automatic procedures that will facilitate the task of manual annotation for the rest of the verbs of the Treebank. We have also adapted AbarHitz, a tool used in the construction of the BDT, for the task of annotating semantic roles according to the proposed characterization.


Relevance of Different Segmentation Options on Spanish-Basque SMT
Arantza Díaz de Ilarraza | Gorka Labaka | Kepa Sarasola
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

Evaluating the Impact of Morphosyntactic Ambiguity in Grammatical Error Detection
Arantza Díaz de Ilarraza | Koldo Gojenola | Maite Oronoz
Proceedings of the International Conference RANLP-2009


Spanish-to-Basque MultiEngine Machine Translation for a Restricted Domain
Iñaki Alegria | Arantza Casillas | Arantza Diaz de Ilarraza | Jon Igartua | Gorka Labaka | Mikel Lersundi | Aingeru Mayor | Kepa Sarasola
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers

We present our initial strategy for Spanish-to-Basque MultiEngine Machine Translation, a language pair with very different structure and word order and with no huge parallel corpus available. This hybrid proposal is based on the combination of three different MT paradigms: Example-Based MT, Statistical MT and Rule-Based MT. We have evaluated the system, reporting automatic evaluation metrics for a corpus in a test domain. The first results obtained are encouraging.

Detecting Erroneous Uses of Complex Postpositions in an Agglutinative Language
Arantza Díaz de Ilarraza | Koldo Gojenola | Maite Oronoz
Coling 2008: Companion volume: Posters

Strategies for sustainable MT for Basque: incremental design, reusability, standardization and open-source
I. Alegria | X. Arregi | X. Artola | A. Diaz de Ilarraza | G. Labaka | M. Lersundi | A. Mayor | K. Sarasola
Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages


Structure, Annotation and Tools in the Basque ZT Corpus
N. Areta | A. Gurrutxaga | I. Leturia | Z. Polin | R. Saiz | I. Alegria | X. Artola | A. Diaz de Ilarraza | N. Ezeiza | A. Sologaistoa | A. Soroa | A. Valverde
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The ZT corpus (Basque Corpus of Science and Technology) is a tagged collection of specialized texts in Basque, which aims to be a main resource in research and development on written technical Basque: terminology, syntax and style. It will be the first written corpus in Basque which will be distributed by ELDA (at the end of 2006) and it aims to be a methodological and functional reference for new projects in the future (i.e. a national corpus for Basque). We also present the technology and the tools to build this Corpus. These tools, Corpusgile and Eulia, provide a flexible and extensible infrastructure for creating, visualizing and managing corpora and for consulting, visualizing and modifying annotations generated by linguistic tools.

Using Machine Learning Techniques to Build a Comma Checker for Basque
Iñaki Alegria | Bertol Arrieta | Arantza Diaz de Ilarraza | Eli Izagirre | Montse Maritxalar
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions


An Open Architecture for Transfer-based Machine Translation between Spanish and Basque
Iñaki Alegria | Arantza Diaz de Ilarraza | Gorka Labaka | Mikel Lersundi | Aingeru Mayor | Kepa Sarasola | Mikel L. Forcada | Sergio Ortiz-Rojas | Lluís Padró
Workshop on open-source machine translation

We present the current status of development of an open architecture for the translation from Spanish into Basque. The machine translation architecture uses an open source analyser for Spanish and new modules mainly based on finite-state transducers. The project is integrated in the OpenTrad initiative, a larger government funded project shared among different universities and small companies, which will also include MT engines for translation among the main languages in Spain. The main objective is the construction of an open, reusable and interoperable framework. This paper describes the design of the engine, the formats it uses for the communication among the modules, the modules reused from another project named Matxin and the new modules we are building.


Abar-Hitz: An Annotation Tool for the Basque Dependency Treebank
Arantza Díaz de Ilarraza | Aitzpea Garmendia | Maite Oronoz
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Towards a Dependency Parser for Basque
M.J. Aranzabe | J.M. Arriola | A. Diaz de Ilarraza
Proceedings of the Workshop on Recent Advances in Dependency Grammar


Semiautomatic Labelling of Semantic Features
Arantza Díaz de Ilarraza | Aingeru Mayor | Kepa Sarasola
COLING 2002: The 19th International Conference on Computational Linguistics

A Class Library for the Integration of NLP Tools: Definition and implementation of an Abstract Data Type Collection for the manipulation of SGML documents in a context of stand-off linguistic annotation
X. Artola | A. Díaz de Ilarraza | N. Ezeiza | K. Gojenola | G. Hernández | A. Soroa
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)


A Proposal for the Integration of NLP Tools using SGML-Tagged Documents
X. Artola | A. Díaz de Ilarraza | N. Ezeiza | K. Gojenola | A. Maritxalar | A. Soroa
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)


From Psycholinguistic Modelling of Interlanguage in Second Language Acquisition to a Computational Model
Montse Maritxalar | Arantza Diaz de Ilarraza | Maite Oronoz
CoNLL97: Computational Natural Language Learning


Lexical Knowledge Representation in an Intelligent Dictionary Help System
E. Agirre | X. Arregi | X. Artola | A. Diaz de Ilarraza | K. Sarasola
COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics


A Morphological Analysis Based Method for Spelling Correction
I. Aduriz | E. Agirre | I. Alegria | X. Arregi | J.M Arriola | X. Artola | A. Diaz de Ilarraza | N. Ezeiza | M. Maritxalar | K. Sarasola | M. Urkia
Sixth Conference of the European Chapter of the Association for Computational Linguistics


XUXEN: A Spelling Checker/Corrector for Basque Based on Two-Level Morphology
E. Agirre | I Alegria | X Arregi | X Artola | A Diaz de Ilarraza | M Maritxalar | K Sarasola | M Urkia
Third Conference on Applied Natural Language Processing


A Mechanism for ellipsis resolution in dialogued systems
A. Diaz de Ilarraza Sanchez | H. Rodriguez Hontoria | F. Maillo Verdejo
COLING 1990 Volume 3: Papers presented to the 13th International Conference on Computational Linguistics