Antton Gurrutxaga

Also published as: A. Gurrutxaga

2018

This paper presents a Basque corpus in which Verbal Multiword Expressions (VMWEs) were annotated following universal guidelines. Information on the annotation is given, and some points for discussion of the guidelines are also proposed. The corpus is useful not only for NLP-related research, but also for drawing conclusions about Basque phraseology in comparison with other languages.

2016

Fostering digital representation of EU regional and minority languages: the Digital Language Diversity Project
Claudia Soria | Irene Russo | Valeria Quochi | Davyth Hicks | Antton Gurrutxaga | Anneli Sarhimaa | Matti Tuomisto
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Poor digital representation of minority languages further prevents their usability on digital media and devices. The Digital Language Diversity Project, a three-year project funded under the Erasmus+ programme, aims at addressing the problem of low digital representation of EU regional and minority languages by giving their speakers the intellectual and practical skills to create, share, and reuse online digital content. Availability of digital content and technical support to use it are essential prerequisites for the development of language-based digital applications, which in turn can boost digital usage of these languages. In this paper we introduce the project, its aims, objectives and current activities for sustaining digital usability of minority languages through adult education.

2013

Combining Different Features of Idiomaticity for the Automatic Classification of Noun+Verb Expressions in Basque
Antton Gurrutxaga | Iñaki Alegria
Proceedings of the 9th Workshop on Multiword Expressions

2012

Measuring the compositionality of NV expressions in Basque by means of distributional similarity techniques
Antton Gurrutxaga | Iñaki Alegria
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present several experiments aiming at measuring the semantic compositionality of NV expressions in Basque. Our approach is based on the hypothesis that compositionality can be related to distributional similarity. The contexts of each NV expression are compared with the contexts of its corresponding components by means of different techniques, such as similarity measures usually used with the Vector Space Model (VSM), Latent Semantic Analysis (LSA), and some measures implemented in the Lemur Toolkit, such as the Indri index, tf-idf, the Okapi index and Kullback-Leibler divergence. Using our previous work with cooccurrence techniques as a baseline, the results point to improvements using the Indri index or Kullback-Leibler divergence, and a slight further improvement when these are used in combination with cooccurrence measures such as $t$-score, via rank aggregation. This work is part of a project for MWE extraction and characterization using different techniques aiming at measuring the properties related to idiomaticity, such as institutionalization, non-compositionality and lexico-syntactic fixedness.
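As a rough illustration of the hypothesis in the abstract above (compositionality related to distributional similarity), the following minimal sketch compares the context words of an expression with those of its noun and verb using cosine similarity over raw counts. It is not the authors' implementation: the function names, the averaging of the two similarities, and the toy contexts are all assumptions made for the example, and the paper's other measures (LSA, Indri, Okapi, Kullback-Leibler divergence) are not reproduced.

```python
import math
from collections import Counter


def context_vector(contexts):
    """Build a bag-of-words count vector from a list of tokenized contexts."""
    return Counter(word for context in contexts for word in context)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def compositionality_score(expr_contexts, noun_contexts, verb_contexts):
    """Hypothetical score: mean similarity of the expression to its parts.

    Averaging the noun and verb similarities is an assumption of this
    sketch, not a choice documented in the abstract.
    """
    expr = context_vector(expr_contexts)
    sim_noun = cosine(expr, context_vector(noun_contexts))
    sim_verb = cosine(expr, context_vector(verb_contexts))
    return (sim_noun + sim_verb) / 2


# Toy contexts with placeholder tokens (not real corpus data).
expr_ctx = [["w1", "w2"], ["w2", "w3"]]
noun_ctx = [["w1", "w2"], ["w4"]]
verb_ctx = [["w5", "w6"]]
score = compositionality_score(expr_ctx, noun_ctx, verb_ctx)
```

Under this reading, a score near 1 would suggest that the expression occurs in contexts similar to those of its components, and a score near 0 would suggest the opposite.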

2011

Automatic Extraction of NV Expressions in Basque: Basic Issues on Cooccurrence Techniques
Antton Gurrutxaga | Iñaki Alegria
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

2008

Analysis and Performance of Morphological Query Expansion and Language-Filtering Words on Basque Web Searching
Igor Leturia | Antton Gurrutxaga | Nerea Areta | Eli Pociello
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Morphological query expansion and language-filtering words have proved to be valid methods when searching the web for content in Basque via APIs of commercial search engines, as the implementation of these methods in recent IR and web-as-corpus tools shows. However, no real analysis has been carried out to ascertain the degree of improvement, apart from a comparison of recall and precision using a classical web search engine and measured in terms of hit counts. This paper deals with a more theoretical study that confirms the validity of the combination of both methods. We have measured the increase in recall obtained by morphological query expansion and the increase in precision and loss in recall produced by language-filtering words, not only by searching the web directly and looking at the hit counts, which are not considered to be very reliable at best, but also using both a Basque web corpus and a classical lemmatised corpus, thus providing more exact quantitative results. Furthermore, we provide various corpora-extracted data to be used in the aforementioned methods, such as lists of the most frequent inflections and declensions (cases, persons, numbers, tenses, etc.) for each POS (the most interesting word forms for a morphologically expanded query), or a list of the most used Basque words with their frequencies and document frequencies (the ones that should be used as language-filtering words).
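To make the two methods in the abstract above concrete, here is a minimal sketch of how a query might be assembled: the inflected forms of a lemma are joined with OR (morphological query expansion), and a few very frequent Basque words are appended as required terms (language-filtering words). The function name, the quoted-phrase and `OR` syntax, and the particular forms and filter words are assumptions for illustration; they are not taken from the paper's lists, and real search engine APIs may use a different syntax.

```python
def build_query(inflected_forms, filter_words):
    """Combine inflected forms (OR) with language-filtering words (AND).

    The query syntax used here (double quotes, OR, implicit AND by
    juxtaposition) is an assumed generic search syntax.
    """
    expanded = " OR ".join(f'"{form}"' for form in inflected_forms)
    filters = " ".join(f'"{word}"' for word in filter_words)
    return f"({expanded}) {filters}".strip()


# Illustrative input: a few forms of "etxe" (house) and two very
# common Basque words chosen by hand for this example.
query = build_query(["etxea", "etxeak", "etxean"], ["eta", "da"])
print(query)
```

Adding more inflected forms should raise recall, while adding more filter words should raise precision at some cost in recall, which is the trade-off the paper measures.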

WNTERM: Enriching the MCR with a Terminological Dictionary
Eli Pociello | Antton Gurrutxaga | Eneko Agirre | Izaskun Aldezabal | German Rigau
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we describe the methodology and the first steps for the creation of WNTERM (from WordNet and Terminology), a specialized lexicon produced from the merger of the EuroWordNet-based Multilingual Central Repository (MCR) and the Basic Encyclopaedic Dictionary of Science and Technology (BDST). As an example, the ecology domain has been used. The final result is a multilingual (Basque and English) light-weight domain ontology, including taxonomic and other semantic relations among its concepts, which is tightly connected to other wordnets.

2006

The ZT corpus (Basque Corpus of Science and Technology) is a tagged collection of specialized texts in Basque, intended as a main resource for research and development on written technical Basque: terminology, syntax and style. It will be the first written corpus in Basque to be distributed by ELDA (at the end of 2006), and it is intended as a methodological and functional reference for future projects (e.g. a national corpus for Basque). We also present the technology and the tools used to build this corpus. These tools, Corpusgile and Eulia, provide a flexible and extensible infrastructure for creating, visualizing and managing corpora and for consulting, visualizing and modifying annotations generated by linguistic tools.

2004

This project combines linguistic and statistical information to develop a term extraction tool for Basque. Because Basque is an agglutinative and highly inflected language, the treatment of morphosyntactic information is vital. In addition, due to the late unification of the language, texts show greater term dispersion than in a highly normalized language. The result is a semi-automatic, XML-based terminology extraction tool for use in managing technical and scientific information.