Analysis and Performance of Morphological Query Expansion and Language-Filtering Words on Basque Web Searching
Igor Leturia
Antton Gurrutxaga
Nerea Areta
Eli Pociello
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Morphological query expansion and language-filtering words have proved to be valid methods for searching the web for content in Basque via the APIs of commercial search engines, as the implementation of these methods in recent IR and web-as-corpus tools shows. However, no real analysis has been carried out to ascertain the degree of improvement, apart from a comparison of recall and precision using a classical web search engine, measured in terms of hit counts. This paper presents a more theoretical study that confirms the validity of combining both methods. We have measured the increase in recall obtained by morphological query expansion, and the increase in precision and loss in recall produced by language-filtering words, not only by searching the web directly and looking at hit counts, which are not considered very reliable, but also by using both a Basque web corpus and a classical lemmatised corpus, thus providing more exact quantitative results. Furthermore, we provide various corpus-extracted data to be used in the aforementioned methods, such as lists of the most frequent inflections and declensions (cases, persons, numbers, tenses, etc.) for each POS (the most interesting word forms for a morphologically expanded query), or a list of the most used Basque words with their frequencies and document frequencies (the ones that should be used as language-filtering words).
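The two methods named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_query`, the `OR`/`AND` query syntax, the inflected forms of "etxe" (house) and the filter words chosen here are all assumptions made for illustration, and a real search engine API may use different operators.

```python
def build_query(forms, filter_words):
    """Combine inflected forms with OR and language-filtering words with AND.

    forms        -- hypothetical list of inflected forms of one lemma
    filter_words -- hypothetical list of very frequent Basque words that
                    should appear in most Basque documents
    The OR/AND syntax is an assumed, generic boolean query syntax.
    """
    if not forms:
        raise ValueError("at least one word form is required")
    # Morphological expansion: any of the inflected forms may match,
    # which is intended to raise recall.
    expansion = " OR ".join(f'"{form}"' for form in forms)
    # Language filtering: every filter word must also occur in the page,
    # which is intended to raise precision at some cost in recall.
    parts = [f"({expansion})"] + [f'"{word}"' for word in filter_words]
    return " AND ".join(parts)


if __name__ == "__main__":
    # Illustrative values only, not taken from the paper's lists.
    print(build_query(["etxe", "etxea", "etxean"], ["eta", "da"]))
```

In this sketch, longer lists of forms widen the match and longer lists of filter words narrow it, which mirrors the recall and precision trade-off the abstract describes.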
WNTERM: Enriching the MCR with a Terminological Dictionary
Eli Pociello
Antton Gurrutxaga
Eneko Agirre
Izaskun Aldezabal
German Rigau
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In this paper we describe the methodology and the first steps for the creation of WNTERM (from WordNet and Terminology), a specialized lexicon produced from the merger of the EuroWordNet-based Multilingual Central Repository (MCR) and the Basic Encyclopaedic Dictionary of Science and Technology (BDST). As an example, the ecology domain has been used. The final result is a multilingual (Basque and English) light-weight domain ontology, including taxonomic and other semantic relations among its concepts, which is tightly connected to other wordnets.
A Preliminary Study for Building the Basque PropBank
Eneko Agirre
Izaskun Aldezabal
Jone Etxeberria
Eli Pociello
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper presents a methodology for adding a layer of semantic annotation, in terms of semantic roles, to a syntactically annotated corpus of Basque (EPEC). Our proposal combines three resources: the model used in the PropBank project (Palmer et al., 2005), an in-house database of syntactic/semantic subcategorization frames for Basque verbs (Aldezabal, 2004) and the Basque dependency treebank (Aduriz et al., 2003). In order to validate the methodology and to confirm whether the PropBank model is suitable for Basque and our treebank design, we have built lexical entries and labelled all arguments and adjuncts occurring in our treebank for 3 Basque verbs. The result of this study has been very positive, and has produced a methodology adapted to the characteristics of the language and the Basque dependency treebank. Another goal of this study was to determine whether semi-automatic tagging was possible. The idea is to present human taggers with a pre-tagged version of the corpus. We have seen that many arguments could be automatically tagged with high precision, given only the verbal entries for the verbs and a handful of examples.
A methodology for the joint development of the Basque WordNet and Semcor
Eneko Agirre
Izaskun Aldezabal
Jone Etxeberria
Eli Izagirre
Karmele Mendizabal
Eli Pociello
Mikel Quintian
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper describes the methodology adopted to jointly develop the Basque WordNet and a hand-annotated corpus (the Basque Semcor). This joint development allows for better motivated sense distinctions and a tighter coupling between the two resources. The methodology involves editing, tagging and refereeing tasks. We are currently halfway through the nominal part of the 300,000-word corpus (roughly equivalent to a 500,000-word corpus for English). We present a detailed description of the task, including the main criteria for difficult cases in the editing of the senses and the tagging of the corpus, with special mention of multiword entries. Finally, we give a detailed picture of the current figures, as well as an analysis of the agreement rates.
The Basque lexical-sample task
Eneko Agirre
Itziar Aldabe
Mikel Lersundi
David Martínez
Eli Pociello
Larraitz Uria
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text
The Basque Task: Did Systems Perform in the Upperbound?
Eneko Agirre
Elena Garcia
Mikel Lersundi
David Martinez
Eli Pociello
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems