2023
Termout: a tool for the semi-automatic creation of term databases
Rogelio Nazar | Nicolas Acosta
Proceedings of the Workshop on Computational Terminology in NLP and Translation Studies (ConTeNTS) Incorporating the 16th Workshop on Building and Using Comparable Corpora (BUCC)
We propose a tool for the semi-automatic production of terminological databases, divided into the steps of corpus processing, terminology extraction, database population and management. With this tool it is possible to obtain a draft macrostructure (a lemma-list) as well as data for the microstructural level, such as grammatical information (morphosyntactic patterns, gender, formation process) and semantic information (hypernyms, equivalents in another language, definitions and synonyms). In this paper we offer an overall description of the software and an evaluation of its performance, for which we used a linguistics corpus in English and Spanish.
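To make the macro- and microstructure concrete, here is a minimal Python sketch of the kind of record such a database could hold per lemma; the field names and layout are assumptions for illustration, not the actual Termout schema.

```python
# Illustrative only: an assumed record layout, not the actual Termout schema.
from dataclasses import dataclass, field

@dataclass
class TermRecord:
    lemma: str                                            # macrostructure: lemma-list entry
    pattern: str = ""                                     # morphosyntactic pattern, e.g. "N+Adj"
    gender: str = ""                                      # grammatical gender
    formation: str = ""                                   # word-formation process
    hypernyms: list[str] = field(default_factory=list)    # semantic information
    equivalents: list[str] = field(default_factory=list)  # equivalents in another language
    definitions: list[str] = field(default_factory=list)
    synonyms: list[str] = field(default_factory=list)

# Database population: one draft record per extracted lemma, to be revised by a terminologist.
database = {t: TermRecord(lemma=t) for t in ["fonema", "morfema"]}
database["fonema"].hypernyms.append("unidad")
database["fonema"].equivalents.append("phoneme")
```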
2022
Terminology extraction using co-occurrence patterns as predictors of semantic relevance
Rogelio Nazar | David Lindemann
Proceedings of the Workshop on Terminology in the 21st century: many faces, many places
We propose a method for automatic term extraction based on a statistical measure that ranks term candidates according to their semantic relevance to a specialised domain. As a measure of relevance we use term co-occurrence, defined as the repeated instantiation of two terms in the same sentence, regardless of order and at variable distances. In this way, term candidates are ranked higher if they show a tendency to co-occur with a selected group of other units, as opposed to those showing more uniform distributions. No external resources are needed to apply the method, but performance improves when a pre-existing term list is provided. We present results of applying this method to a Spanish-English linguistics corpus, where it compares favourably with a standard method based on reference corpora.
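A minimal sketch of that scoring idea, assuming a seed term list stands in for the "selected group of other units"; the actual statistic used in the paper may differ.

```python
# Sketch: rank words by their tendency to share sentences with seed terms,
# normalised by overall frequency, so uniformly distributed words score low.
from collections import Counter

def rank_candidates(sentences, seed_terms):
    seed = set(seed_terms)
    freq, cooc = Counter(), Counter()
    for sent in sentences:
        tokens = set(sent.lower().split())   # a set: order and distance are ignored
        freq.update(tokens)
        if tokens & seed:                    # the sentence instantiates a seed term
            cooc.update(tokens - seed)
    return sorted(((cooc[w] / freq[w], w) for w in cooc), reverse=True)

sents = ["the phoneme inventory of the language",
         "a phoneme is a minimal sound unit",
         "the weather was pleasant today"]
print(rank_candidates(sents, {"phoneme", "morpheme"}))  # 'the' is down-ranked
```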
2017
Experiments in taxonomy induction in Spanish and French
Irene Renau | Rogelio Nazar | Rafael Marín
Proceedings of Language, Ontology, Terminology and Knowledge Structures Workshop (LOTKS 2017)
2016
A Taxonomy of Spanish Nouns, a Statistical Algorithm to Generate it and its Implementation in Open Source Code
Rogelio Nazar | Irene Renau
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this paper we describe our work in progress on the automatic development of a taxonomy of Spanish nouns, offer the Perl implementation we have so far, and discuss the problems that still need to be addressed. We designed a statistically-based taxonomy induction algorithm consisting of a combination of strategies that involve no explicit linguistic knowledge. Although all quantitative, the strategies we present are of different kinds. Some are based on the computation of distributional similarity coefficients, which identify pairs of sibling words or co-hyponyms, while others are based on asymmetric co-occurrence and identify pairs of parent-child words, i.e. hypernym-hyponym relations. A decision-making process is then applied to combine the results of the previous steps and, finally, to connect lexical units to a basic structure containing the most general categories of the language. We evaluate the quality of the taxonomy both manually and against Spanish Wordnet as a gold standard. We estimate an average of 89.07% precision and 25.49% recall when considering only the results which the algorithm presents with a high degree of certainty, or 77.86% precision and 33.72% recall when considering all results.
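The two kinds of strategies can be illustrated as follows. The paper's own code is in Perl; this Python sketch over toy data only approximates the two signals (sibling detection via distributional similarity, parent-child detection via asymmetric co-occurrence), not the published algorithm.

```python
# Toy approximation of the two quantitative signals; not the published algorithm.
from collections import Counter, defaultdict

def context_vectors(sentences):
    """Bag-of-words co-occurrence vector for every token."""
    vecs = defaultdict(Counter)
    for sent in sentences:
        tokens = sent.lower().split()
        for w in tokens:
            vecs[w].update(t for t in tokens if t != w)
    return vecs

def cosine(a, b):
    """Distributional similarity: high for sibling words (co-hyponyms)."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def asymmetry(vecs, x, y):
    """Positive when y 'covers' x more than x covers y: y as candidate hypernym."""
    px = vecs[x][y] / max(sum(vecs[x].values()), 1)  # share of x's contexts with y
    py = vecs[y][x] / max(sum(vecs[y].values()), 1)  # share of y's contexts with x
    return px - py

sents = ["the dog is an animal", "the cat is an animal", "the dog barked"]
vecs = context_vectors(sents)
print(cosine(vecs["dog"], vecs["cat"]))   # high: 'dog' and 'cat' are siblings
print(asymmetry(vecs, "dog", "animal"))   # positive: 'animal' as parent of 'dog'
```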
2014
An Exercise in Reuse of Resources: Adapting General Discourse Coreference Resolution for Detecting Lexical Chains in Patent Documentation
Nadjet Bouayad-Agha | Alicia Burga | Gerard Casamayor | Joan Codina | Rogelio Nazar | Leo Wanner
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The Stanford Coreference Resolution System (StCR) is a multi-pass, rule-based system that scored best in the CoNLL 2011 shared task on general discourse coreference resolution. We describe how the StCR has been adapted to the specific domain of patents and give some indications of how it can be adapted to other domains. We present a linguistic analysis of the patent domain and explain how we adapted the rules to it and expanded coreferences with lexical chains. A comparative evaluation shows an improvement of the coreference resolution system, indicating that (i) StCR is a valuable tool across different text genres, and (ii) specialized discourse NLP may significantly benefit from general discourse NLP research.
2012
Google Books N-gram Corpus used as a Grammar Checker
Rogelio Nazar | Irene Renau
Proceedings of the Second Workshop on Computational Linguistics and Writing (CL&W 2012): Linguistic and Cognitive Aspects of Document Creation and Document Engineering
Spell Checking in Spanish: The Case of Diacritic Accents
Jordi Atserias | Maria Fuentes | Rogelio Nazar | Irene Renau
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This article presents the problem of diacritic restoration (or diacritization) in the context of spell-checking, with the focus on an orthographically rich language such as Spanish. We argue that despite the large volume of work published on the topic of diacritization, currently available spell-checking tools have still not found a proper solution to the problem in those cases where both forms of a word are listed in the checker's dictionary. This is the case, for instance, when a word form exists with and without diacritics, such as continuo 'continuous' and continuó 'he/she/it continued', or when different diacritics make other word distinctions, as in continúo 'I continue'. We propose a very simple solution based on a word bigram model derived from correctly typed Spanish texts and evaluate the ability of this model to restore diacritics in artificial as well as real errors. The case of diacritics is only meant to be an example of the possible applications for this idea, yet we believe that the same method could be applied to other kinds of orthographic or even grammatical errors. Moreover, given that no explicit linguistic knowledge is required, the proposed model can be used with other languages provided that a large normative corpus is available.
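A toy sketch of such a bigram look-up; the counts below are invented stand-ins for frequencies drawn from a large corpus of correctly typed text, and the variant table would in practice come from the checker's dictionary.

```python
# Toy bigram disambiguation sketch; counts are invented stand-ins for
# frequencies obtained from a large normative corpus.
BIGRAMS = {("trabajo", "continuo"): 120,   # 'continuous work'
           ("ella", "continuó"): 310,      # 'she continued'
           ("yo", "continúo"): 45}         # 'I continue'
VARIANTS = {"continuo": ["continuo", "continuó", "continúo"]}  # dictionary-listed forms

def restore(prev_word, word):
    """Pick the diacritic variant best attested after prev_word;
    keep the typed form if no variant is attested in the model."""
    candidates = VARIANTS.get(word, [word])
    best = max(candidates, key=lambda c: BIGRAMS.get((prev_word, c), 0))
    return best if BIGRAMS.get((prev_word, best), 0) > 0 else word

print(restore("ella", "continuo"))  # -> 'continuó'
```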
2010
Combining Resources: Taxonomy Extraction from Multiple Dictionaries
Rogelio Nazar | Maarten Janssen
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The idea that dictionaries are a good source of (computational) information has been around for a long while, and the extraction of taxonomic information from them has been attempted several times. However, such information extraction was typically based on the systematic analysis of the text of a single dictionary. In this paper, we demonstrate how it is possible to extract taxonomic information without any analysis of the specific text, by comparing the same lexical entry in a number of different dictionaries. Counting word frequencies in the dictionary entries for the same word in different dictionaries leads to a surprisingly good recovery of taxonomic information, without the need for any syntactic analysis of the entries in question or any kind of language-specific treatment. As a case in point, we present an experiment extracting hypernymy relations from several Spanish dictionaries, measuring the effect that the number of dictionaries has on the results.
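The counting idea can be sketched in a few lines; the definitions and stopword list below are toy examples rather than the dictionaries actually used in the experiment.

```python
# Toy sketch of cross-dictionary counting: the hypernym tends to be a content
# word repeated across definitions of the same headword in different dictionaries.
from collections import Counter

STOPWORDS = {"de", "la", "el", "los", "las", "que", "un", "una", "y", "o", "en"}

def hypernym_candidates(definitions):
    """definitions: one definition string per dictionary for the same headword."""
    counts = Counter()
    for d in definitions:
        counts.update(set(d.lower().split()) - STOPWORDS)  # one vote per dictionary
    return counts.most_common()

defs_perro = ["mamífero doméstico de la familia de los cánidos",
              "animal mamífero doméstico",
              "mamífero carnívoro de la familia de los cánidos"]
print(hypernym_candidates(defs_perro))  # 'mamífero' (the hypernym) tops the list
```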
2008
Two-step flow in bilingual lexicon extraction from unrelated corpora
Rogelio Nazar | Leo Wanner | Jorge Vivaldi
Proceedings of the 12th Annual Conference of the European Association for Machine Translation
A Suite to Compile and Analyze an LSP Corpus
Rogelio Nazar | Jorge Vivaldi | Teresa Cabré
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
This paper presents a series of tools for the extraction of specialized corpora from the web and their subsequent analysis, mainly with statistical techniques. It is an integrated system of original as well as standard tools, with a modular design that facilitates its integration into different systems. The first part of the paper describes the original techniques, which are devoted to the categorization of documents as relevant or irrelevant to the corpus under construction, a document being considered relevant if it is a specialized document of the selected technical domain. Evaluation figures are provided for this original part, but not for the second part, the analysis of the corpus, which is composed of algorithms well known in the field of Natural Language Processing, such as KWIC search, measures of vocabulary richness, and the sorting of n-grams by frequency of occurrence or by measures of statistical association, distribution or similarity.
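Two of the standard analysis modules mentioned, KWIC search and n-gram sorting by frequency, could be sketched as follows; this is an illustrative sketch, not the suite's actual code.

```python
# Illustrative sketches of two well-known modules: KWIC search and
# n-gram sorting by frequency of occurrence.
from collections import Counter

def kwic(tokens, keyword, window=3):
    """Keyword-in-context lines with `window` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>25} [{tok}] {right}")
    return lines

def ngram_frequencies(tokens, n=2):
    """Sort n-grams by frequency of occurrence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

toks = "the corpus contains documents about the corpus domain".split()
print("\n".join(kwic(toks, "corpus")))
print(ngram_frequencies(toks).most_common(3))
```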