2017
Annotation of negation in the IULA Spanish Clinical Record Corpus
Montserrat Marimon | Jorge Vivaldi | Núria Bel
Proceedings of the Workshop Computational Semantics Beyond Events and Roles
This paper presents the IULA Spanish Clinical Record Corpus, a corpus of 3,194 sentences extracted from anonymized clinical records and manually annotated with negation markers and their scope. The corpus was conceived as a resource to support clinical text-mining systems, but it is also useful for other Natural Language Processing systems that handle clinical texts, such as automatic encoding of clinical records, diagnosis support, and term extraction, as well as for the study of clinical texts. The corpus is publicly available under a CC-BY-SA 3.0 license.
2016
Syntactic methods for negation detection in radiology reports in Spanish
Viviana Cotik | Vanesa Stricker | Jorge Vivaldi | Horacio Rodríguez
Proceedings of the 15th Workshop on Biomedical Natural Language Processing
2014
Boosting the creation of a treebank
Blanca Arias | Núria Bel | Mercè Lorente | Montserrat Marimón | Alba Milà | Jorge Vivaldi | Muntsa Padró | Marina Fomicheva | Imanol Larrea
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this paper we present the results of an ongoing experiment on bootstrapping a treebank for Catalan by using a dependency parser trained on Spanish sentences. In order to save time and cost, our approach was to profit from the typological similarities between Catalan and Spanish to create a first Catalan data set quickly by (i) automatically annotating with a de-lexicalized Spanish parser, (ii) manually correcting the parses, and (iii) using the corrected Catalan sentences to train a Catalan parser. The results showed that about 1,000 parsed sentences are required to train a Catalan parser; these were achieved in 4 months with 2 annotators.
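To make the three-step loop concrete, here is a minimal Python sketch of the bootstrapping workflow; the three functions are hypothetical placeholders standing in for the actual parser and annotation pipeline, not components of any real toolkit.

```python
# Minimal sketch of the cross-lingual bootstrapping loop described above.
# parse_delexicalized(), manually_correct() and train_parser() are
# hypothetical placeholders, not functions from an actual toolkit.

def parse_delexicalized(sentences):
    """Annotate Catalan sentences with a de-lexicalized parser trained
    on Spanish (placeholder: returns trivial parses)."""
    return [{"tokens": s.split(), "heads": [0] * len(s.split())}
            for s in sentences]

def manually_correct(parses):
    """Stand-in for the human correction step."""
    return parses

def train_parser(treebank):
    """Stand-in for training a Catalan parser on the corrected parses."""
    return lambda s: {"tokens": s.split(), "heads": [0] * len(s.split())}

catalan_sentences = ["El gat dorm", "La nena llegeix un llibre"]
raw_parses = parse_delexicalized(catalan_sentences)   # step (i)
treebank = manually_correct(raw_parses)               # step (ii)
catalan_parser = train_parser(treebank)               # step (iii)
print(catalan_parser("El gos corre"))
```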
The IULA Spanish LSP Treebank
Montserrat Marimon | Núria Bel | Beatriz Fisas | Blanca Arias | Silvia Vázquez | Jorge Vivaldi | Carlos Morell | Mercè Lorente
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper presents the IULA Spanish LSP Treebank, a dependency treebank of over 41,000 sentences from different domains (Law, Economy, Computing Science, Environment, and Medicine), developed in the framework of the European project METANET4U. Dependency annotations in the treebank were automatically derived from manually selected parses produced by an HPSG grammar by a deterministic conversion algorithm. The algorithm used the identifiers of grammar rules to identify the heads, the dependents, and some dependency types that were directly transferred onto the dependency structure (e.g., subject, specifier, and modifier), and the identifiers of the lexical entries to identify the argument-related dependency functions (e.g., direct object, indirect object, and oblique complement). The treebank is accessible with a browser that provides concordance-based search functions and delivers the results in two formats: (i) a column-based format, in the style of the CoNLL-2006 shared task, and (ii) a dependency graph, where each dependency relation is drawn as an oriented arrow from the dependent node to the head node. The IULA Spanish LSP Treebank is the first technical corpus of Spanish annotated at the surface syntactic level following dependency grammar theory. The treebank has been made publicly and freely available from the META-SHARE platform with a Creative Commons CC-BY licence.
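As an illustration of the first export format, the following sketch reads a tiny sentence in CoNLL-2006 (CoNLL-X) style columns and prints each dependency as a dependent-to-head arrow, mirroring the graph view; the example sentence and tag values are invented, not taken from the treebank.

```python
# Reading a CoNLL-2006 (CoNLL-X) style sentence: ten tab-separated
# columns per token. The sentence below is an invented example.
conll = """\
1\tEl\tel\td\tda\tnum=s|gen=m\t2\tspec\t_\t_
2\tjuez\tjuez\tn\tnc\tnum=s|gen=m\t3\tsubj\t_\t_
3\tdictó\tdictar\tv\tvm\tnum=s\t0\troot\t_\t_
4\tsentencia\tsentencia\tn\tnc\tnum=s|gen=f\t3\tdobj\t_\t_
"""

FIELDS = ["id", "form", "lemma", "cpostag", "postag",
          "feats", "head", "deprel", "phead", "pdeprel"]

tokens = [dict(zip(FIELDS, line.split("\t")))
          for line in conll.splitlines() if line.strip()]

# Print each relation as an arrow from the dependent to its head.
for tok in tokens:
    head = tokens[int(tok["head"]) - 1]["form"] if tok["head"] != "0" else "ROOT"
    print(f'{tok["form"]} --{tok["deprel"]}--> {head}')
```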
2012
Iula2Standoff: a tool for creating standoff documents for the IULACT
Carlos Morell | Jorge Vivaldi | Núria Bel
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Due to the increase in the number and depth of analyses required over text, such as entity recognition, POS tagging, and syntactic analysis, inline annotation has become impractical. In Natural Language Processing (NLP) some emphasis has been placed on finding an annotation method that solves this problem. One possibility is standoff annotation. With this annotation style it is possible to add new levels of annotation without disturbing existing ones, with minimal knock-on effects. This style of annotation makes it easier to add further linguistic information and opens up more possibilities for sharing textual resources. In this paper we present a tool, developed in the framework of the European project Metanet4u (Enhancing the European Linguistic Infrastructure, GA 270893), for creating a multi-layered XML annotation scheme based on the GrAF proposal for standoff annotations.
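To make the standoff idea concrete, here is a minimal Python sketch in the GrAF spirit, where annotation layers point into an untouched primary text by character offsets; the element names and the negation layer are illustrative assumptions, not the actual Iula2Standoff schema.

```python
# Minimal illustration of standoff annotation: the primary text is left
# untouched and each annotation layer refers to it by character offsets.
# Element names are illustrative, not the real Iula2Standoff schema.
from xml.sax.saxutils import escape

text = "La paciente niega dolor torácico."

# One POS layer and one hypothetical negation layer over the same text.
layers = {
    "pos": [(0, 2, "det"), (3, 11, "noun"), (12, 17, "verb")],
    "neg": [(12, 17, "negation-marker"), (18, 32, "negation-scope")],
}

for name, spans in layers.items():
    print(f'<layer name="{name}">')
    for start, end, label in spans:
        print(f'  <region from="{start}" to="{end}" '
              f'label="{label}"/>  <!-- {escape(text[start:end])} -->')
    print("</layer>")
```

A new layer can be added later without touching the text or the existing layers, which is the knock-on-effect advantage described above.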
Using Wikipedia to Validate the Terminology found in a Corpus of Basic Textbooks
Jorge Vivaldi | Luis Adrián Cabrera-Diego | Gerardo Sierra | María Pozzi
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
A scientific vocabulary is a set of terms that designate scientific concepts. This set of lexical units can be used in several applications, ranging from the development of terminological dictionaries and machine translation systems to the development of lexical databases and beyond. Even though automatic term recognition systems have existed since the 1980s, this process is still mainly done by hand, since manual work generally yields more accurate results, although it takes more time and comes at a higher cost. Some of the reasons for this are the fairly low precision and recall obtained by automatic systems, the domain dependence of existing tools, and the lack of available semantic knowledge needed to validate their results. In this paper we present a method that uses Wikipedia as a semantic knowledge resource to validate term candidates from a set of scientific textbooks used in the last three years of high school for mathematics, health education, and ecology. The proposed method may be applied to any domain or language (assuming there is minimal coverage by Wikipedia).
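A hedged sketch of the core validation idea follows: a term candidate gains support if Wikipedia has a page for it, directly or via a redirect. It uses the standard MediaWiki query API and illustrates the general approach, not the authors' actual validation pipeline.

```python
# Checking whether Wikipedia knows a term candidate, via the standard
# MediaWiki query API. This is an illustration of the general idea,
# not the paper's actual validation method.
import json
import urllib.parse
import urllib.request

API = "https://es.wikipedia.org/w/api.php"

def wikipedia_has_page(term: str) -> bool:
    params = urllib.parse.urlencode({
        "action": "query",
        "titles": term,
        "redirects": 1,   # follow redirects to the canonical page
        "format": "json",
    })
    req = urllib.request.Request(
        f"{API}?{params}",
        headers={"User-Agent": "term-validation-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        pages = json.load(resp)["query"]["pages"]
    # The API reports missing titles with a negative page id.
    return all(not pid.startswith("-") for pid in pages)

for candidate in ["ecosistema", "número primo", "xyzzy-no-term"]:
    print(candidate, "->", wikipedia_has_page(candidate))
```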
The IULA Treebank
Montserrat Marimon | Beatriz Fisas | Núria Bel | Marta Villegas | Jorge Vivaldi | Sergi Torner | Mercè Lorente | Silvia Vázquez
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper describes ongoing work on the construction of a new treebank for Spanish, the IULA Treebank. This new resource will contain about 60,000 richly annotated sentences as an extension of the already existing IULA Technical Corpus, which is only PoS-tagged. In this paper we focus on describing the work done to define the annotation process and the treebank design principles. We report on how the framework used, the DELPH-IN processing framework, has been crucial to the design principles and to the bootstrapping strategy followed, especially as regards the use of stochastic modules for reducing parsing overgeneration. We also report on the different evaluation experiments carried out to guarantee the quality of the results already available.
2010
Automatic Summarization Using Terminological and Semantic Resources
Jorge Vivaldi | Iria da Cunha | Juan-Manuel Torres-Moreno | Patricia Velázquez-Morales
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
This paper presents a new algorithm for automatic summarization of specialized texts combining terminological and semantic resources: a term extractor and an ontology. The term extractor provides the list of terms present in the text together with their corresponding termhood. The ontology is used to calculate the semantic similarity between the terms found in the main body and those present in the document title. The general idea is to obtain a relevance score for each sentence, taking into account both the termhood of the terms found in that sentence and the similarity between those terms and the terms present in the title of the document. The sentences with the highest scores are chosen to be part of the final summary. We evaluate the algorithm with ROUGE, comparing the resulting summaries with those of other summarizers. The sentence selection algorithm was also tested as part of a standalone summarizer. In both cases it obtains quite good results, although there is still room for improvement.
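The scoring idea can be sketched in a few lines: each sentence is scored from the termhood of its terms plus their semantic similarity to the title terms, and the top-scoring sentences form the summary. The termhood values and the similarity function below are toy stand-ins for the term extractor and the ontology, not the paper's actual resources.

```python
# Toy sketch of the sentence-relevance scoring described above.
termhood = {"summarization": 0.9, "ontology": 0.8,
            "term extractor": 0.85, "sentence": 0.3}

def similarity(term_a: str, term_b: str) -> float:
    """Toy stand-in for ontology-based semantic similarity."""
    return 1.0 if term_a == term_b else 0.2

title_terms = ["summarization", "ontology"]

def score(sentence_terms: list[str]) -> float:
    # Combine each term's termhood with its best similarity to a title term.
    return sum(termhood.get(t, 0.0) +
               max(similarity(t, tt) for tt in title_terms)
               for t in sentence_terms)

sentences = {
    "S1": ["summarization", "term extractor"],
    "S2": ["sentence"],
}
# The highest-scoring sentences would be selected for the summary.
for sid, terms in sorted(sentences.items(),
                         key=lambda kv: score(kv[1]), reverse=True):
    print(sid, round(score(terms), 2))
```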
Finding Domain Terms using Wikipedia
Jorge Vivaldi | Horacio Rodríguez
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In this paper we present a new approach to obtaining the terminology of a given domain using the category and page structures of Wikipedia in a language-independent way. Our approach consists basically, for each domain, of navigating the category graph of Wikipedia starting from the root nodes associated with the domain. A heavy filtering mechanism is applied to prevent, as much as possible, the inclusion of spurious categories. For each selected category, all the pages belonging to it are then recovered and filtered. This procedure is iterated several times until convergence is achieved. Both category names and page names are considered candidates for the terminology of the domain. This approach has been applied to three broad-coverage domains (astronomy, chemistry, and medicine) and two languages (English and Spanish), showing promising performance.
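A minimal sketch of the traversal follows: starting from the domain's root categories, walk the category graph, filter out spurious categories, and collect category and page names as term candidates. The toy graph and the filter below are invented for illustration; the real system queries Wikipedia and uses much heavier filtering.

```python
# Sketch of the iterative category-graph traversal described above.
from collections import deque

subcategories = {
    "Medicine": ["Diseases", "Medical specialties", "Medicine stubs"],
    "Diseases": ["Infectious diseases"],
    "Medical specialties": ["Cardiology"],
}
pages = {
    "Diseases": ["Disease", "Epidemic"],
    "Infectious diseases": ["Influenza"],
    "Cardiology": ["Myocardial infarction"],
}

def looks_spurious(category: str) -> bool:
    """Toy filter; the real system applies a heavy filtering mechanism."""
    return "stub" in category.lower()

def collect_terms(root: str) -> set[str]:
    terms, seen, queue = set(), {root}, deque([root])
    while queue:
        cat = queue.popleft()
        terms.add(cat)                      # category names are candidates
        terms.update(pages.get(cat, []))    # so are page names
        for sub in subcategories.get(cat, []):
            if sub not in seen and not looks_spurious(sub):
                seen.add(sub)
                queue.append(sub)
    return terms

print(sorted(collect_terms("Medicine")))
```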
2008
A Suite to Compile and Analyze an LSP Corpus
Rogelio Nazar | Jorge Vivaldi | Teresa Cabré
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
This paper presents a series of tools for the extraction of specialized corpora from the web and their subsequent analysis, mainly with statistical techniques. It is an integrated system of original as well as standard tools, and it has a modular design that facilitates its integration into different systems. The first part of the paper describes the original techniques, which are devoted to categorizing documents as relevant or irrelevant to the corpus under construction, where a document is considered relevant if it is a specialized document of the selected technical domain. Evaluation figures are provided for this original part, but not for the second part involving the analysis of the corpus, which is composed of algorithms well known in the field of Natural Language Processing, such as KWIC search, measures of vocabulary richness, and the sorting of n-grams by frequency of occurrence or by measures of statistical association, distribution, or similarity.
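As an example of the standard analysis tools mentioned, here is a small self-contained KWIC (keyword-in-context) search; the sample text is invented and the windowing choices are arbitrary.

```python
# A tiny KWIC (keyword-in-context) search: show each occurrence of a
# keyword aligned with a few words of left and right context.
def kwic(text: str, keyword: str, context: int = 3):
    tokens = text.split()
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,;") == keyword.lower():
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            yield f"{left:>35} [{tok}] {right}"

sample = ("The corpus was compiled from the web. A specialized corpus "
          "requires filtering; each corpus document is categorized.")
for line in kwic(sample, "corpus"):
    print(line)
```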
Turning a Term Extractor into a new Domain: first Experiences
Jorge Vivaldi | Anna Joan | Mercè Lorente
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Computational terminology has evolved notably since the advent of computers. Regarding the extraction of terms in particular, a large number of resources have been developed: from very general tools to much more specific acquisition methodologies. Such acquisition methodologies range from using simple linguistic patterns or frequency-counting methods to much more evolved strategies combining morphological, syntactic, semantic, and contextual information. Researchers usually develop a term extractor to be applied to a given domain and, in some cases, some testing of the tool's performance is also done. Afterwards, such tools may also be applied to other domains, though frequently no additional test is made in such cases. Usually, the application of a given tool to another domain does not require any tuning. Recently, some tools using semantic resources have been developed. In such cases, either a domain-specific or a generic resource may be used. In the latter case, some tuning may be necessary in order to adapt the tool to a new domain. In this paper, we present the work started in order to adapt YATE, a term extractor that uses a generic resource (EuroWordNet) and that has already been developed for the medical domain, to the economic domain.
Two-step flow in bilingual lexicon extraction from unrelated corpora
Rogelio Nazar | Leo Wanner | Jorge Vivaldi
Proceedings of the 12th Annual Conference of the European Association for Machine Translation
2006
SKELETON: Specialised knowledge retrieval on the basis of terms and conceptual relations
Judit Feliu | Jorge Vivaldi | M. Teresa Cabré
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
The main goal of this paper is to present a first approach to the automatic detection of conceptual relations between two terms in specialized written text. Previous experiments based on manual analysis led the authors to implement an automatic query strategy that combines the term candidates proposed by an extractor with a list of verbal syntactic patterns used to refine the relations. The next step in this research will be to integrate the results into the term extractor in order to obtain more constrained pieces of information that can be directly reused for the ontology-building task.
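The query strategy can be illustrated with a toy sketch that matches sentences against verbal patterns mapped to relation types, yielding relation instances between two term candidates; the patterns and sentences below are invented examples, not the authors' actual pattern list.

```python
# Toy illustration: detect conceptual relations between two terms via
# verbal syntactic patterns. Patterns and sentences are invented.
import re

# Verbal patterns signalling a relation, each mapped to a relation type.
patterns = {
    r"(\w[\w ]*?) consta de (\w[\w ]*)": "part_of",
    r"(\w[\w ]*?) causa (\w[\w ]*)": "cause",
}

sentences = [
    "El gen consta de exones e intrones",
    "La mutación causa la enfermedad",
]

for sentence in sentences:
    for pattern, relation in patterns.items():
        m = re.search(pattern, sentence)
        if m:
            # Emit the relation with the two matched term candidates.
            print(f"{relation}({m.group(1).strip()!r}, {m.group(2).strip()!r})")
```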
2004
The GENOMA-KB Project: Towards the Integration of Concepts, Terms, Textual Corpora and Entities
M. Teresa Cabré | Carme Bach | Rosa Estopà | Judit Feliu | Gemma Martínez | Jorge Vivaldi
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
Automatically Selecting Domain Markers for Terminology Extraction
Jorge Vivaldi | Horacio Rodríguez
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
Some approaches to automatic terminology extraction from corpora rely on existing semantic resources to guide the detection of terms. Most of these systems exploit specialized resources, like UMLS in the medical domain, while a few try to profit from general-purpose semantic resources, like EuroWordNet (EWN). As the term extraction task is clearly domain-dependent, when a general-purpose resource without specific domain information is used, we need a way of attaching domain information to the units of the resource. For big resources it is desirable that this semantic enrichment can be carried out automatically. Given a specific domain, our proposal aims to detect in EWN those units that can be considered domain markers (DMs). We can define a DM as an EWN entry whose attached strings belong to the domain, as well as the variants of all its descendants through the hyponymy relation. The procedure we propose in this paper is fully automatic and, a priori, domain-independent. The only external knowledge it uses is a set of terms, i.e., an external vocabulary considered to have at least one sense belonging to the domain.
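The DM definition can be sketched as a recursive check: an entry qualifies when its own variants, and those of all its hyponymy descendants, belong to the domain vocabulary. The tiny EWN-like hierarchy below is invented for illustration and is not the paper's actual procedure.

```python
# Toy sketch of the domain-marker (DM) test over an invented
# EWN-like hyponymy hierarchy.
hyponyms = {
    "body_part": ["organ", "limb"],
    "organ": ["heart", "liver"],
    "limb": ["arm", "leg"],
}
variants = {
    "body_part": ["body part"], "organ": ["organ"], "limb": ["limb"],
    "heart": ["heart"], "liver": ["liver"], "arm": ["arm"], "leg": ["leg"],
}
domain_vocabulary = {"body part", "organ", "heart", "liver", "limb"}

def is_domain_marker(entry: str) -> bool:
    # The entry's own variants must belong to the domain vocabulary...
    if any(v not in domain_vocabulary for v in variants[entry]):
        return False
    # ...and so must those of all its hyponymy descendants.
    return all(is_domain_marker(h) for h in hyponyms.get(entry, []))

for entry in ["organ", "limb", "body_part"]:
    print(entry, "->", is_domain_marker(entry))
```

Here "organ" qualifies because both its hyponyms are domain terms, while "limb" and hence "body_part" fail because "arm" and "leg" fall outside the toy vocabulary.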
2002
Towards an Ontology for a Human Genome Knowledge Base
Judit Feliu | Jorge Vivaldi | M. Teresa Cabré
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)