VERTa: a Linguistically-motivated Metric at the WMT15 Metrics Task
Elisabet Comelles
Jordi Atserias
Proceedings of the Tenth Workshop on Statistical Machine Translation
Proceedings of the First Workshop on Computing News Storylines
Tommaso Caselli
Marieke van Erp
Anne-Lyse Minard
Mark Finlayson
Ben Miller
Jordi Atserias
Alexandra Balahur
Piek Vossen
Proceedings of the First Workshop on Computing News Storylines
VERTa participation in the WMT14 Metrics Task
Elisabet Comelles
Jordi Atserias
Proceedings of the Ninth Workshop on Statistical Machine Translation
VERTa: Facing a Multilingual Experience of a Linguistically-based MT Evaluation
Elisabet Comelles
Jordi Atserias
Victoria Arranz
Irene Castellón
Jordi Sesé
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
There are several MT metrics used to evaluate translation into Spanish, although most of them use partial or little linguistic information. In this paper we present the multilingual capability of VERTa, an automatic MT metric that combines linguistic information at lexical, morphological, syntactic and semantic levels. In the experiments conducted we aim at identifying those linguistic features that prove the most effective to evaluate adequacy in Spanish segments. This linguistic information is tested both as independent modules (to observe what each type of feature provides) and in a combinatory fashion (where different kinds of information interact with each other). This allows us to extract the optimal combination. In addition we compare these linguistic features to those used in previous versions of VERTa aimed at evaluating adequacy for English segments. Finally, experiments show that VERTa can be easily adapted to languages other than English and that its collaborative approach correlates better with human judgements on adequacy than other well-known metrics.
FBM: Combining lexicon-based ML and heuristics for Social Media Polarities
Carlos Rodríguez-Penagos
Jordi Atserias Batalla
Joan Codina-Filbà
David García-Narbona
Jens Grivolla
Patrik Lambert
Roser Saurí
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)
Spell Checking in Spanish: The Case of Diacritic Accents
Jordi Atserias
Maria Fuentes
Rogelio Nazar
Irene Renau
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This article presents the problem of diacritic restoration (or diacritization) in the context of spell-checking, with the focus on an orthographically rich language such as Spanish. We argue that despite the large volume of work published on the topic of diacritization, currently available spell-checking tools have still not found a proper solution to the problem in those cases where both forms of a word are listed in the checker's dictionary. This is the case, for instance, when a word form exists with and without diacritics, such as continuo 'continuous' and continuó 'he/she/it continued', or when different diacritics make other word distinctions, as in continúo 'I continue'. We propose a very simple solution based on a word bigram model derived from correctly typed Spanish texts and evaluate the ability of this model to restore diacritics in artificial as well as real errors. The case of diacritics is only meant to be an example of the possible applications for this idea, yet we believe that the same method could be applied to other kinds of orthographic or even grammatical errors. Moreover, given that no explicit linguistic knowledge is required, the proposed model can be used with other languages provided that a large normative corpus is available.
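The bigram idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy corpus, the use of left context only, and the unigram tie-break are all assumptions made for illustration, and the input is assumed to use precomposed (NFC) characters.

```python
from collections import Counter, defaultdict

# Map accented vowels to their bare forms; "ñ" is deliberately left alone.
STRIP = str.maketrans("áéíóúü", "aeiouu")


def strip_diacritics(word):
    """Return the word with acute accents and diaeresis removed."""
    return word.translate(STRIP)


def train(sentences):
    """Count unigrams and word bigrams in correctly typed text (hypothetical helper)."""
    unigrams = Counter()
    bigrams = Counter()
    variants = defaultdict(set)  # stripped form -> attested spellings
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        for prev, word in zip(tokens, tokens[1:]):
            unigrams[word] += 1
            bigrams[(prev, word)] += 1
            variants[strip_diacritics(word)].add(word)
    return unigrams, bigrams, variants


def restore(sentence, model):
    """Pick, for each token, the attested variant best supported by the previous word."""
    unigrams, bigrams, variants = model
    prev = "<s>"
    restored = []
    for token in sentence.lower().split():
        # Unknown words have no attested variants, so they are kept unchanged.
        candidates = variants.get(strip_diacritics(token), {token})
        # Assumption: rank by bigram count, then by unigram count as a tie-break.
        best = max(sorted(candidates),
                   key=lambda c: (bigrams[(prev, c)], unigrams[c]))
        restored.append(best)
        prev = best
    return " ".join(restored)


# Toy normative corpus, invented for this example.
TOY_CORPUS = [
    "él continuó el trabajo",
    "yo continúo el trabajo",
    "un movimiento continuo",
]

if __name__ == "__main__":
    model = train(TOY_CORPUS)
    print(restore("yo continuo el trabajo", model))
```

With a realistic corpus the counts would need smoothing and probably right-hand context as well; the sketch only shows how bigram counts can choose among forms that a dictionary lookup alone cannot separate.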
VERTa: Linguistic features in MT evaluation
Elisabet Comelles
Jordi Atserias
Victoria Arranz
Irene Castellón
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
In the last decades, a wide range of automatic metrics that use linguistic knowledge has been developed. Some of them are based on lexical information, such as METEOR; others rely on the use of syntax, either using constituent or dependency analysis; and others use semantic information, such as Named Entities and semantic roles. All these metrics work at a specific linguistic level, but some researchers have tried to combine linguistic information, either by combining several metrics following a machine-learning approach or focusing on the combination of a wide variety of metrics in a simple and straightforward way. However, little research has been conducted on how to combine linguistic features from a linguistic point of view. In this paper we present VERTa, a metric which aims at using and combining a wide variety of linguistic features at lexical, morphological, syntactic and semantic levels. We provide a description of the metric and report some preliminary experiments which will help us to discuss the use and combination of certain linguistic features in order to improve the metric performance.
Active Learning for Building a Corpus of Questions for Parsing
Jordi Atserias
Giuseppe Attardi
Maria Simi
Hugo Zaragoza
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
This paper describes how we built a dependency Treebank for questions. The questions for the Treebank were drawn from the TREC 10 QA task and from Yahoo! Answers. One use of the corpus is to train a dependency parser that achieves good accuracy on questions without hurting its overall accuracy. We also explore active learning techniques to determine a suitable size for a corpus of questions, so as to achieve adequate accuracy while minimizing the annotation effort.
Complete and Consistent Annotation of WordNet using the Top Concept Ontology
Javier Álvez
Jordi Atserias
Jordi Carrera
Salvador Climent
Egoitz Laparra
Antoni Oliver
German Rigau
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
This paper presents the complete and consistent ontological annotation of the nominal part of WordNet. The annotation has been carried out using the semantic features defined in the EuroWordNet Top Concept Ontology and made available to the NLP community. Up to now only an initial core set of 1,024 synsets, the so-called Base Concepts, was ontologized in such a way. The work has been achieved by following a methodology based on an iterative and incremental expansion of the initial labeling through the hierarchy while setting inheritance blockage points. Since this labeling has been set on the EuroWordNet Interlingual Index (ILI), it can also be used to populate any other wordnet linked to it through a simple porting process. This feature-annotated WordNet is intended to be useful for a large number of semantic NLP tasks and for testing for the first time componential analysis on real environments. Moreover, the quantitative analysis of the work shows that more than 40% of the nominal part of WordNet is involved in structure errors or inadequacies.
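The expansion of an initial labeling through the hierarchy, with inheritance blockage points, can be pictured with the following Python sketch. It is only an illustration under stated assumptions: the abstract does not specify the procedure, so the function name, the data layout, the toy synsets, and the reading of a blockage point as "this feature stops being inherited at this synset and below" are all hypothetical.

```python
from collections import deque


def propagate_features(hyponyms, seed_features, blockage):
    """Expand seed labels down a hyponymy hierarchy (hypothetical helper).

    hyponyms:      dict mapping a synset to the list of its direct hyponyms
    seed_features: dict mapping a synset to its manually assigned features
    blockage:      set of (synset, feature) pairs; the feature is assumed
                   not to be inherited by that synset or anything below it
    """
    features = {synset: set(labels) for synset, labels in seed_features.items()}
    queue = deque(seed_features)
    while queue:
        synset = queue.popleft()
        for child in hyponyms.get(synset, []):
            inherited = {f for f in features.get(synset, set())
                         if (child, f) not in blockage}
            current = features.setdefault(child, set())
            # Revisit the child only when it gained something new, so the
            # loop terminates even when a synset has several hypernyms.
            if not inherited <= current:
                current |= inherited
                queue.append(child)
    return features


# Toy hierarchy, invented for this example.
TOY_HYPONYMS = {
    "creature": ["animal", "imaginary_being"],
    "animal": ["dog"],
    "imaginary_being": ["dragon"],
}
TOY_SEEDS = {"creature": {"Living"}}
TOY_BLOCKAGE = {("imaginary_being", "Living")}

if __name__ == "__main__":
    result = propagate_features(TOY_HYPONYMS, TOY_SEEDS, TOY_BLOCKAGE)
    print(sorted(result.get("dog", set())))
    print(sorted(result.get("dragon", set())))
```

In this toy run the label reaches the synsets under "animal" but stops at the blocked branch, which is the behaviour the sketch assumes a blockage point is meant to provide.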
Semantically Annotated Snapshot of the English Wikipedia
Jordi Atserias
Hugo Zaragoza
Massimiliano Ciaramita
Giuseppe Attardi
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
This paper describes SW1, the first version of a semantically annotated snapshot of the English Wikipedia. In recent years Wikipedia has become a valuable resource for both the Natural Language Processing (NLP) community and the Information Retrieval (IR) community. Although NLP technology for processing Wikipedia already exists, not all researchers and developers have the computational resources to process such a volume of information. Moreover, the use of different versions of Wikipedia processed differently might make it difficult to compare results. The aim of this work is to provide easy access to syntactic and semantic annotations for researchers of both NLP and IR communities by building a reference corpus to homogenize experiments and make results comparable. These resources, a semantically annotated corpus and an entity-containment-derived graph, are licensed under the GNU Free Documentation License and available from http://www.yr-bcn.es/semanticWikipedia
FreeLing 1.3: Syntactic and semantic services in an open-source NLP library
J. Atserias
B. Casas
E. Comelles
M. González
L. Padró
M. Padró
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper describes version 1.3 of the FreeLing suite of NLP tools. FreeLing was first released in February 2004 providing morphological analysis and PoS tagging for Catalan, Spanish, and English. From then on, the package has been improved and enlarged to cover more languages (i.e. Italian and Galician) and offer more services: Named entity recognition and classification, chunking, dependency parsing, and WordNet-based semantic annotation. FreeLing is not conceived as an end-user oriented tool, but as a library on top of which powerful NLP applications can be developed. Nevertheless, sample interface programs are provided, which can be straightforwardly used as fast, flexible, and efficient corpus processing tools. A remarkable feature of FreeLing is that it is distributed under a free-software LGPL license, thus enabling any developer to adapt the package to their needs in order to get the most suitable behaviour for the application being developed.
Automatic Acquisition of Sense Examples Using ExRetriever
Juan Fernández
Mauro Castillo
German Rigau
Jordi Atserias
Jordi Turmo
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
Towards the Meaning Top Ontology: Sources of Ontological Meaning
Jordi Atserias
Salvador Climent
German Rigau
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
Cross-Language Acquisition of Semantic Models for Verbal Predicates
Jordi Atserias
Bernardo Magnini
Octavian Popescu
Eneko Agirre
Aitziber Atutxa
German Rigau
John Carroll
Rob Koeling
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
Spanish WordNet 1.6: Porting the Spanish Wordnet Across Princeton Versions
Jordi Atserias
Luís Villarejo
German Rigau
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
Combining Unsupervised Lexical Knowledge Methods for Word Sense Disambiguation
German Rigau
Jordi Atserias
Eneko Agirre
35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics