2018
Paraphrastic Variance between European and Brazilian Portuguese
Anabela Barreiro | Cristina Mota
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)
This paper presents a methodology to extract a paraphrase database for the European and Brazilian varieties of Portuguese, and discusses a set of paraphrastic categories of multiwords and phrasal units, such as the compounds “toda a gente” versus “todo o mundo” ‘everybody’ or the gerundive constructions [estar a + V-Inf] versus [ficar + V-Ger] (e.g., “estive a observar” | “fiquei observando” ‘I was observing’), which are extremely relevant to high-quality paraphrasing. The variants were manually aligned in the e-PACT corpus, using the CLUE-Aligner tool. The methodology, inspired by the Logos Model, focuses on a semantico-syntactic analysis of each paraphrastic unit and constitutes a subset of the Gold-CLUE-Paraphrases. The construction of a larger dataset of paraphrastic contrasts among the distinct varieties of the Portuguese language is indispensable for variety adaptation, i.e., for dealing with the cultural, linguistic and stylistic differences between them, making it possible to convert texts (semi-)automatically from one variety into another, a key function in paraphrasing systems. This topic represents an interesting new line of research with valuable applications in language learning, language generation, question-answering, summarization, and machine translation, among others. The set of paraphrastic units is the first resource of its kind for Portuguese to become available to the scientific community for research purposes.
2016
Port4NooJ v3.0: Integrated Linguistic Resources for Portuguese NLP
Cristina Mota | Paula Carvalho | Anabela Barreiro
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper introduces Port4NooJ v3.0, the latest version of the Portuguese module for NooJ, highlights its main features, and details its three main new components: (i) a lexicon-grammar based dictionary of 5,177 human intransitive adjectives, and a set of local grammars that use the distributional properties of those adjectives for paraphrasing, (ii) a polarity dictionary with 9,031 entries for sentiment analysis, and (iii) a set of priority dictionaries and local grammars for named entity recognition. These new components were derived and/or adapted from publicly available resources. The Port4NooJ v3.0 resource is innovative in terms of the specificity of the linguistic knowledge it incorporates. The dictionary is bilingual Portuguese-English, and the semantico-syntactic information assigned to each entry validates the linguistic relation between the terms in both languages. These characteristics, which cannot be found in any other public resource for Portuguese, make it a valuable resource for translation and paraphrasing. The paper presents the current statistics and describes the different complementary and synergic components and integration efforts.
2012
Págico: Evaluating Wikipedia-based information retrieval in Portuguese
Cristina Mota | Alberto Simões | Cláudia Freitas | Luís Costa | Diana Santos
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
How do people behave in their everyday information-seeking tasks, which often involve Wikipedia? Are there systems which can help them, or do a similar job? In this paper we describe Págico, an evaluation contest with the main purpose of fostering research in these topics. We describe its motivation, the collection of documents created, the evaluation setup, the topics chosen and how they were selected, the participation, as well as the measures used for evaluation and the gathered resources. The task, which lies between information retrieval and question answering, can be further described as answering questions related to Portuguese-speaking culture in the Portuguese Wikipedia, across a number of different themes and from different geographic and temporal angles. This initiative allowed us to create interesting datasets and perform some assessment of Wikipedia, while also improving a public-domain open-source system for further Wikipedia-based evaluations. In the paper, we provide examples of questions, report the results obtained by the participants, and provide some discussion on complex issues.
2010
Second HAREM: Advancing the State of the Art of Named Entity Recognition in Portuguese
Cláudia Freitas | Cristina Mota | Diana Santos | Hugo Gonçalo Oliveira | Paula Carvalho
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In this paper, we present Second HAREM, the second edition of an evaluation campaign for Portuguese, addressing named entity recognition (NER). This second edition also included two new tracks: the recognition and normalization of temporal entities (proposed by a group of participants, and hence not covered in this paper) and ReRelEM, the detection of semantic relations between named entities. We summarize the setup of Second HAREM by showing the preserved distinctive features and discussing the changes compared to the first edition. Furthermore, we present the main results achieved and describe the available resources and tools developed under this evaluation, namely, (i) the golden collections, i.e. a set of documents whose named entities and semantic relations between those entities were manually annotated, (ii) the Second HAREM collection (which contains the unannotated version of the golden collection), as well as the participating systems' results on it, (iii) the scoring tools, and (iv) SAHARA, a Web application that allows interactive evaluation. We end the paper by offering some remarks about what was learned.
Experiments in Human-computer Cooperation for the Semantic Annotation of Portuguese Corpora
Diana Santos | Cristina Mota
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In this paper, we present a system to aid human annotation of semantic information in the scope of the project AC/DC, called corte-e-costura. This system leverages the human annotation effort by providing the annotator with a simple system that applies rules incrementally. Our goal was twofold: first, to develop an easy-to-use system that required a minimum of learning on the part of the linguist; second, one that provided a straightforward way of checking the results obtained, in order to immediately evaluate the results of the rules devised. After explaining the motivation for its development from scratch, we present the current status of the AC/DC project and provide a quantitative description of its material with respect to semantic annotation. We then present the corte-e-costura system in detail, providing the result of our first experiments with the semantic fields of colour and clothing. We end the paper with some discussion of future work as well as of the experience gained.
2009
Relation detection between named entities: report of a shared task
Cláudia Freitas | Diana Santos | Cristina Mota | Hugo Gonçalo Oliveira | Paula Carvalho
Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009)
Updating a Name Tagger Using Contemporary Unlabeled Data
Cristina Mota | Ralph Grishman
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers
2008
Is this NE tagger getting old?
Cristina Mota | Ralph Grishman
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
This paper focuses on the influence of changing the text time frame on the performance of a named entity tagger. We followed a twofold approach to investigate this subject: on the one hand, we analyzed a corpus that spans 8 years, and, on the other hand, we assessed the performance of a name tagger trained and tested on that corpus. We created 8 samples from the corpus, each drawn from the articles for a particular year. In terms of corpus analysis, we calculated the corpus similarity and names shared between samples. To see the effect on tagger performance, we implemented a semi-supervised name tagger based on co-training; then, we trained and tested our tagger on those samples. We observed that corpus similarity, names shared between samples, and tagger performance all decay as the time gap between the samples increases. Furthermore, we observed that the corpus similarity and names shared correlate with the tagger F-measure. These results show that named entity recognition systems may become obsolete in a short period of time.
2004
Portuguese Large-scale Language Resources for NLP Applications
Elisabete Ranchhod | Paula Carvalho | Cristina Mota | Anabela Barreiro
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04)
Multiword Lexical Acquisition and Dictionary Formalization
Cristina Mota | Paula Carvalho | Elisabete Ranchhod
Proceedings of the Workshop on Enhancing and Using Electronic Dictionaries
1999
A Computational Lexicon of Portuguese for Automatic Text Parsing
Elisabete Ranchhod | Cristina Mota | Jorge Baptista
SIGLEX99: Standardizing Lexical Resources