Eduardo Ferreira

2016

B2SG: a TOEFL-like Task for Portuguese
Rodrigo Wilkens | Leonardo Zilio | Eduardo Ferreira | Aline Villavicencio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Resources such as WordNet are useful for NLP applications, but their manual construction consumes time and personnel, and frequently results in low coverage. One alternative is the automatic construction of large resources from corpora, such as distributional thesauri, which contain semantically associated words. However, as they may contain noise, there is a strong need for automatic ways of evaluating the quality of the resulting resource. This paper introduces a gold standard that can aid in this task. The BabelNet-Based Semantic Gold Standard (B2SG) was automatically constructed based on BabelNet and partly evaluated by human judges. It consists of sets of tests, each presenting one target word, one related word and three unrelated words. B2SG contains 2,875 validated relations: 800 for verbs and 2,075 for nouns; these relations are divided among synonymy, antonymy and hypernymy. They can be used as the basis for evaluating the accuracy of the similarity relations in distributional thesauri by comparing the proximity of the target word to the related and unrelated options and observing whether the related word has the highest similarity value among them. As a case study, two distributional thesauri were also developed: one using surface forms from a large (1.5 billion word) corpus and the other using lemmatized forms from a smaller (409 million word) corpus. Both distributional thesauri were then evaluated against B2SG, and the one using lemmatized forms performed slightly better.
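
The evaluation procedure described in the abstract can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the abstract does not say which similarity measure was used, so cosine similarity over sparse context vectors is an assumption, as are the function names, the strict tie handling, the treatment of words missing from the thesaurus, and the toy data.

```python
import math
from typing import Dict, List, Tuple

# Assumption: a thesaurus entry is a sparse context vector (feature -> weight).
Vector = Dict[str, float]
# One B2SG-style test: (target word, related word, unrelated words).
Test = Tuple[str, str, List[str]]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors (assumed measure)."""
    dot = sum(weight * v.get(feature, 0.0) for feature, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def passes(vectors: Dict[str, Vector], test: Test) -> bool:
    """True if the related word is strictly more similar to the target
    than every unrelated word. Words absent from the thesaurus get an
    empty vector, so their similarity is 0.0 (an assumption)."""
    target, related, unrelated = test
    target_vec = vectors.get(target, {})
    related_sim = cosine(target_vec, vectors.get(related, {}))
    return all(
        related_sim > cosine(target_vec, vectors.get(word, {}))
        for word in unrelated
    )


def accuracy(vectors: Dict[str, Vector], tests: List[Test]) -> float:
    """Fraction of tests in which the related word ranks first."""
    if not tests:
        return 0.0
    return sum(passes(vectors, t) for t in tests) / len(tests)


# Toy data, invented for illustration only (not taken from B2SG).
TOY_VECTORS: Dict[str, Vector] = {
    "carro": {"dirigir": 3.0, "estrada": 2.0},
    "automóvel": {"dirigir": 2.0, "estrada": 2.0},
    "banana": {"comer": 3.0},
    "livro": {"ler": 4.0},
    "nuvem": {"céu": 2.0},
}
TOY_TESTS: List[Test] = [
    ("carro", "automóvel", ["banana", "livro", "nuvem"]),
    # "caderno" is missing from the toy thesaurus, so this test fails.
    ("livro", "caderno", ["carro", "banana", "nuvem"]),
]

if __name__ == "__main__":
    print(accuracy(TOY_VECTORS, TOY_TESTS))  # 0.5
```

Requiring a strict win means a related word that merely ties with an unrelated one counts as an error, which also penalizes words the thesaurus does not cover. A different tie-breaking or coverage policy would change the reported accuracy.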

2015

Distributional Thesauri for Portuguese: methodology evaluation
Rodrigo Wilkens | Leonardo Zilio | Eduardo Ferreira | Gabriel Gonçalves | Aline Villavicencio
Proceedings of the 10th Brazilian Symposium in Information and Human Language Technology

2009

2006

Open Resources and Tools for the Shallow Processing of Portuguese: The TagShare Project
Florbela Barreto | António Branco | Eduardo Ferreira | Amália Mendes | Maria Fernanda Bacelar do Nascimento | Filipe Nunes | João Ricardo Silva
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper presents the TagShare project and the linguistic resources and tools for the shallow processing of Portuguese developed in its scope. These resources include a 1 million token corpus that has been accurately hand-annotated with a variety of linguistic information, as well as several state-of-the-art shallow processing tools capable of automatically producing that type of annotation. At present, the linguistic annotations in the corpus are sentence and paragraph boundaries, token boundaries, morphosyntactic POS categories, values of inflection features, lemmas and named entities. Hence, the set of tools comprises a sentence chunker, a tokenizer, a POS tagger, nominal and verbal analyzers and lemmatizers, a verbal conjugator, a nominal inflector, and a named-entity recognizer, some of which underlie several online services.