José João Almeida

Also published as: Jose Joao Almeida


Enriching a Portuguese WordNet using Synonyms from a Monolingual Dictionary
Alberto Simões | Xavier Gómez Guinovart | José João Almeida
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this article we present an exploratory approach to enrich a WordNet-like lexical ontology with the synonyms present in a standard monolingual Portuguese dictionary. The dictionary was converted from PDF into XML, and senses were automatically identified and annotated. This allowed us to extract them, independently of the definitions, and to create sets of synonyms (synsets). These synsets were then aligned with WordNet synsets, both in the same language (Portuguese) and by projecting the Portuguese terms into English, Spanish and Galician. This process allowed both the addition of new term variants to existing synsets and the creation of new synsets for Portuguese.


The Minho Quotation Resource
Brett Drury | José João Almeida
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Direct quotations from business leaders can provide a rich sample of the language in common use in the world of commerce. Business leaders often use metaphors, euphemisms, slang, obscenities and invented words. In addition, the business lexicon is dynamic, because new words or terms gain popularity with businessmen whilst obsolete words exit their common vocabulary. Besides being a rich source of language, direct quotations from business leaders can have "real world" consequences. For example, Gerald Ratner nearly bankrupted his company with an infamous candid comment at an Institute of Directors meeting in 1993. Currently, there is no "direct quotations from business leaders" resource freely available to the research community. The "Minho Quotation Resource" captures the business lexicon with more than 500,000 quotations from individuals from the business world. The quotations were captured between October 2009 and April 2011. The resource is available as a searchable Lucene index and will be available for download in May 2012.

Structural alignment of plain text books
André Santos | José João Almeida | Nuno Carvalho
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Text alignment is one of the main processes for obtaining parallel corpora. When aligning two versions of a book, results are often affected by unpaired sections, that is, sections which exist in only one of the versions of the book. We developed Text::Perfide::BookSync, a Perl module which performs book synchronization (structural alignment based on section delimitation), provided the books have been previously annotated by Text::Perfide::BookCleaner. We discuss the need for such a tool and several implementation decisions. The main functions are described, and examples of input and output are presented. Text::Perfide::PartialAlign is an extension of the tool bundled with hunalign which proposes alternative methods for splitting bitexts.


Guided Self Training for Sentiment Classification
Brett Drury | Luís Torgo | Jose Joao Almeida
Proceedings of Workshop on Robust Unsupervised and Semisupervised Methods in Natural Language Processing


Processing and Extracting Data from Dicionário Aberto
Alberto Simões | José João Almeida | Rita Farinha
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Synonym dictionaries are useful resources for natural language processing. Unfortunately, their availability in digital format is limited, as publishing companies do not release their dictionaries in open digital formats. Dicionário-Aberto (Simões and Farinha, 2010) is an open and free digital synonym dictionary for the Portuguese language. It is in the public domain and in textual digital format, which makes it usable for any task. Synonym dictionaries are commonly used for the extraction of relations between words, the construction of complex structures like ontologies or thesauri (comparable to WordNet (Miller et al., 1990)), or just the extraction of lists of words of a specific type. This article presents Dicionário-Aberto, discussing how it was created, its main characteristics, the type of information present in it and the formats in which it is available. A description follows of an API designed specifically to help process Dicionário-Aberto without the need to deal with the dictionary format. Finally, we analyze the results of some data extraction experiments: extracting lists of words from a specific class, and extracting relationships between words.

Bigorna – A Toolkit for Orthography Migration Challenges
José João Almeida | André Santos | Alberto Simões
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Languages are born, evolve and, eventually, die. During this evolution their spelling rules (and sometimes the syntactic and semantic ones) change, putting old documents out of use. In Portugal, a pair of political agreements with Brazil forced relevant changes in the way the Portuguese language is written. In this article we detail these two Orthographic Agreements (one in the thirties and the other more recently, in the nineties), and the challenges of automatically migrating the spelling of old documents to the current one. We present Bigorna, a toolkit for the classification of language variants, their comparison and the conversion of texts between different language versions. These tools are explained together with examples of migration issues. As Bigorna relies on a set of conversion rules, we also discuss how to infer conversion rules from a set of documents (texts of different ages). The document concludes with a brief evaluation of the conversion and classification tool results and their relevance in the current Portuguese language scenario.


T2O - Recycling Thesauri into a Multilingual Ontology
Alberto Simões | José João Almeida
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this article we present T2O - a workbench to assist the process of translating heterogeneous resources into ontologies, to enrich and add multilingual information, to help programming with them, and to support ontology publishing. T2O is an ontology algebra.