João Graça

Also published as: Joao Graca, João V. Graça


2020

This paper presents the Multilingual Artificial Intelligence Agent Assistant (MAIA), a project led by Unbabel with the collaboration of CMU, INESC-ID and IT Lisbon. MAIA will employ cutting-edge machine learning and natural language processing technologies to build multilingual AI agent assistants, eliminating language barriers. MAIA’s translation layer will empower human agents to provide customer support in real time, in any language, with human quality.

2018

2017

2012

With the CINTIL-International Corpus of Portuguese, an ongoing corpus annotated with fully fledged grammatical representations, sentences receive not only a high level of lexical, morphological and syntactic annotation but also a semantic analysis that prepares the data for a manual specification step and thus opens the way for a number of tools and resources that are currently a major research focus. This paper reports on the construction of a propbank that builds on CINTIL-DeepGramBank, with nearly 10 thousand sentences, on the basis of a deep linguistic grammar, and on the process and the linguistic criteria guiding that construction, which make it possible to obtain a complete PropBank with both syntactic and semantic levels of linguistic annotation. Taking this into account, together with the promising inter-annotator agreement scores presented in this study, CINTIL-PropBank presents itself as a valuable resource for training a semantic role labeller, one of our goals with this project.

2011

2010

In this paper we describe the Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento (INESC-ID) system that participated in the IWSLT 2010 evaluation campaign. Our main goal for this evaluation was to employ several state-of-the-art methods applied to phrase-based machine translation in order to improve the translation quality. Aside from the IBM M4 alignment model, two constrained alignment models were tested, which produced better overall results. These results were further improved by using weighted alignment matrices during phrase extraction, rather than the single best alignment. Finally, we tested several filters that ruled out phrase pairs based on punctuation. Our system was evaluated on the BTEC and DIALOG tasks, having achieved a better overall ranking in the DIALOG task.
Phrase-based systems depend heavily on the quality of their phrase tables, and therefore the process of phrase extraction is always a fundamental step. In this paper we present a general and extensible phrase extraction algorithm in which we highlight several control points. Instantiating these control points allows previous approaches to be simulated, since different strategies or heuristics can be tested at each point. We show how previous approaches fit into this algorithm, compare several of them and, in addition, propose alternative heuristics, showing their impact on the final translation results. Considering two different test scenarios from the IWSLT 2010 competition (BTEC, Fr-En and DIALOG, Cn-En), we have obtained improvements of 2.4 and 2.8 BLEU points, respectively.
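As a rough illustration of what phrase extraction involves, the sketch below implements the textbook consistency criterion over a single word alignment. It is not the algorithm of the paper above: the function name `extract_phrases`, the `max_len` parameter and the toy sentence pair are assumptions made for this example, and the usual extension over unaligned target words is left out for brevity. The comments mark places where a heuristic could be swapped in, in the spirit of the control points the abstract mentions.

```python
from typing import List, Set, Tuple

# A word alignment as a set of (source index, target index) links.
Alignment = Set[Tuple[int, int]]


def extract_phrases(src: List[str], tgt: List[str], alignment: Alignment,
                    max_len: int = 4) -> Set[Tuple[str, str]]:
    """Return phrase pairs consistent with one word alignment (hypothetical helper)."""
    pairs = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word inside the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue  # possible control point: how to treat unaligned spans
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue  # possible control point: phrase length limit
            # Consistency: no word in the target span may link outside the source span.
            if any(t_start <= t <= t_end and not (s_start <= s <= s_end)
                   for (s, t) in alignment):
                continue
            pairs.add((" ".join(src[s_start:s_end + 1]),
                       " ".join(tgt[t_start:t_end + 1])))
    return pairs


# Toy example (invented data): French to English with one reordering.
src = ["la", "maison", "bleue"]
tgt = ["the", "blue", "house"]
links = {(0, 0), (1, 2), (2, 1)}
phrases = extract_phrases(src, tgt, links)
```

In this toy case the span "la maison" is rejected because its target span would cover "blue", which is linked to "bleue" outside the source span.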
Corpora of sentences annotated with grammatical information have been deployed by extending the basic lexical and morphological data with increasingly complex information, such as phrase constituency, syntactic functions, semantic roles, etc. As these corpora grow in size and the linguistic information to be encoded reaches higher levels of sophistication, the use of annotation tools and, above all, supporting computational grammars appears no longer as a matter of convenience but of necessity. In this paper, we report on the design features, the development conditions and the methodological options of a deep linguistic databank, the CINTIL DeepGramBank. In this corpus, sentences are annotated with fully fledged linguistically informed grammatical representations that are produced by a deep linguistic processing grammar, thus consistently integrating morphological, syntactic and semantic information. We also report on how such a corpus makes it possible to straightforwardly obtain a whole range of past generation annotated corpora (POS, NER and morphology), current generation treebanks (constituency treebanks, dependency banks, propbanks) and next generation databanks (logical form banks) simply by means of a very residual selection/extraction effort to get the appropriate "views" exposing the relevant layers of information.

2008

This paper reports on our experience producing manual word alignments over six different language pairs (all combinations between Portuguese, English, French and Spanish) (Graça et al., 2008). Word alignment of each language pair is made over the first 100 sentences of the common test set from the Europarl corpora (Koehn, 2005), corresponding to 600 new annotated sentences. This collection is publicly available at http://www.l2f.inesc-id.pt/resources/translation/. It contains, to our knowledge, the first word alignment gold set for the Portuguese language, with three other languages. It is also, to our knowledge, the first multi-language manual word aligned parallel corpus, where the same sentences are annotated for each language pair. We started by using the guidelines presented in (Mariño, 2005) and performed several refinements: some due to under-specifications in the original guidelines, others because of disagreement on some choices. This led to the development of an extensive new set of guidelines for multi-lingual word alignment annotation that, we believe, makes the alignment process less ambiguous. We evaluate the inter-annotator agreement, obtaining an average of 91.6% agreement between the different language pairs.
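The abstract does not say which agreement measure was used. As a minimal sketch, assuming agreement is computed as the overlap between the link sets of two annotators (twice the shared links divided by the total number of links), the calculation could look as follows. The function names and the toy annotations are invented for illustration and are not taken from the paper.

```python
from typing import Dict, Set, Tuple

Links = Set[Tuple[int, int]]  # (source index, target index) alignment links


def alignment_agreement(a: Links, b: Links) -> float:
    """Overlap between two annotators' links: 2 * |A & B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # assumption: two empty annotations count as full agreement
    return 2 * len(a & b) / (len(a) + len(b))


def average_agreement(per_pair: Dict[str, Tuple[Links, Links]]) -> float:
    """Unweighted mean of the agreement over language pairs."""
    scores = [alignment_agreement(a, b) for a, b in per_pair.values()]
    return sum(scores) / len(scores)


# Toy annotations (invented): two annotators, two language pairs.
annotations = {
    "pt-en": ({(0, 0), (1, 1), (2, 2)}, {(0, 0), (1, 1), (2, 3)}),
    "pt-es": ({(0, 0), (1, 1)}, {(0, 0), (1, 1)}),
}
mean_score = average_agreement(annotations)
```

Other measures, such as ones that distinguish sure from possible links, would give different numbers for the same annotations.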

2007

We present the machine translation system used by L2F from INESC-ID in the evaluation campaign of the International Workshop on Spoken Language Translation (2007), in the task of translating spontaneous conversations in the travel domain from Italian to English.