Jorge Baptista


Support Verb Constructions across the Ocean Sea
Jorge Baptista | Nuno Mamede | Sónia Reis
Proceedings of the 18th Workshop on Multiword Expressions @LREC2022

This paper analyses the support (or light) verb constructions (SVC) in a publicly available, manually annotated corpus of multiword expressions (MWE) in Brazilian Portuguese. The paper highlights several issues in the linguistic definitions adopted there for these types of MWE, and reports the results of applying STRING, a rule-based parsing system originally developed for European Portuguese, to this Brazilian Portuguese corpus. The goal is twofold: to improve the linguistic definition of SVC in the annotation task, and to gauge the major difficulties found when transposing linguistic resources between these two varieties of the same language.


Discriminating between Similar Languages Using a Combination of Typed and Untyped Character N-grams and Words
Helena Gomez | Ilia Markov | Jorge Baptista | Grigori Sidorov | David Pinto
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This paper presents the cic_ualg system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year’s task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms – Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.
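
As a rough illustration of the feature representations mentioned in this abstract, the sketch below extracts untyped character n-grams and builds either a binary or a raw term-frequency representation, with a corpus-level frequency threshold. It is not the cic_ualg system: the function names (`char_ngrams`, `featurize`, `apply_threshold`), the n-gram length, and the threshold value are all assumptions made for the example, and typed n-grams are left out because the abstract does not define the types used.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping (untyped) character n-grams of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def featurize(text, n=3, binary=False):
    """Map a text to n-gram features.

    binary=True  -> 1 for every n-gram that occurs (presence only)
    binary=False -> raw term frequency (occurrence count)
    """
    counts = Counter(char_ngrams(text, n))
    if binary:
        return {gram: 1 for gram in counts}
    return dict(counts)


def apply_threshold(doc_features, min_freq):
    """Drop n-grams whose summed value over all documents is below min_freq."""
    totals = Counter()
    for features in doc_features:
        totals.update(features)
    keep = {gram for gram, total in totals.items() if total >= min_freq}
    return [
        {gram: value for gram, value in features.items() if gram in keep}
        for features in doc_features
    ]


# Hypothetical two-document corpus, for illustration only.
docs = [featurize("banana"), featurize("bandana")]
filtered = apply_threshold(docs, min_freq=2)
```

The resulting dictionaries could then be vectorized and passed to a classifier such as an SVM or Multinomial Naive Bayes; that step is omitted here to keep the sketch dependency-free.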

Os Provérbios em manuais de ensino de Português Língua Não Materna (Proverbs in Textbooks for Teaching Portuguese as a Non-Native Language) [In Portuguese]
Sónia Reis | Jorge Baptista
Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology


metaTED: a Corpus of Metadiscourse for Spoken Language
Rui Correia | Nuno Mamede | Jorge Baptista | Maxine Eskenazi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper describes metaTED, a freely available corpus of metadiscursive acts in spoken language collected via crowdsourcing. Metadiscursive acts were annotated on a set of 180 randomly chosen TED talks in English, spanning different speakers and topics. The taxonomy used for annotation is composed of 16 categories, adapted from Adel (2010). This adaptation takes into account both the material to annotate and the setting in which the annotation task is performed. The crowdsourcing setup is described, including considerations regarding training and quality control. The collected data is evaluated in terms of quantity of occurrences, inter-annotator agreement, and annotation-related measures (such as average time on task and self-reported confidence). Results show different levels of agreement among metadiscourse acts (α ∈ [0.15; 0.49]). To further assess the collected material, a subset of the annotations was submitted to experts, who validated which of the marked occurrences truly correspond to instances of the metadiscursive act at hand. As with the crowd, the experts showed different levels of agreement across categories (α ∈ [0.18; 0.72]). The paper concludes with a discussion of the applicability of metaTED with respect to each of the 16 categories of metadiscourse.


Integrating support verb constructions into a parser
Amanda Rassi | Jorge Baptista | Nuno Mamede | Oto Vale
Proceedings of the 10th Brazilian Symposium in Information and Human Language Technology

Novo dicionário de formas flexionadas do Unitex-PB: avaliação da flexão verbal (New Dictionary of Inflected Forms of UNITEX-PB: Evaluation of Verbal Inflection)
Oto A. Vale | Jorge Baptista
Proceedings of the 10th Brazilian Symposium in Information and Human Language Technology


Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

The fuzzy boundaries of operator verb and support verb constructions with dar “give” and ter “have” in Brazilian Portuguese
Amanda Rassi | Cristina Santos-Turati | Jorge Baptista | Nuno Mamede | Oto Vale
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing


Spanish Adverbial Frozen Expressions
Dolors Català | Jorge Baptista
Proceedings of the Workshop on A Broader Perspective on Multiword Expressions


Frozen Sentences of Portuguese: Formal Descriptions for NLP
Jorge Baptista | Anabela Correia | Graça Fernandes
Proceedings of the Workshop on Multiword Expressions: Integrating Processing


A Computational Lexicon of Portuguese for Automatic Text Parsing
Elisabete Ranchhod | Cristina Mota | Jorge Baptista
SIGLEX99: Standardizing Lexical Resources