Gustavo Paetzold

Also published as: Gustavo H. Paetzold, Gustavo Henrique Paetzold, Gustavo Henrique Paetzold

2021

pdf bib abs
SemEval-2021 Task 1: Lexical Complexity Prediction
Matthew Shardlow | Richard Evans | Gustavo Henrique Paetzold | Marcos Zampieri
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper presents the results and main findings of SemEval-2021 Task 1 - Lexical Complexity Prediction. We provided participants with an augmented version of the CompLex Corpus (Shardlow et al. 2020). CompLex is an English multi-domain corpus in which words and multi-word expressions (MWEs) were annotated with respect to their complexity using a five point Likert scale. SemEval-2021 Task 1 featured two Sub-tasks: Sub-task 1 focused on single words and Sub-task 2 focused on MWEs. The competition attracted 198 teams in total, of which 54 teams submitted official runs on the test data to Sub-task 1 and 37 to Sub-task 2.

pdf abs
UTFPR at SemEval-2021 Task 1: Complexity Prediction by Combining BERT Vectors and Classic Features
Gustavo Henrique Paetzold
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

We describe the UTFPR systems submitted to the Lexical Complexity Prediction shared task of SemEval 2021. They perform complexity prediction by combining classic features, such as word frequency, n-gram frequency, word length, and number of senses, with BERT vectors. We test numerous feature combinations and machine learning models in our experiments and find that BERT vectors, even if not optimized for the task at hand, are a great complement to classic features. We also find that employing the principle of compositionality can potentially help in phrase complexity prediction. Our systems place 45th out of 55 for single words and 29th out of 38 for phrases.

2020

abs
SIMPLEX-PB 2.0: A Reliable Dataset for Lexical Simplification in Brazilian Portuguese
Nathan Hartmann | Gustavo Henrique Paetzold | Sandra Aluísio
Proceedings of the Fourth Widening Natural Language Processing Workshop

Most research on Lexical Simplification (LS) addresses non-native speakers of English, since they are numerous and easy to recruit. This makes it difficult to create LS solutions for other languages and target audiences. This paper presents SIMPLEX-PB 2.0, a dataset for LS in Brazilian Portuguese that, unlike its predecessor SIMPLEX-PB, accurately captures the needs of Brazilian underprivileged children. To create SIMPLEX-PB 2.0, we addressed all limitations of the old SIMPLEX-PB through multiple rounds of manual annotation. As a result, SIMPLEX-PB 2.0 features much more reliable and numerous candidate substitutions to complex words, as well as word complexity rankings produced by a group underprivileged children.

pdf abs
UTFPR at SemEval-2020 Task 7: Using Co-occurrence Frequencies to Capture Unexpectedness
Gustavo Henrique Paetzold
Proceedings of the Fourteenth Workshop on Semantic Evaluation

We describe the UTFPR system for SemEval-2020’s Task 7: Assessing Humor in Edited News Headlines. Ours is a minimalist unsupervised system that uses word co-occurrence frequencies from large corpora to capture unexpectedness as a mean to capture funniness. Our system placed 22nd on the shared task’s Task 2. We found that our approach requires more text than we used to perform reliably, and that unexpectedness alone is not sufficient to gauge funniness for humorous content that targets a diverse target audience.

pdf abs
UTFPR at SemEval 2020 Task 12: Identifying Offensive Tweets with Lightweight Ensembles
Marcos Aurélio Hermogenes Boriola | Gustavo Henrique Paetzold
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Offensive language is a common issue on social media platforms nowadays. In an effort to address this issue, the SemEval 2020 event held the OffensEval 2020 shared task where the participants were challenged to develop systems that identify and classify offensive language in tweets. In this paper, we present a system that uses an Ensemble model stacking a BOW model and a CNN model that led us to place 29th in the ranking for English sub-task A.

2019

pdf abs
UTFPR at SemEval-2019 Task 5: Hate Speech Identification with Recurrent Neural Networks
Gustavo Henrique Paetzold | Marcos Zampieri | Shervin Malmasi
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper we revisit the problem of automatically identifying hate speech in posts from social media. We approach the task using a system based on minimalistic compositional Recurrent Neural Networks (RNN). We tested our approach on the SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (HatEval) shared task dataset. The dataset made available by the HatEval organizers contained English and Spanish posts retrieved from Twitter annotated with respect to the presence of hateful content and its target. In this paper we present the results obtained by our system in comparison to the other entries in the shared task. Our system achieved competitive performance ranking 7th in sub-task A out of 62 systems in the English track.

pdf abs
UTFPR at SemEval-2019 Task 6: Relying on Compositionality to Find Offense
Gustavo Henrique Paetzold
Proceedings of the 13th International Workshop on Semantic Evaluation

We present the UTFPR system for the OffensEval shared task of SemEval 2019: A character-to-word-to-sentence compositional RNN model trained exclusively over the training data provided by the organizers. We find that, although not very competitive for the task at hand, it offers a robust solution to the orthographic irregularity inherent to tweets.

pdf abs
Experiments in Cuneiform Language Identification
Gustavo Henrique Paetzold | Marcos Zampieri
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper presents methods to discriminate between languages and dialects written in Cuneiform script, one of the first writing systems in the world. We report the results obtained by the PZ team in the Cuneiform Language Identification (CLI) shared task organized within the scope of the VarDial Evaluation Campaign 2019. The task included two languages, Sumerian and Akkadian. The latter is divided into six dialects: Old Babylonian, Middle Babylonian peripheral, Standard Babylonian, Neo Babylonian, Late Babylonian, and Neo Assyrian. We approach the task using a meta-classifier trained on various SVM models and we show the effectiveness of the system for this task. Our submission achieved 0.738 F1 score in discriminating between the seven languages and dialects and it was ranked fourth in the competition among eight teams.

2018

pdf abs
Lexi: A tool for adaptive, personalized text simplification
Joachim Bingel | Gustavo Paetzold | Anders Søgaard
Proceedings of the 27th International Conference on Computational Linguistics

Most previous research in text simplification has aimed to develop generic solutions, assuming very homogeneous target audiences with consistent intra-group simplification needs. We argue that this assumption does not hold, and that instead we need to develop simplification systems that adapt to the individual needs of specific users. As a first step towards personalized simplification, we propose a framework for adaptive lexical simplification and introduce Lexi, a free open-source and easily extensible tool for adaptive, personalized text simplification. Lexi is easily installed as a browser extension, enabling easy access to the service for its users.

pdf
Text Simplification from Professionally Produced Corpora
Carolina Scarton | Gustavo Paetzold | Lucia Specia
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
SimPA: A Sentence-Level Simplification Corpus for the Public Administration Domain
Carolina Scarton | Gustavo Paetzold | Lucia Specia
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

We report the findings of the second Complex Word Identification (CWI) shared task organized as part of the BEA workshop co-located with NAACL-HLT’2018. The second CWI shared task featured multilingual and multi-genre datasets divided into four tracks: English monolingual, German monolingual, Spanish monolingual, and a multilingual track with a French test set, and two tasks: binary classification and probabilistic classification. A total of 12 teams submitted their results in different task/track combinations and 11 of them wrote system description papers that are referred to in this report and appear in the BEA workshop proceedings.

pdf abs
UTFPR at IEST 2018: Exploring Character-to-Word Composition for Emotion Analysis
Gustavo Paetzold
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

We introduce the UTFPR system for the Implicit Emotions Shared Task of 2018: A compositional character-to-word recurrent neural network that does not exploit heavy and/or hard-to-obtain resources. We find that our approach can outperform multiple baselines, and offers an elegant and effective solution to the problem of orthographic variance in tweets.

pdf abs
UTFPR at WMT 2018: Minimalistic Supervised Corpora Filtering for Machine Translation
Gustavo Paetzold
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We present the UTFPR systems at the WMT 2018 parallel corpus filtering task. Our supervised approach discerns between good and bad translations by training classic binary classification models over an artificially produced binary classification dataset derived from a high-quality translation set, and a minimalistic set of 6 semantic distance features that rely only on easy-to-gather resources. We rank translations by their probability for the “good” label. Our results show that logistic regression pairs best with our approach, yielding more consistent results throughout the different settings evaluated.

2017

pdf abs
Learning How to Simplify From Explicit Labeling of Complex-Simplified Text Pairs
Fernando Alva-Manchego | Joachim Bingel | Gustavo Paetzold | Carolina Scarton | Lucia Specia
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Current research in text simplification has been hampered by two central problems: (i) the small amount of high-quality parallel simplification data available, and (ii) the lack of explicit annotations of simplification operations, such as deletions or substitutions, on existing data. While the recently introduced Newsela corpus has alleviated the first problem, simplifications still need to be learned directly from parallel text using black-box, end-to-end approaches rather than from explicit annotations. These complex-simple parallel sentence pairs often differ to such a high degree that generalization becomes difficult. End-to-end models also make it hard to interpret what is actually learned from data. We propose a method that decomposes the task of TS into its sub-problems. We devise a way to automatically identify operations in a parallel corpus and introduce a sequence-labeling approach based on these annotations. Finally, we provide insights on the types of transformations that different approaches can model.

pdf bib abs
MASSAlign: Alignment and Annotation of Comparable Documents
Gustavo Paetzold | Fernando Alva-Manchego | Lucia Specia
Proceedings of the IJCNLP 2017, System Demonstrations

We introduce MASSAlign: a Python library for the alignment and annotation of monolingual comparable documents. MASSAlign offers easy-to-use access to state of the art algorithms for paragraph and sentence-level alignment, as well as novel algorithms for word-level annotation of transformation operations between aligned sentences. In addition, MASSAlign provides a visualization module to display and analyze the alignments and annotations performed.

pdf abs
The Ultimate Presentation Makeup Tutorial: How to Polish your Posters, Slides and Presentations Skills
Gustavo Paetzold | Lucia Specia
Proceedings of the IJCNLP 2017, Tutorial Abstracts

There is no question that our research community have, and still has been producing an insurmountable amount of interesting strategies, models and tools to a wide array of problems and challenges in diverse areas of knowledge. But for as long as interesting work has existed, we’ve been plagued by a great unsolved mystery: how come there is so much interesting work being published in conferences, but not as many interesting and engaging posters and presentations being featured in them? In this tutorial, we present practical step-by-step makeup solutions for poster, slides and oral presentations in order to help researchers who feel like they are not able to convey the importance of their research to the community in conferences.

pdf abs
Lexical Simplification with Neural Ranking
Gustavo Paetzold | Lucia Specia
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We present a new Lexical Simplification approach that exploits Neural Networks to learn substitutions from the Newsela corpus - a large set of professionally produced simplifications. We extract candidate substitutions by combining the Newsela corpus with a retrofitted context-aware word embeddings model and rank them using a new neural regression model that learns rankings from annotated data. This strategy leads to the highest Accuracy, Precision and F1 scores to date in standard datasets for the task.

pdf
Feature-Enriched Character-Level Convolutions for Text Regression
Gustavo Paetzold | Lucia Specia
Proceedings of the Second Conference on Machine Translation

pdf abs
Complex Word Identification: Challenges in Data Annotation and System Performance
Marcos Zampieri | Shervin Malmasi | Gustavo Paetzold | Lucia Specia
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)

This paper revisits the problem of complex word identification (CWI) following up the SemEval CWI shared task. We use ensemble classifiers to investigate how well computational methods can discriminate between complex and non-complex words. Furthermore, we analyze the classification performance to understand what makes lexical complexity challenging. Our findings show that most systems performed poorly on the SemEval CWI dataset, and one of the reasons for that is the way in which human annotation was performed.

pdf bib
Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology
Gustavo Henrique Paetzold | Vládia Pinheiro
Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology

2016

pdf
SemEval 2016 Task 11: Complex Word Identification
Gustavo Paetzold | Lucia Specia
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf
SV000gg at SemEval-2016 Task 11: Heavy Gauge Complex Word Identification with System Voting
Gustavo Paetzold | Lucia Specia
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf
Inferring Psycholinguistic Properties of Words
Gustavo Paetzold | Lucia Specia
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
SHEF-MIME: Word-level Quality Estimation Using Imitation Learning
Daniel Beck | Andreas Vlachos | Gustavo Paetzold | Lucia Specia
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf
SimpleNets: Quality Estimation with Resource-Light Neural Networks
Gustavo Paetzold | Lucia Specia
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf abs
Benchmarking Lexical Simplification Systems
Gustavo Paetzold | Lucia Specia
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Lexical Simplification is the task of replacing complex words in a text with simpler alternatives. A variety of strategies have been devised for this challenge, yet there has been little effort in comparing their performance. In this contribution, we present a benchmarking of several Lexical Simplification systems. By combining resources created in previous work with automatic spelling and inflection correction techniques, we introduce BenchLS: a new evaluation dataset for the task. Using BenchLS, we evaluate the performance of solutions for various steps in the typical Lexical Simplification pipeline, both individually and jointly. This is the first time Lexical Simplification systems are compared in such fashion on the same data, and the findings introduce many contributions to the field, revealing several interesting properties of the systems evaluated.

pdf abs
Understanding the Lexical Simplification Needs of Non-Native Speakers of English
Gustavo Paetzold | Lucia Specia
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We report three user studies in which the Lexical Simplification needs of non-native English speakers are investigated. Our analyses feature valuable new insight on the relationship between the non-natives’ notion of complexity and various morphological, semantic and lexical word properties. Some of our findings contradict long-standing misconceptions about word simplicity. The data produced in our studies consists of 211,564 annotations made by 1,100 volunteers, which we hope will guide forthcoming research on Text Simplification for non-native speakers of English.

pdf abs
Collecting and Exploring Everyday Language for Predicting Psycholinguistic Properties of Words
Gustavo Paetzold | Lucia Specia
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Exploring language usage through frequency analysis in large corpora is a defining feature in most recent work in corpus and computational linguistics. From a psycholinguistic perspective, however, the corpora used in these contributions are often not representative of language usage: they are either domain-specific, limited in size, or extracted from unreliable sources. In an effort to address this limitation, we introduce SubIMDB, a corpus of everyday language spoken text we created which contains over 225 million words. The corpus was extracted from 38,102 subtitles of family, comedy and children movies and series, and is the first sizeable structured corpus of subtitles made available. Our experiments show that word frequency norms extracted from this corpus are more effective than those from well-known norms such as Kucera-Francis, HAL and SUBTLEXus in predicting various psycholinguistic properties of words, such as lexical decision times, familiarity, age of acquisition and simplicity. We also provide evidence that contradict the long-standing assumption that the ideal size for a corpus can be determined solely based on how well its word frequencies correlate with lexical decision times.

pdf abs
Anita: An Intelligent Text Adaptation Tool
Gustavo Paetzold | Lucia Specia
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We introduce Anita: a flexible and intelligent Text Adaptation tool for web content that provides Text Simplification and Text Enhancement modules. Anita’s simplification module features a state-of-the-art system that adapts texts according to the needs of individual users, and its enhancement module allows the user to search for a word’s definitions, synonyms, translations, and visual cues through related images. These utilities are brought together in an easy-to-use interface of a freely available web browser extension.

pdf abs
Quality Estimation for Language Output Applications
Carolina Scarton | Gustavo Paetzold | Lucia Specia
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Tutorial Abstracts

Quality Estimation (QE) of language output applications is a research area that has been attracting significant attention. The goal of QE is to estimate the quality of language output applications without the need of human references. Instead, machine learning algorithms are used to build supervised models based on a few labelled training instances. Such models are able to generalise over unseen data and thus QE is a robust method applicable to scenarios where human input is not available or possible. One such a scenario where QE is particularly appealing is that of Machine Translation, where a score for predicted quality can help decide whether or not a translation is useful (e.g. for post-editing) or reliable (e.g. for gisting). Other potential applications within Natural Language Processing (NLP) include Text Summarisation and Text Simplification. In this tutorial we present the task of QE and its application in NLP, focusing on Machine Translation. We also introduce QuEst++, a toolkit for QE that encompasses feature extraction and machine learning, and propose a practical activity to extend this toolkit in various ways.