Marcos Zampieri


2024

pdf bib
Countering Hateful and Offensive Speech Online - Open Challenges
Leon Derczynski | Marco Guerini | Debora Nozza | Flor Miriam Plaza-del-Arco | Jeffrey Sorensen | Marcos Zampieri
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

In today’s digital age, hate speech and offensive speech online pose a significant challenge to maintaining respectful and inclusive online environments. This tutorial aims to provide attendees with a comprehensive understanding of the field by delving into essential dimensions such as multilingualism, counter-narrative generation, a hands-on session with one of the most popular APIs for detecting hate speech, fairness, and ethics in AI, and the use of recent advanced approaches. In addition, the tutorial aims to foster collaboration and inspire participants to create safer online spaces by detecting and mitigating hate speech.

pdf bib
A Federated Learning Approach to Privacy Preserving Offensive Language Identification
Marcos Zampieri | Damith Premasiri | Tharindu Ranasinghe
Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024

The spread of various forms of offensive speech online is an important concern in social media. While platforms have been investing heavily in ways of coping with this problem, the question of privacy remains largely unaddressed. Models trained to detect offensive language on social media are trained and/or fine-tuned using large amounts of data often stored in centralized servers. Since most social media data originates from end users, we propose a privacy preserving decentralized architecture for identifying offensive language online by introducing Federated Learning (FL) in the context of offensive language identification. FL is a decentralized architecture that allows multiple models to be trained locally without the need for data sharing hence preserving users’ privacy. We propose a model fusion approach to perform FL. We trained multiple deep learning models on four publicly available English benchmark datasets (AHSD, HASOC, HateXplain, OLID) and evaluated their performance in detail. We also present initial cross-lingual experiments in English and Spanish. We show that the proposed model fusion approach outperforms baselines in all the datasets while preserving privacy.

pdf bib
Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)
Matthew Shardlow | Horacio Saggion | Fernando Alva-Manchego | Marcos Zampieri | Kai North | Sanja Štajner | Regina Stodden
Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)

pdf bib
MultiLS: An End-to-End Lexical Simplification Framework
Kai North | Tharindu Ranasinghe | Matthew Shardlow | Marcos Zampieri
Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)

Lexical Simplification (LS) automatically replaces difficult to read words for easier alternatives while preserving a sentence’s original meaning. Several datasets exist for LS and each of them specialize in one or two sub-tasks within the LS pipeline. However, as of this moment, no single LS dataset has been developed that covers all LS sub-tasks. We present MultiLS, the first LS framework that allows for the creation of a multi-task LS dataset. We also present MultiLS-PT, the first dataset created using the MultiLS framework. We demonstrate the potential of MultiLS-PT by carrying out all LS sub-tasks of (1) lexical complexity prediction (LCP), (2) substitute generation, and (3) substitute ranking for Portuguese.

pdf
An Extensible Massively Multilingual Lexical Simplification Pipeline Dataset using the MultiLS Framework
Matthew Shardlow | Fernando Alva-Manchego | Riza Batista-Navarro | Stefan Bott | Saul Calderon Ramirez | Rémi Cardon | Thomas François | Akio Hayakawa | Andrea Horbach | Anna Hülsing | Yusuke Ide | Joseph Marvin Imperial | Adam Nohejl | Kai North | Laura Occhipinti | Nelson Peréz Rojas | Nishat Raihan | Tharindu Ranasinghe | Martin Solis Salazar | Marcos Zampieri | Horacio Saggion
Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI) @ LREC-COLING 2024

We present preliminary findings on the MultiLS dataset, developed in support of the 2024 Multilingual Lexical Simplification Pipeline (MLSP) Shared Task. This dataset currently comprises of 300 instances of lexical complexity prediction and lexical simplification across 10 languages. In this paper, we (1) describe the annotation protocol in support of the contribution of future datasets and (2) present summary statistics on the existing data that we have gathered. Multilingual lexical simplification can be used to support low-ability readers to engage with otherwise difficult texts in their native, often low-resourced, languages.

pdf
Multilingual Resources for Lexical Complexity Prediction: A Review
Matthew Shardlow | Kai North | Marcos Zampieri
Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024

Lexical complexity prediction is the NLP task aimed at using machine learning to predict the difficulty of a target word in context for a given user or user group. Multiple datasets exist for lexical complexity prediction, many of which have been published recently in diverse languages. In this survey, we discuss nine recent datasets (2018-2024) all of which provide lexical complexity prediction annotations. Particularly, we identified eight languages (French, Spanish, Chinese, German, Russian, Japanese, Turkish and Portuguese) with at least one lexical complexity dataset. We do not consider the English datasets, which have already received significant treatment elsewhere in the literature. To survey these datasets, we use the recommendations of the Complex 2.0 Framework (Shardlow et al., 2022), identifying how the datasets differ along the following dimensions: annotation scale, context, multiple token instances, multiple token annotations, diverse annotators. We conclude with future research challenges arising from our survey of existing lexical complexity prediction datasets.

pdf
Rater Cohesion and Quality from a Vicarious Perspective
Deepak Pandita | Tharindu Cyril Weerasooriya | Sujan Dutta | Sarah K. K. Luger | Tharindu Ranasinghe | Ashiqur R. KhudaBukhsh | Marcos Zampieri | Christopher M Homan
Findings of the Association for Computational Linguistics: EMNLP 2024

Human feedback is essential for building human-centered AI systems across domains where disagreement is prevalent, such as AI safety, content moderation, or sentiment analysis. Many disagreements, particularly in politically charged settings, arise because raters have opposing values or beliefs. Vicarious annotation is a method for breaking down disagreement by asking raters how they think others would annotate the data. In this paper, we explore the use of vicarious annotation with analytical methods for moderating rater disagreement. We employ rater cohesion metrics to study the potential influence of political affiliations and demographic backgrounds on raters’ perceptions of offense. Additionally, we utilize CrowdTruth’s rater quality metrics, which consider the demographics of the raters, to score the raters and their annotations. We study how the rater quality metrics influence the in-group and cross-group rater cohesion across the personal and vicarious levels.

pdf bib
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)
Yves Scherrer | Tommi Jauhiainen | Nikola Ljubešić | Marcos Zampieri | Preslav Nakov | Jörg Tiedemann
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)

pdf
Native Language Identification in Texts: A Survey
Dhiman Goswami | Sharanya Thilagan | Kai North | Shervin Malmasi | Marcos Zampieri
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We present the first comprehensive survey of Native Language Identification (NLI) applied to texts. NLI is the task of automatically identifying an author’s native language (L1) based on their second language (L2) production. NLI is an important task with practical applications in second language teaching and NLP. The task has been widely studied for both text and speech, particularly for L2 English due to the availability of suitable corpora. Speech-based NLI relies heavily on accent modeled by pronunciation patterns and prosodic cues while text-based NLI relies primarily on modeling spelling errors and grammatical patterns that reveal properties of an individuals’ L1 influencing L2 production. We survey over one hundred papers on the topic including the papers associated with the NLI and INLI shared tasks. We describe several text representations and computational techniques used in text-based NLI. Finally, we present a comprehensive account of publicly available datasets used for the task thus far.

pdf bib
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Yang (Trista) Cao | Isabel Papadimitriou | Anaelia Ovalle | Marcos Zampieri | Francis Ferraro | Swabha Swayamdipta
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

pdf
MasonPerplexity at Multimodal Hate Speech Event Detection 2024: Hate Speech and Target Detection Using Transformer Ensembles
Amrita Ganguly | Al Nahian Bin Emran | Sadiya Sayara Chowdhury Puspo | Md Nishat Raihan | Dhiman Goswami | Marcos Zampieri
Proceedings of the 7th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2024)

The automatic identification of offensive language such as hate speech is important to keep discussions civil in online communities. Identifying hate speech in multimodal content is a particularly challenging task because offensiveness can be manifested in either words or images or a juxtaposition of the two. This paper presents the MasonPerplexity submission for the Shared Task on Multimodal Hate Speech Event Detection at CASE 2024 at EACL 2024. The task is divided into two sub-tasks: sub-task A focuses on the identification of hate speech and sub-task B focuses on the identification of targets in text-embedded images during political events. We use an XLM-roBERTa-large model for sub-task A and an ensemble approach combining XLM-roBERTa-base, BERTweet-large, and BERT-base for sub-task B. Our approach obtained 0.8347 F1-score in sub-task A and 0.6741 F1-score in sub-task B ranking 3rd on both sub-tasks.

pdf
The BEA 2024 Shared Task on the Multilingual Lexical Simplification Pipeline
Matthew Shardlow | Fernando Alva-Manchego | Riza Batista-Navarro | Stefan Bott | Saul Calderon Ramirez | Rémi Cardon | Thomas François | Akio Hayakawa | Andrea Horbach | Anna Hülsing | Yusuke Ide | Joseph Marvin Imperial | Adam Nohejl | Kai North | Laura Occhipinti | Nelson Peréz Rojas | Nishat Raihan | Tharindu Ranasinghe | Martin Solis Salazar | Sanja Štajner | Marcos Zampieri | Horacio Saggion
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

We report the findings of the 2024 Multilingual Lexical Simplification Pipeline shared task. We released a new dataset comprising 5,927 instances of lexical complexity prediction and lexical simplification on common contexts across 10 languages, split into trial (300) and test (5,627). 10 teams participated across 2 tracks and 10 languages with 233 runs evaluated across all systems. Five teams participated in all languages for the lexical complexity prediction task and 4 teams participated in all languages for the lexical simplification task. Teams employed a range of strategies, making use of open and closed source large language models for lexical simplification, as well as feature-based approaches for lexical complexity prediction. The highest scoring team on the combined multilingual data was able to obtain a Pearson’s correlation of 0.6241 and an ACC@1@Top1 of 0.3772, both demonstrating that there is still room for improvement on two difficult sub-tasks of the lexical simplification pipeline.

pdf
GMU at MLSP 2024: Multilingual Lexical Simplification with Transformer Models
Dhiman Goswami | Kai North | Marcos Zampieri
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

This paper presents GMU’s submission to the Multilingual Lexical Simplification Pipeline (MLSP) shared task at the BEA workshop 2024. The task includes Lexical Complexity Prediction (LCP) and Lexical Simplification (LS) sub-tasks across 10 languages. Our submissions achieved rankings ranging from 1st to 5th in LCP and from 1st to 3rd in LS. Our best performing approach for LCP is a weighted ensemble based on Pearson correlation of language specific transformer models trained on all languages combined. For LS, GPT4-turbo zero-shot prompting achieved the best performance.

pdf bib
EmoMix-3L: A Code-Mixed Dataset for Bangla-English-Hindi for Emotion Detection
Nishat Raihan | Dhiman Goswami | Antara Mahmud | Antonios Anastasopoulos | Marcos Zampieri
Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation

Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech. Several studies have been conducted on building datasets and performing downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages. In this paper, we introduce EmoMix-3L, a novel multi-label emotion detection dataset containing code-mixed data from three different languages. We experiment with several models on EmoMix-3L and we report that MuRIL outperforms other models on this dataset.

pdf
Language Variety Identification with True Labels
Marcos Zampieri | Kai North | Tommi Jauhiainen | Mariano Felice | Neha Kumari | Nishant Nair | Yash Mahesh Bangera
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Language identification is an important first step in many NLP applications. Most publicly available language identification datasets, however, are compiled under the assumption that the gold label of each instance is determined by where texts are retrieved from. Research has shown that this is a problematic assumption, particularly in the case of very similar languages (e.g., Croatian and Serbian) and national language varieties (e.g., Brazilian and European Portuguese), where texts may contain no distinctive marker of the particular language or variety. To overcome this important limitation, this paper presents DSL True Labels (DSL-TL), the first human-annotated multilingual dataset for language variety identification. DSL-TL contains a total of 12,900 instances in Portuguese, split between European Portuguese and Brazilian Portuguese; Spanish, split between Argentine Spanish and Castilian Spanish; and English, split between American English and British English. We trained multiple models to discriminate between these language varieties, and we present the results in detail. The data and models presented in this paper provide a reliable benchmark toward the development of robust and fairer language variety identification systems. We make DSL-TL freely available to the research community.

pdf
MentalHelp: A Multi-Task Dataset for Mental Health in Social Media
Nishat Raihan | Sadiya Sayara Chowdhury Puspo | Shafkat Farabi | Ana-Maria Bucur | Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Early detection of mental health disorders is an essential step in treating and preventing mental health conditions. Computational approaches have been applied to users’ social media profiles in an attempt to identify various mental health conditions such as depression, PTSD, schizophrenia, and eating disorders. The interest in this topic has motivated the creation of various depression detection datasets. However, annotating such datasets is expensive and time-consuming, limiting their size and scope. To overcome this limitation, we present MentalHelp, a large-scale semi-supervised mental disorder detection dataset containing 14 million instances. The corpus was collected from Reddit and labeled in a semi-supervised way using an ensemble of three separate models - flan-T5, Disor-BERT, and Mental-BERT.

pdf
MasonTigers at SemEval-2024 Task 9: Solving Puzzles with an Ensemble of Chain-of-Thought Prompts
Nishat Raihan | Dhiman Goswami | Al Nahian Bin Emran | Sadiya Sayara Chowdhury Puspo | Amrita Ganguly | Marcos Zampieri
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

Our paper presents team MasonTigers submission to the SemEval-2024 Task 9 - which provides a dataset of puzzles for testing natural language understanding. We employ large language models (LLMs) to solve this task through several prompting techniques. Zero-shot and few-shot prompting generate reasonably good results when tested with proprietary LLMs, compared to the open-source models. We obtain further improved results with chain-of-thought prompting, an iterative prompting method that breaks down the reasoning process step-by-step. We obtain our best results by utilizing an ensemble of chain-of-thought prompts, placing 2nd in the word puzzle subtask and 13th in the sentence puzzle subtask. The strong performance of prompted LLMs demonstrates their capability for complex reasoning when provided with a decomposition of the thought process. Our work sheds light on how step-wise explanatory prompts can unlock more of the knowledge encoded in the parameters of large models.

pdf
MasonTigers at SemEval-2024 Task 1: An Ensemble Approach for Semantic Textual Relatedness
Dhiman Goswami | Sadiya Sayara Chowdhury Puspo | Nishat Raihan | Al Nahian Bin Emran | Amrita Ganguly | Marcos Zampieri
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

This paper presents the MasonTigers’ entry to the SemEval-2024 Task 1 - Semantic Textual Relatedness. The task encompasses supervised (Track A), unsupervised (Track B), and cross-lingual (Track C) approaches to semantic textual relatedness across 14 languages. MasonTigers stands out as one of the two teams who participated in all languages across the three tracks. Our approaches achieved rankings ranging from 11th to 21st in Track A, from 1st to 8th in Track B, and from 5th to 12th in Track C. Adhering to the task-specific constraints, our best performing approaches utilize an ensemble of statistical machine learning approaches combined with language-specific BERT based models and sentence transformers.

2023

pdf
Target-Based Offensive Language Identification
Marcos Zampieri | Skye Morgan | Kai North | Tharindu Ranasinghe | Austin Simmmons | Paridhi Khandelwal | Sara Rosenthal | Preslav Nakov
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We present TBO, a new dataset for Target-based Offensive language identification. TBO contains post-level annotations regarding the harmfulness of an offensive post and token-level annotations comprising of the target and the offensive argument expression. Popular offensive language identification datasets for social media focus on annotation taxonomies only at the post level and more recently, some datasets have been released that feature only token-level annotations. TBO is an important resource that bridges the gap between post-level and token-level annotation datasets by introducing a single comprehensive unified annotation taxonomy. We use the TBO taxonomy to annotate post-level and token-level offensive language on English Twitter posts. We release an initial dataset of over 4,500 instances collected from Twitter and we carry out multiple experiments to compare the performance of different models trained and tested on TBO.

pdf
Teacher and Student Models of Offensive Language in Social Media
Tharindu Ranasinghe | Marcos Zampieri
Findings of the Association for Computational Linguistics: ACL 2023

State-of-the-art approaches to identifying offensive language online make use of large pre-trained transformer models. However, the inference time, disk, and memory requirements of these transformer models present challenges for their wide usage in the real world. Even the distilled transformer models remain prohibitively large for many usage scenarios. To cope with these challenges, in this paper, we propose transferring knowledge from transformer models to much smaller neural models to make predictions at the token- and at the post-level. We show that this approach leads to lightweight offensive language identification models that perform on par with large transformers but with 100 times fewer parameters and much less memory usage

pdf
A Text-to-Text Model for Multilingual Offensive Language Identification
Tharindu Ranasinghe | Marcos Zampieri
Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings)

pdf bib
Offensive Language Identification in Transliterated and Code-Mixed Bangla
Md Nishat Raihan | Umma Tanmoy | Anika Binte Islam | Kai North | Tharindu Ranasinghe | Antonios Anastasopoulos | Marcos Zampieri
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

Identifying offensive content in social media is vital to create safe online communities. Several recent studies have addressed this problem by creating datasets for various languages. In this paper, we explore offensive language identification in texts with transliterations and code-mixing, linguistic phenomena common in multilingual societies, and a known challenge for NLP systems. We introduce TB-OLID, a transliterated Bangla offensive language dataset containing 5,000 manually annotated comments. We train and fine-tune machine learning models on TB-OLID, and we evaluate their results on this dataset. Our results show that English pre-trained transformer-based models, such as fBERT and HateBERT achieve the best performance on this dataset.

pdf
nlpBDpatriots at BLP-2023 Task 1: Two-Step Classification for Violence Inciting Text Detection in Bangla - Leveraging Back-Translation and Multilinguality
Md Nishat Raihan | Dhiman Goswami | Sadiya Sayara Chowdhury Puspo | Marcos Zampieri
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

In this paper, we discuss the nlpBDpatriots entry to the shared task on Violence Inciting Text Detection (VITD) organized as part of the first workshop on Bangla Language Processing (BLP) co-located with EMNLP. The aim of this task is to identify and classify the violent threats, that provoke further unlawful violent acts. Our best-performing approach for the task is two-step classification using back translation and multilinguality which ranked 6th out of 27 teams with a macro F1 score of 0.74.

pdf
nlpBDpatriots at BLP-2023 Task 2: A Transfer Learning Approach towards Bangla Sentiment Analysis
Dhiman Goswami | Md Nishat Raihan | Sadiya Sayara Chowdhury Puspo | Marcos Zampieri
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

In this paper, we discuss the entry of nlpBDpatriots to some sophisticated approaches for classifying Bangla Sentiment Analysis. This is a shared task of the first workshop on Bangla Language Processing (BLP) organized under EMNLP. The main objective of this task is to identify the sentiment polarity of social media content. There are 30 groups of NLP enthusiasts who participate in this shared task and our best-performing approach for the task is transfer learning with data augmentation. Our group ranked 12th position in this competition with this methodology securing a micro F1 score of 0.71.

pdf
Vicarious Offense and Noise Audit of Offensive Speech Classifiers: Unifying Human and Machine Disagreement on What is Offensive
Tharindu Weerasooriya | Sujan Dutta | Tharindu Ranasinghe | Marcos Zampieri | Christopher Homan | Ashiqur KhudaBukhsh
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Offensive speech detection is a key component of content moderation. However, what is offensive can be highly subjective. This paper investigates how machine and human moderators disagree on what is offensive when it comes to real-world social web political discourse. We show that (1) there is extensive disagreement among the moderators (humans and machines); and (2) human and large-language-model classifiers are unable to predict how other human raters will respond, based on their political leanings. For (1), we conduct a ***noise audit*** at an unprecedented scale that combines both machine and human responses. For (2), we introduce a first-of-its-kind dataset of ***vicarious offense***. Our noise audit reveals that moderation outcomes vary wildly across different machine moderators. Our experiments with human moderators suggest that political leanings combined with sensitive issues affect both first-person and vicarious offense. The dataset is available through https://github.com/Homan-Lab/voiced.

pdf
Publish or Hold? Automatic Comment Moderation in Luxembourgish News Articles
Tharindu Ranasinghe | Alistair Plum | Christoph Purschke | Marcos Zampieri
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Recently, the internet has emerged as the primary platform for accessing news. In the majority of these news platforms, the users now have the ability to post comments on news articles and engage in discussions on various social media. While these features promote healthy conversations among users, they also serve as a breeding ground for spreading fake news, toxic discussions and hate speech. Moderating or removing such content is paramount to avoid unwanted consequences for the readers. How- ever, apart from a few notable exceptions, most research on automatic moderation of news article comments has dealt with English and other high resource languages. This leaves under-represented or low-resource languages at a loss. Addressing this gap, we perform the first large-scale qualitative analysis of more than one million Luxembourgish comments posted over the course of 14 years. We evaluate the performance of state-of-the-art transformer models in Luxembourgish news article comment moderation. Furthermore, we analyse how the language of Luxembourgish news article comments has changed over time. We observe that machine learning models trained on old comments do not perform well on recent data. The findings in this work will be beneficial in building news comment moderation systems for many low-resource languages

pdf
OffMix-3L: A Novel Code-Mixed Test Dataset in Bangla-English-Hindi for Offensive Language Identification
Dhiman Goswami | Md Nishat Raihan | Antara Mahmud | Antonios Anastasopoulos | Marcos Zampieri
Proceedings of the 11th International Workshop on Natural Language Processing for Social Media

pdf
SentMix-3L: A Novel Code-Mixed Test Dataset in Bangla-English-Hindi for Sentiment Analysis
Md Nishat Raihan | Dhiman Goswami | Antara Mahmud | Antonios Anastasopoulos | Marcos Zampieri
Proceedings of the First Workshop in South East Asian Language Processing

pdf
ALEXSIS+: Improving Substitute Generation and Selection for Lexical Simplification with Information Retrieval
Kai North | Alphaeus Dmonte | Tharindu Ranasinghe | Matthew Shardlow | Marcos Zampieri
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

Lexical simplification (LS) automatically replaces words that are deemed difficult to understand for a given target population with simpler alternatives, whilst preserving the meaning of the original sentence. The TSAR-2022 shared task on LS provided participants with a multilingual lexical simplification test set. It contained nearly 1,200 complex words in English, Portuguese, and Spanish and presented multiple candidate substitutions for each complex word. The competition did not make training data available; therefore, teams had to use either off-the-shelf pre-trained large language models (LLMs) or out-domain data to develop their LS systems. As such, participants were unable to fully explore the capabilities of LLMs by re-training and/or fine-tuning them on in-domain data. To address this important limitation, we present ALEXSIS+, a multilingual dataset in the aforementioned three languages, and ALEXSIS++, an English monolingual dataset that together contains more than 50,000 unique sentences retrieved from news corpora and annotated with cosine similarities to the original complex word and sentence. Using these additional contexts, we are able to generate new high-quality candidate substitutions that improve LS performance on the TSAR-2022 test set regardless of the language or model.

pdf bib
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
Yves Scherrer | Tommi Jauhiainen | Nikola Ljubešić | Preslav Nakov | Jörg Tiedemann | Marcos Zampieri
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

pdf
Findings of the VarDial Evaluation Campaign 2023
Noëmi Aepli | Çağrı Çöltekin | Rob Van Der Goot | Tommi Jauhiainen | Mourhaf Kazzaz | Nikola Ljubešić | Kai North | Barbara Plank | Yves Scherrer | Marcos Zampieri
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2023. The campaign is part of the tenth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2023. Three separate shared tasks were included this year: Slot and intent detection for low-resource language varieties (SID4LR), Discriminating Between Similar Languages – True Labels (DSL-TL), and Discriminating Between Similar Languages – Speech (DSL-S). All three tasks were organized for the first time this year.

2022

pdf bib
Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022)
Ritesh Kumar | Atul Kr. Ojha | Marcos Zampieri | Shervin Malmasi | Daniel Kadar
Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022)

pdf bib
Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)
Sanja Štajner | Horacio Saggion | Daniel Ferrés | Matthew Shardlow | Kim Cheng Sheang | Kai North | Marcos Zampieri | Wei Xu
Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

pdf
GMU-WLV at TSAR-2022 Shared Task: Evaluating Lexical Simplification Models
Kai North | Alphaeus Dmonte | Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

This paper describes team GMU-WLV submission to the TSAR shared-task on multilingual lexical simplification. The goal of the task is to automatically provide a set of candidate substitutions for complex words in context. The organizers provided participants with ALEXSIS a manually annotated dataset with instances split between a small trial set with a dozen instances in each of the three languages of the competition (English, Portuguese, Spanish) and a test set with over 300 instances in the three aforementioned languages. To cope with the lack of training data, participants had to either use alternative data sources or pre-trained language models. We experimented with monolingual models: BERTimbau, ELECTRA, and RoBERTA-largeBNE. Our best system achieved 1st place out of sixteen systems for Portuguese, 8th out of thirty-three systems for English, and 6th out of twelve systems for Spanish.

pdf
Findings of the TSAR-2022 Shared Task on Multilingual Lexical Simplification
Horacio Saggion | Sanja Štajner | Daniel Ferrés | Kim Cheng Sheang | Matthew Shardlow | Kai North | Marcos Zampieri
Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

We report findings of the TSAR-2022 shared task on multilingual lexical simplification, organized as part of the Workshop on Text Simplification, Accessibility, and Readability TSAR-2022 held in conjunction with EMNLP 2022. The task called the Natural Language Processing research community to contribute with methods to advance the state of the art in multilingual lexical simplification for English, Portuguese, and Spanish. A total of 14 teams submitted the results of their lexical simplification systems for the provided test data. Results of the shared task indicate new benchmarks in Lexical Simplification with English lexical simplification quantitative results noticeably higher than those obtained for Spanish and (Brazilian) Portuguese.

pdf bib
Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects
Yves Scherrer | Tommi Jauhiainen | Nikola Ljubešić | Preslav Nakov | Jörg Tiedemann | Marcos Zampieri
Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects

pdf bib
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts
Luciana Benotti | Naoaki Okazaki | Yves Scherrer | Marcos Zampieri
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

pdf
Transfer Learning Methods for Domain Adaptation in Technical Logbook Datasets
Farhad Akhbardeh | Marcos Zampieri | Cecilia Ovesdotter Alm | Travis Desell
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Event identification in technical logbooks poses challenges given the limited logbook data available in specific technical domains, the large set of possible classes, and logbook entries typically being in short form and non-standard technical language. Technical logbook data typically has both a domain, the field it comes from (e.g., automotive), and an application, what it is used for (e.g., maintenance). In order to better handle the problem of data scarcity, using a variety of technical logbook datasets, this paper investigates the benefits of using transfer learning from sources within the same domain (but different applications), from within the same application (but different domains) and from all available data. Results show that performing transfer learning within a domain provides statistically significant improvements, and in all cases but one the best performance. Interestingly, transfer learning from within the application or across the global dataset degrades results in all cases but one, which benefited from adding as much data as possible. A further analysis of the dataset similarities shows that the datasets with higher similarity scores performed better in transfer learning tasks, suggesting that this can be utilized to determine the effectiveness of adding a dataset in a transfer learning task for technical logbooks.

pdf bib
Proceedings of the Seventh Conference on Machine Translation (WMT)
Philipp Koehn | Loïc Barrault | Ondřej Bojar | Fethi Bougares | Rajen Chatterjee | Marta R. Costa-jussà | Christian Federmann | Mark Fishel | Alexander Fraser | Markus Freitag | Yvette Graham | Roman Grundkiewicz | Paco Guzman | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Tom Kocmi | André Martins | Makoto Morishita | Christof Monz | Masaaki Nagata | Toshiaki Nakazawa | Matteo Negri | Aurélie Névéol | Mariana Neves | Martin Popel | Marco Turchi | Marcos Zampieri
Proceedings of the Seventh Conference on Machine Translation (WMT)

pdf
An Evaluation of Binary Comparative Lexical Complexity Models
Kai North | Marcos Zampieri | Matthew Shardlow
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

Identifying complex words in texts is an important first step in text simplification (TS) systems. In this paper, we investigate the performance of binary comparative Lexical Complexity Prediction (LCP) models applied to a popular benchmark dataset — the CompLex 2.0 dataset used in SemEval-2021 Task 1. With the data from CompLex 2.0, we create a new dataset contain 1,940 sentences referred to as CompLex-BC. Using CompLex-BC, we train multiple models to differentiate which of two target words is more or less complex in the same sentence. A linear SVM model achieved the best performance in our experiments with an F1-score of 0.86.

pdf
ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification
Kai North | Marcos Zampieri | Tharindu Ranasinghe
Proceedings of the 29th International Conference on Computational Linguistics

Lexical simplification (LS) is the task of automatically replacing complex words for easier ones making texts more accessible to various target populations (e.g. individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their potential substitutions. To continue improving the performance of LS systems we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT has been compiled following the ALEXSIS-ES protocol for Spanish opening exciting new avenues for cross-lingual models. ALEXSIS-PT is the first LS multi-candidate dataset that contains Brazilian newspaper articles. We evaluated three models for substitute generation on this dataset, namely mBERT, XLM-R, and BERTimbau. The latter achieved the highest performance across all evaluation metrics.

2021

pdf bib
SemEval-2021 Task 1: Lexical Complexity Prediction
Matthew Shardlow | Richard Evans | Gustavo Henrique Paetzold | Marcos Zampieri
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper presents the results and main findings of SemEval-2021 Task 1 - Lexical Complexity Prediction. We provided participants with an augmented version of the CompLex Corpus (Shardlow et al. 2020). CompLex is an English multi-domain corpus in which words and multi-word expressions (MWEs) were annotated with respect to their complexity using a five point Likert scale. SemEval-2021 Task 1 featured two Sub-tasks: Sub-task 1 focused on single words and Sub-task 2 focused on MWEs. The competition attracted 198 teams in total, of which 54 teams submitted official runs on the test data to Sub-task 1 and 37 to Sub-task 2.

pdf
LCP-RIT at SemEval-2021 Task 1: Exploring Linguistic Features for Lexical Complexity Prediction
Abhinandan Tejalkumar Desai | Kai North | Marcos Zampieri | Christopher Homan
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper describes team LCP-RIT’s submission to the SemEval-2021 Task 1: Lexical Complexity Prediction (LCP). The task organizers provided participants with an augmented version of CompLex (Shardlow et al., 2020), an English multi-domain dataset in which words in context were annotated with respect to their complexity using a five point Likert scale. Our system uses logistic regression and a wide range of linguistic features (e.g. psycholinguistic features, n-grams, word frequency, POS tags) to predict the complexity of single words in this dataset. We analyze the impact of different linguistic features on the classification performance and we evaluate the results in terms of mean absolute error, mean squared error, Pearson correlation, and Spearman correlation.

pdf
WLV-RIT at SemEval-2021 Task 5: A Neural Transformer Framework for Detecting Toxic Spans
Tharindu Ranasinghe | Diptanu Sarkar | Marcos Zampieri | Alexander Ororbia
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

In recent years, the widespread use of social media has led to an increase in the generation of toxic and offensive content on online platforms. In response, social media platforms have worked on developing automatic detection methods and employing human moderators to cope with this deluge of offensive content. While various state-of-the-art statistical models have been applied to detect toxic posts, there are only a few studies that focus on detecting the words or expressions that make a post offensive. This motivates the organization of the SemEval-2021 Task 5: Toxic Spans Detection competition, which has provided participants with a dataset containing toxic spans annotation in English posts. In this paper, we present the WLV-RIT entry for the SemEval-2021 Task 5. Our best performing neural transformer model achieves an 0.68 F1-Score. Furthermore, we develop an open-source framework for multilingual detection of offensive spans, i.e., MUDES, based on neural transformers that detect toxic spans in texts.

pdf
Handling Extreme Class Imbalance in Technical Logbook Datasets
Farhad Akhbardeh | Cecilia Ovesdotter Alm | Marcos Zampieri | Travis Desell
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Technical logbooks are a challenging and under-explored text type in automated event identification. These texts are typically short and written in non-standard yet technical language, posing challenges to off-the-shelf NLP pipelines. The granularity of issue types described in these datasets additionally leads to class imbalance, making it challenging for models to accurately predict which issue each logbook entry describes. In this paper we focus on the problem of technical issue classification by considering logbook datasets from the automotive, aviation, and facilities maintenance domains. We adapt a feedback strategy from computer vision for handling extreme class imbalance, which resamples the training data based on its error in the prediction process. Our experiments show that with statistical significance this feedback strategy provides the best results for four different neural network models trained across a suite of seven different technical logbook datasets from distinct technical domains. The feedback strategy is also generic and could be applied to any learning problem with substantial class imbalances.

pdf
Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi
Saurabh Sampatrao Gaikwad | Tharindu Ranasinghe | Marcos Zampieri | Christopher Homan
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

The widespread presence of offensive language on social media motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind compiled for Marathi, thus opening a new domain for research in low-resource Indo-Aryan languages. We present results from several machine learning experiments on this dataset, including zero-short and other transfer learning experiments on state-of-the-art cross-lingual transformers from existing data in Bengali, English, and Hindi.

pdf bib
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Yves Scherrer | Tommi Jauhiainen
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

pdf bib
Findings of the VarDial Evaluation Campaign 2021
Bharathi Raja Chakravarthi | Gaman Mihaela | Radu Tudor Ionescu | Heidi Jauhiainen | Tommi Jauhiainen | Krister Lindén | Nikola Ljubešić | Niko Partanen | Ruba Priyadharshini | Christoph Purschke | Eswari Rajagopal | Yves Scherrer | Marcos Zampieri
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper describes the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2021. The campaign was part of the eighth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2021. Four separate shared tasks were included this year: Dravidian Language Identification (DLI), Romanian Dialect Identification (RDI), Social Media Variety Geolocation (SMG), and Uralic Language Identification (ULI). DLI was organized for the first time and the other three continued a series of tasks from previous evaluation campaigns.

pdf
Comparing Approaches to Dravidian Language Identification
Tommi Jauhiainen | Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper describes the submissions by team HWR to the Dravidian Language Identification (DLI) shared task organized at VarDial 2021 workshop. The DLI training set includes 16,674 YouTube comments written in Roman script containing code-mixed text with English and one of the three South Dravidian languages: Kannada, Malayalam, and Tamil. We submitted results generated using two models, a Naive Bayes classifier with adaptive language models, which has shown to obtain competitive performance in many language and dialect identification tasks, and a transformer-based model which is widely regarded as the state-of-the-art in a number of NLP tasks. Our first submission was sent in the closed submission track using only the training set provided by the shared task organisers, whereas the second submission is considered to be open as it used a pretrained model trained with external data. Our team attained shared second position in the shared task with the submission based on Naive Bayes. Our results reinforce the idea that deep learning methods are not as competitive in language identification related tasks as they are in many other text classification tasks.

pdf
MUDES: Multilingual Detection of Offensive Spans
Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations

The interest in offensive content identification in social media has grown substantially in recent years. Previous work has dealt mostly with post level annotations. However, identifying offensive spans is useful in many ways. To help coping with this important challenge, we present MUDES, a multilingual system to detect offensive spans in texts. MUDES features pre-trained models, a Python API for developers, and a user-friendly web-based interface. A detailed description of MUDES’ components is presented in this paper.

pdf
WLV-RIT at GermEval 2021: Multitask Learning with Transformers to Detect Toxic, Engaging, and Fact-Claiming Comments
Skye Morgan | Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments

This paper addresses the identification of toxic, engaging, and fact-claiming comments on social media. We used the dataset made available by the organizers of the GermEval2021 shared task containing over 3,000 manually annotated Facebook comments in German. Considering the relatedness of the three tasks, we approached the problem using large pre-trained transformer models and multitask learning. Our results indicate that multitask learning achieves performance superior to the more common single task learning approach in all three tasks. We submit our best systems to GermEval-2021 under the team name WLV-RIT.

pdf bib
Findings of the 2021 Conference on Machine Translation (WMT21)
Farhad Akhbardeh | Arkady Arkhangorodsky | Magdalena Biesialska | Ondřej Bojar | Rajen Chatterjee | Vishrav Chaudhary | Marta R. Costa-jussa | Cristina España-Bonet | Angela Fan | Christian Federmann | Markus Freitag | Yvette Graham | Roman Grundkiewicz | Barry Haddow | Leonie Harter | Kenneth Heafield | Christopher Homan | Matthias Huck | Kwabena Amponsah-Kaakyire | Jungo Kasai | Daniel Khashabi | Kevin Knight | Tom Kocmi | Philipp Koehn | Nicholas Lourie | Christof Monz | Makoto Morishita | Masaaki Nagata | Ajay Nagesh | Toshiaki Nakazawa | Matteo Negri | Santanu Pal | Allahsera Auguste Tapo | Marco Turchi | Valentin Vydrin | Marcos Zampieri
Proceedings of the Sixth Conference on Machine Translation

This paper presents the results of the newstranslation task, the multilingual low-resourcetranslation for Indo-European languages, thetriangular translation task, and the automaticpost-editing task organised as part of the Con-ference on Machine Translation (WMT) 2021.In the news task, participants were asked tobuild machine translation systems for any of10 language pairs, to be evaluated on test setsconsisting mainly of news stories. The taskwas also opened up to additional test suites toprobe specific aspects of translation.

pdf
SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification
Sara Rosenthal | Pepa Atanasova | Georgi Karadzhov | Marcos Zampieri | Preslav Nakov
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf
An Exploratory Analysis of the Relation between Offensive Language and Mental Health
Ana-Maria Bucur | Marcos Zampieri | Liviu P. Dinu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf
fBERT: A Neural Transformer for Identifying Offensive Content
Diptanu Sarkar | Marcos Zampieri | Tharindu Ranasinghe | Alexander Ororbia
Findings of the Association for Computational Linguistics: EMNLP 2021

Transformer-based models such as BERT, XLNET, and XLM-R have achieved state-of-the-art performance across various NLP tasks including the identification of offensive language and hate speech, an important problem in social media. In this paper, we present fBERT, a BERT model retrained on SOLID, the largest English offensive language identification corpus available with over 1.4 million offensive instances. We evaluate fBERT’s performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID. The fBERT model will be made freely available to the community.

pdf
A Computational Exploration of Pejorative Language in Social Media
Liviu P. Dinu | Ioan-Bogdan Iordache | Ana Sabina Uban | Marcos Zampieri
Findings of the Association for Computational Linguistics: EMNLP 2021

In this paper we study pejorative language, an under-explored topic in computational linguistics. Unlike existing models of offensive language and hate speech, pejorative language manifests itself primarily at the lexical level, and describes a word that is used with a negative connotation, making it different from offensive language or other more studied categories. Pejorativity is also context-dependent: the same word can be used with or without pejorative connotations, thus pejorativity detection is essentially a problem similar to word sense disambiguation. We leverage online dictionaries to build a multilingual lexicon of pejorative terms for English, Spanish, Italian, and Romanian. We additionally release a dataset of tweets annotated for pejorative use. Based on these resources, we present an analysis of the usage and occurrence of pejorative words in social media, and present an attempt to automatically disambiguate pejorative usage in our dataset.

2020

pdf bib
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Yves Scherrer
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

pdf bib
A Report on the VarDial Evaluation Campaign 2020
Mihaela Gaman | Dirk Hovy | Radu Tudor Ionescu | Heidi Jauhiainen | Tommi Jauhiainen | Krister Lindén | Nikola Ljubešić | Niko Partanen | Christoph Purschke | Yves Scherrer | Marcos Zampieri
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020. The campaign included three shared tasks each focusing on a different challenge of language and dialect identification: Romanian Dialect Identification (RDI), Social Media Variety Geolocation (SMG), and Uralic Language Identification (ULI). The campaign attracted 30 teams who enrolled to participate in one or multiple shared tasks and 14 of them submitted runs across the three shared tasks. Finally, 11 papers describing participating systems are published in the VarDial proceedings and referred to in this report.

pdf bib
MaintNet: A Collaborative Open-Source Library for Predictive Maintenance Language Resources
Farhad Akhbardeh | Travis Desell | Marcos Zampieri
Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations

Maintenance record logbooks are an emerging text type in NLP. An important part of them typically consist of free text with many domain specific technical terms, abbreviations, and non-standard spelling and grammar. This poses difficulties for NLP pipelines trained on standard corpora. Analyzing and annotating such documents is of particular importance in the development of predictive maintenance systems, which aim to improve operational efficiency, reduce costs, prevent accidents, and save lives. In order to facilitate and encourage research in this area, we have developed MaintNet, a collaborative open-source library of technical and domain-specific language resources. MaintNet provides novel logbook data from the aviation, automotive, and facility maintenance domains along with tools to aid in their (pre-)processing and clustering. Furthermore, it provides a way to encourage discussion on and sharing of new datasets and tools for logbook data analysis.

pdf
NLP Tools for Predictive Maintenance Records in MaintNet
Farhad Akhbardeh | Travis Desell | Marcos Zampieri
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: System Demonstrations

Processing maintenance logbook records is an important step in the development of predictive maintenance systems. Logbooks often include free text fields with domain specific terms, abbreviations, and non-standard spelling posing challenges to off-the-shelf NLP pipelines trained on standard contemporary corpora. Despite the importance of this data type, processing predictive maintenance data is still an under-explored topic in NLP. With the goal of providing more datasets and resources to the community, in this paper we present a number of new resources available in MaintNet, a collaborative open-source library and data repository of predictive maintenance language datasets. We describe novel annotated datasets from multiple domains such as aviation, automotive, and facility maintenance domains and new tools for segmentation, spell checking, POS tagging, clustering, and classification.

pdf bib
Findings of the 2020 Conference on Machine Translation (WMT20)
Loïc Barrault | Magdalena Biesialska | Ondřej Bojar | Marta R. Costa-jussà | Christian Federmann | Yvette Graham | Roman Grundkiewicz | Barry Haddow | Matthias Huck | Eric Joanis | Tom Kocmi | Philipp Koehn | Chi-kiu Lo | Nikola Ljubešić | Christof Monz | Makoto Morishita | Masaaki Nagata | Toshiaki Nakazawa | Santanu Pal | Matt Post | Marcos Zampieri
Proceedings of the Fifth Conference on Machine Translation

This paper presents the results of the news translation task and the similar language translation task, both organised alongside the Conference on Machine Translation (WMT) 2020. In the news task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting mainly of news stories. The task was also opened up to additional test suites to probe specific aspects of translation. In the similar language translation task, participants built machine translation systems for translating between closely related pairs of languages.

pdf
Neural Machine Translation for Similar Languages: The Case of Indo-Aryan Languages
Santanu Pal | Marcos Zampieri
Proceedings of the Fifth Conference on Machine Translation

In this paper we present the WIPRO-RIT systems submitted to the Similar Language Translation shared task at WMT 2020. The second edition of this shared task featured parallel data from pairs/groups of similar languages from three different language families: Indo-Aryan languages (Hindi and Marathi), Romance languages (Catalan, Portuguese, and Spanish), and South Slavic Languages (Croatian, Serbian, and Slovene). We report the results obtained by our systems in translating from Hindi to Marathi and from Marathi to Hindi. WIPRO-RIT achieved competitive performance ranking 1st in Marathi to Hindi and 2nd in Hindi to Marathi translation among 22 systems.

pdf
Neural Machine Translation for Extremely Low-Resource African Languages: A Case Study on Bambara
Allahsera Auguste Tapo | Bakary Coulibaly | Sébastien Diarra | Christopher Homan | Julia Kreutzer | Sarah Luger | Arthur Nagashima | Marcos Zampieri | Michael Leventhal
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

Low-resource languages present unique challenges to (neural) machine translation. We discuss the case of Bambara, a Mande language for which training data is scarce and requires significant amounts of pre-processing. More than the linguistic situation of Bambara itself, the socio-cultural context within which Bambara speakers live poses challenges for automated processing of this language. In this paper, we present the first parallel data set for machine translation of Bambara into and from English and French and the first benchmark results on machine translation to and from Bambara. We discuss challenges in working with low-resource languages and propose strategies to cope with data scarcity in low-resource machine translation (MT).

pdf
SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)
Marcos Zampieri | Preslav Nakov | Sara Rosenthal | Pepa Atanasova | Georgi Karadzhov | Hamdy Mubarak | Leon Derczynski | Zeses Pitenis | Çağrı Çöltekin
Proceedings of the Fourteenth Workshop on Semantic Evaluation

We present the results and the main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval-2020). The task included three subtasks corresponding to the hierarchical taxonomy of the OLID schema from OffensEval-2019, and it was offered in five languages: Arabic, Danish, English, Greek, and Turkish. OffensEval-2020 was one of the most popular tasks at SemEval-2020, attracting a large number of participants across all subtasks and languages: a total of 528 teams signed up to participate in the task, 145 teams submitted official runs on the test data, and 70 teams submitted system description papers.

pdf bib
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying
Ritesh Kumar | Atul Kr. Ojha | Bornini Lahiri | Marcos Zampieri | Shervin Malmasi | Vanessa Murdock | Daniel Kadar
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

pdf bib
Evaluating Aggression Identification in Social Media
Ritesh Kumar | Atul Kr. Ojha | Shervin Malmasi | Marcos Zampieri
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

In this paper, we present the report and findings of the Shared Task on Aggression and Gendered Aggression Identification organised as part of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC - 2) at LREC 2020. The task consisted of two sub-tasks - aggression identification (sub-task A) and gendered identification (sub-task B) - in three languages - Bangla, Hindi and English. For this task, the participants were provided with a dataset of approximately 5,000 instances from YouTube comments in each language. For testing, approximately 1,000 instances were provided in each language for each sub-task. A total of 70 teams registered to participate in the task and 19 teams submitted their test runs. The best system obtained a weighted F-score of approximately 0.80 in sub-task A for all the three languages. While approximately 0.87 in sub-task B for all the three languages.

pdf
Offensive Language Identification in Greek
Zesis Pitenis | Marcos Zampieri | Tharindu Ranasinghe
Proceedings of the Twelfth Language Resources and Evaluation Conference

As offensive language has become a rising issue for online communities and social media platforms, researchers have been investigating ways of coping with abusive content and developing systems to detect its different types: cyberbullying, hate speech, aggression, etc. With a few notable exceptions, most research on this topic so far has dealt with English. This is mostly due to the availability of language resources for English. To address this shortcoming, this paper presents the first Greek annotated dataset for offensive language identification: the Offensive Greek Tweet Dataset (OGTD). OGTD is a manually annotated dataset containing 4,779 posts from Twitter annotated as offensive and not offensive. Along with a detailed description of the dataset, we evaluate several computational models trained and tested on this data.

pdf
Multilingual Offensive Language Identification with Cross-lingual Embeddings
Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have been recently published investigating methods to detect the various forms of such content (e.g. hate speech, cyberbulling, and cyberaggression). The clear majority of these studies deal with English partially because most annotated datasets available contain English data. In this paper, we take advantage of English data available by applying cross-lingual contextual word embeddings and transfer learning to make predictions in languages with less resources. We project predictions on comparable data in Bengali, Hindi, and Spanish and we report results of 0.8415 F1 macro for Bengali, 0.8568 F1 macro for Hindi, and 0.7513 F1 macro for Spanish. Finally, we show that our approach compares favorably to the best systems submitted to recent shared tasks on these three languages, confirming the robustness of cross-lingual contextual embeddings and transfer learning for this task.

pdf
CompLex — A New Corpus for Lexical Complexity Prediction from Likert Scale Data
Matthew Shardlow | Michael Cooper | Marcos Zampieri
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

Predicting which words are considered hard to understand for a given target population is a vital step in many NLP applications such astext simplification. This task is commonly referred to as Complex Word Identification (CWI). With a few exceptions, previous studieshave approached the task as a binary classification task in which systems predict a complexity value (complex vs. non-complex) fora set of target words in a text. This choice is motivated by the fact that all CWI datasets compiled so far have been annotated using abinary annotation scheme. Our paper addresses this limitation by presenting the first English dataset for continuous lexical complexityprediction. We use a 5-point Likert scale scheme to annotate complex words in texts from three sources/domains: the Bible, Europarl,and biomedical texts. This resulted in a corpus of 9,476 sentences each annotated by around 7 annotators.

2019

pdf
SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)
Marcos Zampieri | Shervin Malmasi | Preslav Nakov | Sara Rosenthal | Noura Farra | Ritesh Kumar
Proceedings of the 13th International Workshop on Semantic Evaluation

We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval). The task was based on a new dataset, the Offensive Language Identification Dataset (OLID), which contains over 14,000 English tweets, and it featured three sub-tasks. In sub-task A, systems were asked to discriminate between offensive and non-offensive posts. In sub-task B, systems had to identify the type of offensive content in the post. Finally, in sub-task C, systems had to detect the target of the offensive posts. OffensEval attracted a large number of participants and it was one of the most popular tasks in SemEval-2019. In total, nearly 800 teams signed up to participate in the task and 115 of them submitted results, which are presented and analyzed in this report.

pdf
UTFPR at SemEval-2019 Task 5: Hate Speech Identification with Recurrent Neural Networks
Gustavo Henrique Paetzold | Marcos Zampieri | Shervin Malmasi
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper we revisit the problem of automatically identifying hate speech in posts from social media. We approach the task using a system based on minimalistic compositional Recurrent Neural Networks (RNN). We tested our approach on the SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (HatEval) shared task dataset. The dataset made available by the HatEval organizers contained English and Spanish posts retrieved from Twitter annotated with respect to the presence of hateful content and its target. In this paper we present the results obtained by our system in comparison to the other entries in the shared task. Our system achieved competitive performance ranking 7th in sub-task A out of 62 systems in the English track.

pdf
Predicting the Type and Target of Offensive Posts in Social Media
Marcos Zampieri | Shervin Malmasi | Preslav Nakov | Sara Rosenthal | Noura Farra | Ritesh Kumar
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

As offensive content has become pervasive in social media, there has been much research in identifying potentially offensive messages. However, previous work on this topic did not consider the problem as a whole, but rather focused on detecting very specific types of offensive content, e.g., hate speech, cyberbulling, or cyber-aggression. In contrast, here we target several different kinds of offensive content. In particular, we model the task hierarchically, identifying the type and the target of offensive messages in social media. For this purpose, we complied the Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, which we make publicly available. We discuss the main similarities and differences between OLID and pre-existing datasets for hate speech identification, aggression detection, and similar tasks. We further experiment with and we compare the performance of different machine learning models on OLID.

pdf bib
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects
Marcos Zampieri | Preslav Nakov | Shervin Malmasi | Nikola Ljubešić | Jörg Tiedemann | Ahmed Ali
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

pdf bib
A Report on the Third VarDial Evaluation Campaign
Marcos Zampieri | Shervin Malmasi | Yves Scherrer | Tanja Samardžić | Francis Tyers | Miikka Silfverberg | Natalia Klyueva | Tung-Le Pan | Chu-Ren Huang | Radu Tudor Ionescu | Andrei M. Butnaru | Tommi Jauhiainen
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

In this paper, we present the findings of the Third VarDial Evaluation Campaign organized as part of the sixth edition of the workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2019. This year, the campaign included five shared tasks, including one task re-run – German Dialect Identification (GDI) – and four new tasks – Cross-lingual Morphological Analysis (CMA), Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT), Moldavian vs. Romanian Cross-dialect Topic identification (MRC), and Cuneiform Language Identification (CLI). A total of 22 teams submitted runs across the five shared tasks. After the end of the competition, we received 14 system description papers, which are published in the VarDial workshop proceedings and referred to in this report.

pdf
Experiments in Cuneiform Language Identification
Gustavo Henrique Paetzold | Marcos Zampieri
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper presents methods to discriminate between languages and dialects written in Cuneiform script, one of the first writing systems in the world. We report the results obtained by the PZ team in the Cuneiform Language Identification (CLI) shared task organized within the scope of the VarDial Evaluation Campaign 2019. The task included two languages, Sumerian and Akkadian. The latter is divided into six dialects: Old Babylonian, Middle Babylonian peripheral, Standard Babylonian, Neo Babylonian, Late Babylonian, and Neo Assyrian. We approach the task using a meta-classifier trained on various SVM models and we show the effectiveness of the system for this task. Our submission achieved 0.738 F1 score in discriminating between the seven languages and dialects and it was ranked fourth in the competition among eight teams.

pdf bib
Findings of the 2019 Conference on Machine Translation (WMT19)
Loïc Barrault | Ondřej Bojar | Marta R. Costa-jussà | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Philipp Koehn | Shervin Malmasi | Christof Monz | Mathias Müller | Santanu Pal | Matt Post | Marcos Zampieri
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.

pdf
UDSDFKI Submission to the WMT2019 Czech–Polish Similar Language Translation Shared Task
Santanu Pal | Marcos Zampieri | Josef van Genabith
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

In this paper we present the UDS-DFKI system submitted to the Similar Language Translation shared task at WMT 2019. The first edition of this shared task featured data from three pairs of similar languages: Czech and Polish, Hindi and Nepali, and Portuguese and Spanish. Participants could choose to participate in any of these three tracks and submit system outputs in any translation direction. We report the results obtained by our system in translating from Czech to Polish and comment on the impact of out-of-domain test data in the performance of our system. UDS-DFKI achieved competitive performance ranking second among ten teams in Czech to Polish translation.

pdf bib
Improving CAT Tools in the Translation Workflow: New Approaches and Evaluation
Mihaela Vela | Santanu Pal | Marcos Zampieri | Sudip Naskar | Josef van Genabith
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

2018

pdf
Classifying Patent Applications with Ensemble Methods
Fernando Benites | Shervin Malmasi | Marcos Zampieri
Proceedings of the Australasian Language Technology Association Workshop 2018

We present methods for the automatic classification of patent applications using an annotated dataset provided by the organizers of the ALTA 2018 shared task - Classifying Patent Applications. The goal of the task is to use computational methods to categorize patent applications according to a coarse-grained taxonomy of eight classes based on the International Patent Classification (IPC). We tested a variety of approaches for this task and the best results, 0.778 micro-averaged F1-Score, were achieved by SVM ensembles using a combination of words and characters as features. Our team, BMZ, was ranked first among 14 teams in the competition.

pdf
LIdioms: A Multilingual Linked Idioms Data Set
Diego Moussallem | Mohamed Ahmed Sherif | Diego Esteves | Marcos Zampieri | Axel-Cyrille Ngonga Ngomo
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
RDF2PT: Generating Brazilian Portuguese Texts from RDF Data
Diego Moussallem | Thiago Ferreira | Marcos Zampieri | Maria Claudia Cavalcanti | Geraldo Xexéo | Mariana Neves | Axel-Cyrille Ngonga Ngomo
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
A Report on the Complex Word Identification Shared Task 2018
Seid Muhie Yimam | Chris Biemann | Shervin Malmasi | Gustavo Paetzold | Lucia Specia | Sanja Štajner | Anaïs Tack | Marcos Zampieri
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

We report the findings of the second Complex Word Identification (CWI) shared task organized as part of the BEA workshop co-located with NAACL-HLT’2018. The second CWI shared task featured multilingual and multi-genre datasets divided into four tracks: English monolingual, German monolingual, Spanish monolingual, and a multilingual track with a French test set, and two tasks: binary classification and probabilistic classification. A total of 12 teams submitted their results in different task/track combinations and 11 of them wrote system description papers that are referred to in this report and appear in the BEA workshop proceedings.

pdf
A Portuguese Native Language Identification Dataset
Iria del Río Gayo | Marcos Zampieri | Shervin Malmasi
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

In this paper we present NLI-PT, the first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author’s first language based on their second language writing. The dataset includes 1,868 student essays written by learners of European Portuguese, native speakers of the following L1s: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian, and Swedish. NLI-PT includes the original student text and four different types of annotation: POS, fine-grained POS, constituency parses, and dependency parses. NLI-PT can be used not only in NLI but also in research on several topics in the field of Second Language Acquisition and educational NLP. We discuss possible applications of this dataset and present the results obtained for the first lexical baseline system for Portuguese NLI.

pdf bib
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Shervin Malmasi | Ahmed Ali
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

pdf bib
Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign
Marcos Zampieri | Shervin Malmasi | Preslav Nakov | Ahmed Ali | Suwon Shon | James Glass | Yves Scherrer | Tanja Samardžić | Nikola Ljubešić | Jörg Tiedemann | Chris van der Lee | Stefan Grondelaers | Nelleke Oostdijk | Dirk Speelman | Antal van den Bosch | Ritesh Kumar | Bornini Lahiri | Mayank Jain
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, collocated with COLING’2018. This year, the campaign included five shared tasks, including two task re-runs – Arabic Dialect Identification (ADI) and German Dialect Identification (GDI) –, and three new tasks – Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks, and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report.

pdf
Discriminating between Indo-Aryan Languages Using SVM Ensembles
Alina Maria Ciobanu | Marcos Zampieri | Shervin Malmasi | Santanu Pal | Liviu P. Dinu
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

In this paper we present a system based on SVM ensembles trained on characters and words to discriminate between five similar languages of the Indo-Aryan family: Hindi, Braj Bhasha, Awadhi, Bhojpuri, and Magahi. The system competed in the Indo-Aryan Language Identification (ILI) shared task organized within the VarDial Evaluation Campaign 2018. Our best entry in the competition, named ILIdentification, scored 88.95% F1 score and it was ranked 3rd out of 8 teams.

pdf
A Neural Approach to Language Variety Translation
Marta R. Costa-jussà | Marcos Zampieri | Santanu Pal
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

In this paper we present the first neural-based machine translation system trained to translate between standard national varieties of the same language. We take the pair Brazilian - European Portuguese as an example and compare the performance of this method to a phrase-based statistical machine translation system. We report a performance improvement of 0.9 BLEU points in translating from European to Brazilian Portuguese and 0.2 BLEU points when translating in the opposite direction. We also carried out a human evaluation experiment with native speakers of Brazilian Portuguese which indicates that humans prefer the output produced by the neural-based system in comparison to the statistical system.

pdf bib
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)
Ritesh Kumar | Atul Kr. Ojha | Marcos Zampieri | Shervin Malmasi
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

pdf bib
Benchmarking Aggression Identification in Social Media
Ritesh Kumar | Atul Kr. Ojha | Shervin Malmasi | Marcos Zampieri
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

In this paper, we present the report and findings of the Shared Task on Aggression Identification organised as part of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC - 1) at COLING 2018. The task was to develop a classifier that could discriminate between Overtly Aggressive, Covertly Aggressive, and Non-aggressive texts. For this task, the participants were provided with a dataset of 15,000 aggression-annotated Facebook Posts and Comments each in Hindi (in both Roman and Devanagari script) and English for training and validation. For testing, two different sets - one from Facebook and another from a different social media - were provided. A total of 130 teams registered to participate in the task, 30 teams submitted their test runs, and finally 20 teams also sent their system description paper which are included in the TRAC workshop proceedings. The best system obtained a weighted F-score of 0.64 for both Hindi and English on the Facebook test sets, while the best scores on the surprise set were 0.60 and 0.50 for English and Hindi respectively. The results presented in this report depict how challenging the task is. The positive response from the community and the great levels of participation in the first edition of this shared task also highlights the interest in this topic.

2017

pdf
Detecting Hate Speech in Social Media
Shervin Malmasi | Marcos Zampieri
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In this paper we examine methods to detect hate speech in social media, while distinguishing this from general profanity. We aim to establish lexical baselines for this task by applying supervised classification methods using a recently released dataset annotated for this purpose. As features, our system uses character n-grams, word n-grams and word skip-grams. We obtain results of 78% accuracy in identifying posts across three classes. Results demonstrate that the main challenge lies in discriminating profanity and hate speech from each other. A number of directions for future work are discussed.

pdf
Predicting the Law Area and Decisions of French Supreme Court Cases
Octavia-Maria Şulea | Marcos Zampieri | Mihaela Vela | Josef van Genabith
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In this paper, we investigate the application of text classification methods to predict the law area and the decision of cases judged by the French Supreme Court. We also investigate the influence of the time period in which a ruling was made over the textual form of the case description and the extent to which it is necessary to mask the judge’s motivation for a ruling to emulate a real-world test scenario. We report results of 96% f1 score in predicting a case ruling, 90% f1 score in predicting the law area of a case, and 75.9% f1 score in estimating the time span when a ruling has been issued using a linear Support Vector Machine (SVM) classifier trained on lexical features.

pdf bib
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)
Preslav Nakov | Marcos Zampieri | Nikola Ljubešić | Jörg Tiedemann | Shevin Malmasi | Ahmed Ali
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

pdf bib
Findings of the VarDial Evaluation Campaign 2017
Marcos Zampieri | Shervin Malmasi | Nikola Ljubešić | Preslav Nakov | Ahmed Ali | Jörg Tiedemann | Yves Scherrer | Noëmi Aepli
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

We present the results of the VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects, which we organized as part of the fourth edition of the VarDial workshop at EACL’2017. This year, we included four shared tasks: Discriminating between Similar Languages (DSL), Arabic Dialect Identification (ADI), German Dialect Identification (GDI), and Cross-lingual Dependency Parsing (CLP). A total of 19 teams submitted runs across the four tasks, and 15 of them wrote system description papers.

pdf
German Dialect Identification in Interview Transcriptions
Shervin Malmasi | Marcos Zampieri
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This paper presents three systems submitted to the German Dialect Identification (GDI) task at the VarDial Evaluation Campaign 2017. The task consists of training models to identify the dialect of Swiss-German speech transcripts. The dialects included in the GDI dataset are Basel, Bern, Lucerne, and Zurich. The three systems we submitted are based on: a plurality ensemble, a mean probability ensemble, and a meta-classifier trained on character and word n-grams. The best results were obtained by the meta-classifier achieving 68.1% accuracy and 66.2% F1-score, ranking first among the 10 teams which participated in the GDI shared task.

pdf
Arabic Dialect Identification Using iVectors and ASR Transcripts
Shervin Malmasi | Marcos Zampieri
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This paper presents the systems submitted by the MAZA team to the Arabic Dialect Identification (ADI) shared task at the VarDial Evaluation Campaign 2017. The goal of the task is to evaluate computational models to identify the dialect of Arabic utterances using both audio and text transcriptions. The ADI shared task dataset included Modern Standard Arabic (MSA) and four Arabic dialects: Egyptian, Gulf, Levantine, and North-African. The three systems submitted by MAZA are based on combinations of multiple machine learning classifiers arranged as (1) voting ensemble; (2) mean probability ensemble; (3) meta-classifier. The best results were obtained by the meta-classifier achieving 71.7% accuracy, ranking second among the six teams which participated in the ADI shared task.

pdf
Native Language Identification on Text and Speech
Marcos Zampieri | Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

This paper presents an ensemble system combining the output of multiple SVM classifiers to native language identification (NLI). The system was submitted to the NLI Shared Task 2017 fusion track which featured students essays and spoken responses in form of audio transcriptions and iVectors by non-native English speakers of eleven native languages. Our system competed in the challenge under the team name ZCD and was based on an ensemble of SVM classifiers trained on character n-grams achieving 83.58% accuracy and ranking 3rd in the shared task.

pdf
Complex Word Identification: Challenges in Data Annotation and System Performance
Marcos Zampieri | Shervin Malmasi | Gustavo Paetzold | Lucia Specia
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)

This paper revisits the problem of complex word identification (CWI) following up the SemEval CWI shared task. We use ensemble classifiers to investigate how well computational methods can discriminate between complex and non-complex words. Furthermore, we analyze the classification performance to understand what makes lexical complexity challenging. Our findings show that most systems performed poorly on the SemEval CWI dataset, and one of the reasons for that is the way in which human annotation was performed.

2016

pdf
MAZA at SemEval-2016 Task 11: Detecting Lexical Complexity Using a Decision Stump Meta-Classifier
Shervin Malmasi | Marcos Zampieri
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf
LTG at SemEval-2016 Task 11: Complex Word Identification with Classifier Ensembles
Shervin Malmasi | Mark Dras | Marcos Zampieri
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf
MacSaar at SemEval-2016 Task 11: Zipfian and Character Features for ComplexWord Identification
Marcos Zampieri | Liling Tan | Josef van Genabith
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf
Predicting Post Severity in Mental Health Forums
Shervin Malmasi | Marcos Zampieri | Mark Dras
Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology

pdf bib
Findings of the 2016 Conference on Machine Translation
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | Varvara Logacheva | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Martin Popel | Matt Post | Raphael Rubino | Carolina Scarton | Lucia Specia | Marco Turchi | Karin Verspoor | Marcos Zampieri
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf
USAAR: An Operation Sequential Model for Automatic Statistical Post-Editing
Santanu Pal | Marcos Zampieri | Josef van Genabith
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
Preslav Nakov | Marcos Zampieri | Liling Tan | Nikola Ljubešić | Jörg Tiedemann | Shervin Malmasi
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

pdf bib
Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task
Shervin Malmasi | Marcos Zampieri | Nikola Ljubešić | Preslav Nakov | Ahmed Ali | Jörg Tiedemann
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

We present the results of the third edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial’2016 workshop at COLING’2016. The challenge offered two subtasks: subtask 1 focused on the identification of very similar languages and language varieties in newswire texts, whereas subtask 2 dealt with Arabic dialect identification in speech transcripts. A total of 37 teams registered to participate in the task, 24 teams submitted test results, and 20 teams also wrote system description papers. High-order character n-grams were the most successful feature, and the best classification approaches included traditional supervised learning methods such as SVM, logistic regression, and language models, while deep learning approaches did not perform very well.

pdf
Arabic Dialect Identification in Speech Transcripts
Shervin Malmasi | Marcos Zampieri
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

In this paper we describe a system developed to identify a set of four regional Arabic dialects (Egyptian, Gulf, Levantine, North African) and Modern Standard Arabic (MSA) in a transcribed speech corpus. We competed under the team name MAZA in the Arabic Dialect Identification sub-task of the 2016 Discriminating between Similar Languages (DSL) shared task. Our system achieved an F1-score of 0.51 in the closed training track, ranking first among the 18 teams that participated in the sub-task. Our system utilizes a classifier ensemble with a set of linear models as base classifiers. We experimented with three different ensemble fusion strategies, with the mean probability approach providing the best performance.

pdf
CATaLog Online: Porting a Post-editing Tool to the Web
Santanu Pal | Marcos Zampieri | Sudip Kumar Naskar | Tapas Nayak | Mihaela Vela | Josef van Genabith
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents CATaLog online, a new web-based MT and TM post-editing tool. CATaLog online is a freeware software that can be used through a web browser and it requires only a simple registration. The tool features a number of editing and log functions similar to the desktop version of CATaLog enhanced with several new features that we describe in detail in this paper. CATaLog online is designed to allow users to post-edit both translation memory segments as well as machine translation output. The tool provides a complete set of log information currently not available in most commercial CAT tools. Log information can be used both for project management purposes as well as for the study of the translation process and translator’s productivity.

pdf
Discriminating Similar Languages: Evaluations and Explorations
Cyril Goutte | Serge Léger | Shervin Malmasi | Marcos Zampieri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present an analysis of the performance of machine learning classifiers on discriminating between similar languages and language varieties. We carried out a number of experiments using the results of the two editions of the Discriminating between Similar Languages (DSL) shared task. We investigate the progress made between the two tasks, estimate an upper bound on possible performance using ensemble and oracle combination, and provide learning curves to help us understand which languages are more challenging. A number of difficult sentences are identified and investigated further with human annotation

pdf
Modeling Language Change in Historical Corpora: The Case of Portuguese
Marcos Zampieri | Shervin Malmasi | Mark Dras
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents a number of experiments to model changes in a historical Portuguese corpus composed of literary texts for the purpose of temporal text classification. Algorithms were trained to classify texts with respect to their publication date taking into account lexical variation represented as word n-grams, and morphosyntactic variation represented by part-of-speech (POS) distribution. We report results of 99.8% accuracy using word unigram features with a Support Vector Machines classifier to predict the publication date of documents in time intervals of both one century and half a century. A feature analysis is performed to investigate the most informative features for this task and how they are linked to language change.

pdf
CATaLog Online: A Web-based CAT Tool for Distributed Translation with Data Capture for APE and Translation Process Research
Santanu Pal | Sudip Kumar Naskar | Marcos Zampieri | Tapas Nayak | Josef van Genabith
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We present a free web-based CAT tool called CATaLog Online which provides a novel and user-friendly online CAT environment for post-editors/translators. The goal is to support distributed translation, reduce post-editing time and effort, improve the post-editing experience and capture data for incremental MT/APE (automatic post-editing) and translation process research. The tool supports individual as well as batch mode file translation and provides translations from three engines – translation memory (TM), MT and APE. TM suggestions are color coded to accelerate the post-editing task. The users can integrate their personal TM/MT outputs. The tool remotely monitors and records post-editing activities generating an extensive range of post-editing logs.

2015

pdf
AMBRA: A Ranking Approach to Temporal Text Classification
Marcos Zampieri | Alina Maria Ciobanu | Vlad Niculae | Liviu P. Dinu
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf
Can Translation Memories afford not to use paraphrasing ?
Rohit Gupta | Constantin Orasan | Marcos Zampieri | Mihaela Vela | Josef van Genabith
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
Searching for Context: a Study on Document-Level Labels for Translation Quality Estimation
Carolina Scarton | Marcos Zampieri | Mihaela Vela | Josef van Genabith | Lucia Specia
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
Can Translation Memories afford not to use paraphrasing?
Rohit Gupta | Constantin Orăsan | Marcos Zampieri | Mihaela Vela | Josef van Genabith
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
Searching for Context: a Study on Document-Level Labels for Translation Quality Estimation
Carolina Scarton | Marcos Zampieri | Mihaela Vela | Josef van Genabith | Lucia Specia
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek | Sudip Kumar Naskar | Santanu Pal | Marcos Zampieri | Mihaela Vela | Josef van Genabith
Proceedings of the Workshop Natural Language Processing for Translation Memories

pdf bib
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects
Preslav Nakov | Marcos Zampieri | Petya Osenova | Liling Tan | Cristina Vertan | Nikola Ljubešić | Jörg Tiedemann
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects

pdf bib
Overview of the DSL Shared Task 2015
Marcos Zampieri | Liling Tan | Nikola Ljubešić | Jörg Tiedemann | Preslav Nakov
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects

pdf
Comparing Approaches to the Identification of Similar Languages
Marcos Zampieri | Binyam Gebrekidan Gebre | Hernani Costa | Josef van Genabith
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects

2014

pdf
VarClass: An Open-source Language Identification Tool for Language Varieties
Marcos Zampieri | Binyam Gebre
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents VarClass, an open-source tool for language identification available both to be downloaded as well as through a graphical user-friendly interface. The main difference of VarClass in comparison to other state-of-the-art language identification tools is its focus on language varieties. General purpose language identification tools do not take language varieties into account and our work aims to fill this gap. VarClass currently contains language models for over 27 languages in which 10 of them are language varieties. We report an average performance of over 90.5% accuracy in a challenging dataset. More language models will be included in the upcoming months.

pdf
Quantifying the Influence of MT Output in the Translators’ Performance: A Case Study in Technical Translation
Marcos Zampieri | Mihaela Vela
Proceedings of the EACL 2014 Workshop on Humans and Computer-assisted Translation

pdf bib
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects
Marcos Zampieri | Liling Tan | Nikola Ljubešić | Jörg Tiedemann
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects

pdf
A Report on the DSL Shared Task 2014
Marcos Zampieri | Liling Tan | Nikola Ljubešić | Jörg Tiedemann
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects

pdf
Temporal Text Ranking and Automatic Dating of Texts
Vlad Niculae | Marcos Zampieri | Liviu Dinu | Alina Maria Ciobanu
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

2013

pdf
Improving Native Language Identification with TF-IDF Weighting
Binyam Gebrekidan Gebre | Marcos Zampieri | Peter Wittenburg | Tom Heskes
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

pdf
Effective Spell Checking Methods Using Clustering Algorithms
Renato Cordeiro de Amorim | Marcos Zampieri
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

pdf
N-gram Language Models and POS Distribution for the Identification of Spanish Varieties (Ngrammes et Traits Morphosyntaxiques pour la Identification de Variétés de l’Espagnol) [in French]
Marcos Zampieri | Binyam Gebrekidan Gebre | Sascha Diwersy
Proceedings of TALN 2013 (Volume 2: Short Papers)

Search
Co-authors