Proceedings of the 13th International Workshop on Semantic Evaluation

Jonathan May, Ekaterina Shutova, Aurelie Herbelot, Xiaodan Zhu, Marianna Apidianaki, Saif M. Mohammad (Editors)

Anthology ID:
Minneapolis, Minnesota, USA
Association for Computational Linguistics
Bib Export formats:

Proceedings of the 13th International Workshop on Semantic Evaluation
Jonathan May | Ekaterina Shutova | Aurelie Herbelot | Xiaodan Zhu | Marianna Apidianaki | Saif M. Mohammad

SemEval-2019 Task 1: Cross-lingual Semantic Parsing with UCCA
Daniel Hershcovich | Zohar Aizenbud | Leshem Choshen | Elior Sulem | Ari Rappoport | Omri Abend

We present the SemEval 2019 shared task on Universal Conceptual Cognitive Annotation (UCCA) parsing in English, German and French, and discuss the participating systems and results. UCCA is a cross-linguistically applicable framework for semantic representation, which builds on extensive typological work and supports rapid annotation. UCCA poses a challenge for existing parsing techniques, as it exhibits reentrancy (resulting in DAG structures), discontinuous structures and non-terminal nodes corresponding to complex semantic units. The shared task has yielded improvements over the state-of-the-art baseline in all languages and settings. Full results can be found in the task’s website

HLT@SUDA at SemEval-2019 Task 1: UCCA Graph Parsing as Constituent Tree Parsing
Wei Jiang | Zhenghua Li | Yu Zhang | Min Zhang

This paper describes a simple UCCA semantic graph parsing approach. The key idea is to convert a UCCA semantic graph into a constituent tree, in which extra labels are deliberately designed to mark remote edges and discontinuous nodes for future recovery. In this way, we can make use of existing syntactic parsing techniques. Based on the data statistics, we recover discontinuous nodes directly according to the output labels of the constituent parser and use a biaffine classification model to recover the more complex remote edges. The classification model and the constituent parser are simultaneously trained under the multi-task learning framework. We use the multilingual BERT as extra features in the open tracks. Our system ranks the first place in the six English/German closed/open tracks among seven participating systems. For the seventh cross-lingual track, where there is little training data for French, we propose a language embedding approach to utilize English and German training data, and our result ranks the second place.

SemEval-2019 Task 2: Unsupervised Lexical Frame Induction
Behrang QasemiZadeh | Miriam R. L. Petruck | Regina Stodden | Laura Kallmeyer | Marie Candito

This paper presents Unsupervised Lexical Frame Induction, Task 2 of the International Workshop on Semantic Evaluation in 2019. Given a set of prespecified syntactic forms in context, the task requires that verbs and their arguments be clustered to resemble semantic frame structures. Results are useful in identifying polysemous words, i.e., those whose frame structures are not easily distinguished, as well as discerning semantic relations of the arguments. Evaluation of unsupervised frame induction methods fell into two tracks: Task A) Verb Clustering based on FrameNet 1.7; and B) Argument Clustering, with B.1) based on FrameNet’s core frame elements, and B.2) on VerbNet 3.2 semantic roles. The shared task attracted nine teams, of whom three reported promising results. This paper describes the task and its data, reports on methods and resources that these systems used, and offers a comparison to human annotation.

Neural GRANNy at SemEval-2019 Task 2: A combined approach for better modeling of semantic relationships in semantic frame induction
Nikolay Arefyev | Boris Sheludko | Adis Davletov | Dmitry Kharchev | Alex Nevidomsky | Alexander Panchenko

We describe our solutions for semantic frame and role induction subtasks of SemEval 2019 Task 2. Our approaches got the highest scores, and the solution for the frame induction problem officially took the first place. The main contributions of this paper are related to the semantic frame induction problem. We propose a combined approach that employs two different types of vector representations: dense representations from hidden layers of a masked language model, and sparse representations based on substitutes for the target word in the context. The first one better groups synonyms, the second one is better at disambiguating homonyms. Extending the context to include nearby sentences improves the results in both cases. New Hearst-like patterns for verbs are introduced that prove to be effective for frame induction. Finally, we propose an approach to selecting the number of clusters in agglomerative clustering.

SemEval-2019 Task 3: EmoContext Contextual Emotion Detection in Text
Ankush Chatterjee | Kedhar Nath Narahari | Meghana Joshi | Puneet Agrawal

In this paper, we present the SemEval-2019 Task 3 - EmoContext: Contextual Emotion Detection in Text. Lack of facial expressions and voice modulations make detecting emotions in text a challenging problem. For instance, as humans, on reading “Why don’t you ever text me!” we can either interpret it as a sad or angry emotion and the same ambiguity exists for machines. However, the context of dialogue can prove helpful in detection of the emotion. In this task, given a textual dialogue i.e. an utterance along with two previous turns of context, the goal was to infer the underlying emotion of the utterance by choosing from four emotion classes - Happy, Sad, Angry and Others. To facilitate the participation in this task, textual dialogues from user interaction with a conversational agent were taken and annotated for emotion classes after several data processing steps. A training data set of 30160 dialogues, and two evaluation data sets, Test1 and Test2, containing 2755 and 5509 dialogues respectively were released to the participants. A total of 311 teams made submissions to this task. The final leader-board was evaluated on Test2 data set, and the highest ranked submission achieved 79.59 micro-averaged F1 score. Our analysis of systems submitted to the task indicate that Bi-directional LSTM was the most common choice of neural architecture used, and most of the systems had the best performance for the Sad emotion class, and the worst for the Happy emotion class.

ANA at SemEval-2019 Task 3: Contextual Emotion detection in Conversations through hierarchical LSTMs and BERT
Chenyang Huang | Amine Trabelsi | Osmar Zaïane

This paper describes the system submitted by ANA Team for the SemEval-2019 Task 3: EmoContext. We propose a novel Hierarchi- cal LSTMs for Contextual Emotion Detection (HRLCE) model. It classifies the emotion of an utterance given its conversational con- text. The results show that, in this task, our HRCLE outperforms the most recent state-of- the-art text classification framework: BERT. We combine the results generated by BERT and HRCLE to achieve an overall score of 0.7709 which ranked 5th on the final leader board of the competition among 165 Teams.

SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter
Valerio Basile | Cristina Bosco | Elisabetta Fersini | Debora Nozza | Viviana Patti | Francisco Manuel Rangel Pardo | Paolo Rosso | Manuela Sanguinetti

The paper describes the organization of the SemEval 2019 Task 5 about the detection of hate speech against immigrants and women in Spanish and English messages extracted from Twitter. The task is organized in two related classification subtasks: a main binary subtask for detecting the presence of hate speech, and a finer-grained one devoted to identifying further features in hateful contents such as the aggressive attitude and the target harassed, to distinguish if the incitement is against an individual rather than a group. HatEval has been one of the most popular tasks in SemEval-2019 with a total of 108 submitted runs for Subtask A and 70 runs for Subtask B, from a total of 74 different teams. Data provided for the task are described by showing how they have been collected and annotated. Moreover, the paper provides an analysis and discussion about the participant systems and the results they achieved in both subtasks.

Atalaya at SemEval 2019 Task 5: Robust Embeddings for Tweet Classification
Juan Manuel Pérez | Franco M. Luque

In this article, we describe our participation in HatEval, a shared task aimed at the detection of hate speech against immigrants and women. We focused on Spanish subtasks, building from our previous experiences on sentiment analysis in this language. We trained linear classifiers and Recurrent Neural Networks, using classic features, such as bag-of-words, bag-of-characters, and word embeddings, and also with recent techniques such as contextualized word representations. In particular, we trained robust task-oriented subword-aware embeddings and computed tweet representations using a weighted-averaging strategy. In the final evaluation, our systems showed competitive results for both Spanish subtasks ES-A and ES-B, achieving the first and fourth places respectively.

FERMI at SemEval-2019 Task 5: Using Sentence embeddings to Identify Hate Speech Against Immigrants and Women in Twitter
Vijayasaradhi Indurthi | Bakhtiyar Syed | Manish Shrivastava | Nikhil Chakravartula | Manish Gupta | Vasudeva Varma

This paper describes our system (Fermi) for Task 5 of SemEval-2019: HatEval: Multilingual Detection of Hate Speech Against Immigrants and Women on Twitter. We participated in the subtask A for English and ranked first in the evaluation on the test set. We evaluate the quality of multiple sentence embeddings and explore multiple training models to evaluate the performance of simple yet effective embedding-ML combination algorithms. Our team - Fermi’s model achieved an accuracy of 65.00% for English language in task A. Our models, which use pretrained Universal Encoder sentence embeddings for transforming the input and SVM (with RBF kernel) for classification, scored first position (among 68) in the leaderboard on the test set for Subtask A in English language. In this paper we provide a detailed description of the approach, as well as the results obtained in the task.

SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)
Marcos Zampieri | Shervin Malmasi | Preslav Nakov | Sara Rosenthal | Noura Farra | Ritesh Kumar

We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval). The task was based on a new dataset, the Offensive Language Identification Dataset (OLID), which contains over 14,000 English tweets, and it featured three sub-tasks. In sub-task A, systems were asked to discriminate between offensive and non-offensive posts. In sub-task B, systems had to identify the type of offensive content in the post. Finally, in sub-task C, systems had to detect the target of the offensive posts. OffensEval attracted a large number of participants and it was one of the most popular tasks in SemEval-2019. In total, nearly 800 teams signed up to participate in the task and 115 of them submitted results, which are presented and analyzed in this report.

NULI at SemEval-2019 Task 6: Transfer Learning for Offensive Language Detection using Bidirectional Transformers
Ping Liu | Wen Li | Liang Zou

Transfer learning and domain adaptive learning have been applied to various fields including computer vision (e.g., image recognition) and natural language processing (e.g., text classification). One of the benefits of transfer learning is to learn effectively and efficiently from limited labeled data with a pre-trained model. In the shared task of identifying and categorizing offensive language in social media, we preprocess the dataset according to the language behaviors on social media, and then adapt and fine-tune the Bidirectional Encoder Representation from Transformer (BERT) pre-trained by Google AI Language team. Our team NULI wins the first place (1st) in Sub-task A - Offensive Language Identification and is ranked 4th and 18th in Sub-task B - Automatic Categorization of Offense Types and Sub-task C - Offense Target Identification respectively.

CUNY-PKU Parser at SemEval-2019 Task 1: Cross-Lingual Semantic Parsing with UCCA
Weimin Lyu | Sheng Huang | Abdul Rafae Khan | Shengqiang Zhang | Weiwei Sun | Jia Xu

This paper describes the systems of the CUNY-PKU team in SemEval 2019 Task 1: Cross-lingual Semantic Parsing with UCCA. We introduce a novel model by applying a cascaded MLP and BiLSTM model. Then, we ensemble multiple system-outputs by reparsing. In particular, we introduce a new decoding algorithm for building the UCCA representation. Our system won the first place in one track (French-20K-Open), second places in four tracks (English-Wiki-Open, English-20K-Open, German-20K-Open, and German-20K-Closed), and third place in one track (English-20K-Closed), among all seven tracks.

DANGNT@UIT.VNU-HCM at SemEval 2019 Task 1: Graph Transformation System from Stanford Basic Dependencies to Universal Conceptual Cognitive Annotation (UCCA)
Dang Tuan Nguyen | Trung Tran

This paper describes the graph transfor-mation system (GT System) for SemEval 2019 Task 1: Cross-lingual Semantic Parsing with Universal Conceptual Cognitive Annotation (UCCA)1. The input of GT System is a pair of text and its unannotated xml, which is a layer 0 part of UCCA form. The output of GT System is the corresponding full UCCA xml. Based on the idea of graph illustration and transformation, we perform four main tasks when building GT System. At the first task, we illustrate the graph form of stanford dependencies2 of input text. We then transform into an intermediate graph in the second task. At the third task, we continue to transform into ouput graph form. Finally, we create the output UCCA xml. The evaluation results show that our method generates good-quality UCCA xml and has a meaningful contribution to the semantic represetation sub-field in Natural Language Processing.

GCN-Sem at SemEval-2019 Task 1: Semantic Parsing using Graph Convolutional and Recurrent Neural Networks
Shiva Taslimipoor | Omid Rohanian | Sara Može

This paper describes the system submitted to the SemEval 2019 shared task 1 ‘Cross-lingual Semantic Parsing with UCCA’. We rely on the semantic dependency parse trees provided in the shared task which are converted from the original UCCA files and model the task as tagging. The aim is to predict the graph structure of the output along with the types of relations among the nodes. Our proposed neural architecture is composed of Graph Convolution and BiLSTM components. The layers of the system share their weights while predicting dependency links and semantic labels. The system is applied to the CONLLU format of the input data and is best suited for semantic dependency parsing.

MaskParse@Deskin at SemEval-2019 Task 1: Cross-lingual UCCA Semantic Parsing using Recursive Masked Sequence Tagging
Gabriel Marzinotto | Johannes Heinecke | Géraldine Damnati

This paper describes our recursive system for SemEval-2019 Task 1: Cross-lingual Semantic Parsing with UCCA. Each recursive step consists of two parts. We first perform semantic parsing using a sequence tagger to estimate the probabilities of the UCCA categories in the sentence. Then, we apply a decoding policy which interprets these probabilities and builds the graph nodes. Parsing is done recursively, we perform a first inference on the sentence to extract the main scenes and links and then we recursively apply our model on the sentence using a masking features that reflects the decisions made in previous steps. Process continues until the terminal nodes are reached. We chose a standard neural tagger and we focus on our recursive parsing strategy and on the cross lingual transfer problem to develop a robust model for the French language, using only few training samples

Tüpa at SemEval-2019 Task1: (Almost) feature-free Semantic Parsing
Tobias Pütz | Kevin Glocker

Our submission for Task 1 ‘Cross-lingual Semantic Parsing with UCCA’ at SemEval-2018 is a feed-forward neural network that builds upon an existing state-of-the-art transition-based directed acyclic graph parser. We replace most of its features by deep contextualized word embeddings and introduce an approximation to represent non-terminal nodes in the graph as an aggregation of their terminal children. We further demonstrate how augmenting data using the baseline systems provides a consistent advantage in all open submission tracks. We submitted results to all open tracks (English, in- and out-of-domain, German in-domain and French in-domain, low-resource). Our system achieves competitive performance in all settings besides the French, where we did not augment the data. Post-evaluation experiments showed that data augmentation is especially crucial in this setting.

UC Davis at SemEval-2019 Task 1: DAG Semantic Parsing with Attention-based Decoder
Dian Yu | Kenji Sagae

We present an encoder-decoder model for semantic parsing with UCCA SemEval 2019 Task 1. The encoder is a Bi-LSTM and the decoder uses recursive self-attention. The proposed model alleviates challenges and feature engineering in traditional transition-based and graph-based parsers. The resulting parser is simple and proved to effective on the semantic parsing task.

HHMM at SemEval-2019 Task 2: Unsupervised Frame Induction using Contextualized Word Embeddings
Saba Anwar | Dmitry Ustalov | Nikolay Arefyev | Simone Paolo Ponzetto | Chris Biemann | Alexander Panchenko

We present our system for semantic frame induction that showed the best performance in Subtask B.1 and finished as the runner-up in Subtask A of the SemEval 2019 Task 2 on unsupervised semantic frame induction (Qasem-iZadeh et al., 2019). Our approach separates this task into two independent steps: verb clustering using word and their context embeddings and role labeling by combining these embeddings with syntactical features. A simple combination of these steps shows very competitive results and can be extended to process other datasets and languages.

L2F/INESC-ID at SemEval-2019 Task 2: Unsupervised Lexical Semantic Frame Induction using Contextualized Word Representations
Eugénio Ribeiro | Vânia Mendonça | Ricardo Ribeiro | David Martins de Matos | Alberto Sardinha | Ana Lúcia Santos | Luísa Coheur

Building large datasets annotated with semantic information, such as FrameNet, is an expensive process. Consequently, such resources are unavailable for many languages and specific domains. This problem can be alleviated by using unsupervised approaches to induce the frames evoked by a collection of documents. That is the objective of the second task of SemEval 2019, which comprises three subtasks: clustering of verbs that evoke the same frame and clustering of arguments into both frame-specific slots and semantic roles. We approach all the subtasks by applying a graph clustering algorithm on contextualized embedding representations of the verbs and arguments. Using such representations is appropriate in the context of this task, since they provide cues for word-sense disambiguation. Thus, they can be used to identify different frames evoked by the same words. Using this approach we were able to outperform all of the baselines reported for the task on the test set in terms of Purity F1, as well as in terms of BCubed F1 in most cases.

BrainEE at SemEval-2019 Task 3: Ensembling Linear Classifiers for Emotion Prediction
Vachagan Gratian

The paper describes an ensemble of linear perceptrons trained for emotion classification as part of the SemEval-2019 shared-task 3. The model uses a matrix of probabilities to weight the activations of the base-classifiers and makes a final prediction using the sum rule. The base-classifiers are multi-class perceptrons utilizing character and word n-grams, part-of-speech tags and sentiment polarity scores. The results of our experiments indicate that the ensemble outperforms the base-classifiers, but only marginally. In the best scenario our model attains an F-Micro score of 0.672, whereas the base-classifiers attained scores ranging from 0.636 to 0.666.

CAiRE_HKUST at SemEval-2019 Task 3: Hierarchical Attention for Dialogue Emotion Classification
Genta Indra Winata | Andrea Madotto | Zhaojiang Lin | Jamin Shin | Yan Xu | Peng Xu | Pascale Fung

Detecting emotion from dialogue is a challenge that has not yet been extensively surveyed. One could consider the emotion of each dialogue turn to be independent, but in this paper, we introduce a hierarchical approach to classify emotion, hypothesizing that the current emotional state depends on previous latent emotions. We benchmark several feature-based classifiers using pre-trained word and emotion embeddings, state-of-the-art end-to-end neural network models, and Gaussian processes for automatic hyper-parameter search. In our experiments, hierarchical architectures consistently give significant improvements, and our best model achieves a 76.77% F1-score on the test set.

CECL at SemEval-2019 Task 3: Using Surface Learning for Detecting Emotion in Textual Conversations
Yves Bestgen

This paper describes the system developed by the Centre for English Corpus Linguistics for the SemEval-2019 Task 3: EmoContext. It aimed at classifying the emotion of a user utterance in a textual conversation as happy, sad, angry or other. It is based on a large number of feature types, mainly unigrams and bigrams, which were extracted by a SAS program. The usefulness of the different feature types was evaluated by means of Monte-Carlo resampling tests. As this system does not rest on any deep learning component, which is currently considered as the state-of-the-art approach, it can be seen as a possible point of comparison for such kind of systems.

CLaC Lab at SemEval-2019 Task 3: Contextual Emotion Detection Using a Combination of Neural Networks and SVM
Elham Mohammadi | Hessam Amini | Leila Kosseim

This paper describes our system at SemEval 2019, Task 3 (EmoContext), which focused on the contextual detection of emotions in a dataset of 3-round dialogues. For our final system, we used a neural network with pretrained ELMo word embeddings and POS tags as input, GRUs as hidden units, an attention mechanism to capture representations of the dialogues, and an SVM classifier which used the learned network representations to perform the task of multi-class classification.This system yielded a micro-averaged F1 score of 0.7072 for the three emotion classes, improving the baseline by approximately 12%.

CLARK at SemEval-2019 Task 3: Exploring the Role of Context to Identify Emotion in a Short Conversation
Joseph Cummings | Jason Wilson

With text lacking valuable information avail-able in other modalities, context may provide useful information to better detect emotions. In this paper, we do a systematic exploration of the role of context in recognizing emotion in a conversation. We use a Naive Bayes model to show that inferring the mood of the conversation before classifying individual utterances leads to better performance. Additionally, we find that using context while train-ing the model significantly decreases performance. Our approach has the additional bene-fit that its performance rivals a baseline LSTM model while requiring fewer resources.

CLP at SemEval-2019 Task 3: Multi-Encoder in Hierarchical Attention Networks for Contextual Emotion Detection
Changjie Li | Yun Xing

In this paper, we describe the participation of team ”CLP” in SemEval-2019 Task 3 “Con- textual Emotion Detection in Text” that aims to classify emotion of user utterance in tex- tual conversation. The submitted system is a deep learning architecture based on Hier- archical Attention Networks (HAN) and Em- bedding from Language Model (ELMo). The core of the architecture contains two represen- tation layers. The first one combines the out- puts of ELMo, hand-craft features and Bidi- rectional Long Short-Term Memory with At- tention (Bi-LSTM-Attention) to represent user utterance. The second layer use a Bi-LSTM- Attention encoder to represent the conversa- tion. Our system achieved F1 score of 0.7524 which outperformed the baseline model of the organizers by 0.1656.

CoAStaL at SemEval-2019 Task 3: Affect Classification in Dialogue using Attentive BiLSTMs
Ana Valeria González | Victor Petrén Bach Hansen | Joachim Bingel | Anders Søgaard

This work describes the system presented by the CoAStaL Natural Language Processing group at University of Copenhagen. The main system we present uses the same attention mechanism presented in (Yang et al., 2016). Our overall model architecture is also inspired by their hierarchical classification model and adapted to deal with classification in dialogue by encoding information at the turn level. We use different encodings for each turn to create a more expressive representation of dialogue context which is then fed into our classifier.We also define a custom preprocessing step in order to deal with language commonly used in interactions across many social media outlets. Our proposed system achieves a micro F1 score of 0.7340 on the test set and shows significant gains in performance compared to a system using dialogue level encoding.

ConSSED at SemEval-2019 Task 3: Configurable Semantic and Sentiment Emotion Detector
Rafał Poświata

This paper describes our system participating in the SemEval-2019 Task 3: EmoContext: Contextual Emotion Detection in Text. The goal was to for a given textual dialogue, i.e. a user utterance along with two turns of context, identify the emotion of user utterance as one of the emotion classes: Happy, Sad, Angry or Others. Our system: ConSSED is a configurable combination of semantic and sentiment neural models. The official task submission achieved a micro-average F1 score of 75.31 which placed us 16th out of 165 participating systems.

CX-ST-RNM at SemEval-2019 Task 3: Fusion of Recurrent Neural Networks Based on Contextualized and Static Word Representations for Contextual Emotion Detection
Michał Perełkiewicz

In this paper, I describe a fusion model combining contextualized and static word representations for approaching the EmoContext task in the SemEval 2019 competition. The model is based on two Recurrent Neural Networks, the first one is fed with a state-of-the-art ELMo deep contextualized word representation and the second one is fed with a static Word2Vec embedding augmented with 10-dimensional affective word feature vector. The proposed model is compared with two baseline models based on a static word representation and a contextualized word representation, separately. My approach achieved officially 0.7278 microaveraged F1 score on the test dataset, ranking 47th out of 165 participants.

ParallelDots at SemEval-2019 Task 3: Domain Adaptation with feature embeddings for Contextual Emotion Analysis
Akansha Jain | Ishita Aggarwal | Ankit Singh

This paper describes our proposed system & experiments performed to detect contextual emotion in texts for SemEval 2019 Task 3. We exploit sentiment information, syntactic patterns & semantic relatedness to capture diverse aspects of the text. Word level embeddings such as Glove, FastText, Emoji along with sentence level embeddings like Skip-Thought, DeepMoji & Unsupervised Sentiment Neuron were used as input features to our architecture. We democratize the learning using ensembling of models with different parameters to produce the final output. This paper discusses comparative analysis of the significance of these embeddings and our approach for the task.

E-LSTM at SemEval-2019 Task 3: Semantic and Sentimental Features Retention for Emotion Detection in Text
Harsh Patel

This paper discusses the solution to the problem statement of the SemEval19: EmoContext competition which is ”Contextual Emotion Detection in Texts”. The paper includes the explanation of an architecture that I created by exploiting the embedding layers of Word2Vec and GloVe using LSTMs as memory unit cells which detects approximate emotion of chats between two people in the English language provided in the textual form. The set of emotions on which the model was trained was Happy, Sad, Angry and Others. The paper also includes an analysis of different conventional machine learning algorithms in comparison to E-LSTM.

ELiRF-UPV at SemEval-2019 Task 3: Snapshot Ensemble of Hierarchical Convolutional Neural Networks for Contextual Emotion Detection
José-Ángel González | Lluís-F. Hurtado | Ferran Pla

This paper describes the approach developed by the ELiRF-UPV team at SemEval 2019 Task 3: Contextual Emotion Detection in Text. We have developed a Snapshot Ensemble of 1D Hierarchical Convolutional Neural Networks to extract features from 3-turn conversations in order to perform contextual emotion detection in text. This Snapshot Ensemble is obtained by averaging the models selected by a Genetic Algorithm that optimizes the evaluation measure. The proposed ensemble obtains better results than a single model and it obtains competitive and promising results on Contextual Emotion Detection in Text.

EmoDet at SemEval-2019 Task 3: Emotion Detection in Text using Deep Learning
Hani Al-Omari | Malak Abdullah | Nabeel Bassam

Task 3, EmoContext, in the International Workshop SemEval 2019 provides training and testing datasets for the participant teams to detect emotion classes (Happy, Sad, Angry, or Others). This paper proposes a participating system (EmoDet) to detect emotions using deep learning architecture. The main input to the system is a combination of Word2Vec word embeddings and a set of semantic features (e.g. from AffectiveTweets Weka-package). The proposed system (EmoDet) ensembles a fully connected neural network architecture and LSTM neural network to obtain performance results that show substantial improvements (F1-Score 0.67) over the baseline model provided by Task 3 organizers (F1-score 0.58).

EMOMINER at SemEval-2019 Task 3: A Stacked BiLSTM Architecture for Contextual Emotion Detection in Text
Nikhil Chakravartula | Vijayasaradhi Indurthi

This paper describes our participation in the SemEval 2019 Task 3 - Contextual Emotion Detection in Text. This task aims to identify emotions, viz. happiness, anger, sadness in the context of a text conversation. Our system is a stacked Bidirectional LSTM, equipped with attention on top of word embeddings pre-trained on a large collection of Twitter data. In this paper, apart from describing our official submission, we elucidate how different deep learning models respond to this task.

EmoSense at SemEval-2019 Task 3: Bidirectional LSTM Network for Contextual Emotion Detection in Textual Conversations
Sergey Smetanin

In this paper, we describe a deep-learning system for emotion detection in textual conversations that participated in SemEval-2019 Task 3 “EmoContext”. We designed a specific architecture of bidirectional LSTM which allows not only to learn semantic and sentiment feature representation, but also to capture user-specific conversation features. To fine-tune word embeddings using distant supervision we additionally collected a significant amount of emotional texts. The system achieved 72.59% micro-average F1 score for emotion classes on the test dataset, thereby significantly outperforming the officially-released baseline. Word embeddings and the source code were released for the research community.

EPITA-ADAPT at SemEval-2019 Task 3: Detecting emotions in textual conversations using deep learning models combination
Abdessalam Bouchekif | Praveen Joshi | Latifa Bouchekif | Haithem Afli

Messaging platforms like WhatsApp, Facebook Messenger and Twitter have gained recently much popularity owing to their ability in connecting users in real-time. The content of these textual messages can be a useful resource for text mining to discover and unhide various aspects, including emotions. In this paper we present our submission for SemEval 2019 task ‘EmoContext’. The task consists of classifying a given textual dialogue into one of four emotion classes: Angry, Happy, Sad and Others. Our proposed system is based on the combination of different deep neural networks techniques. In particular, we use Recurrent Neural Networks (LSTM, B-LSTM, GRU, B-GRU), Convolutional Neural Network (CNN) and Transfer Learning (TL) methodes. Our final system, achieves an F1 score of 74.51% on the subtask evaluation dataset.

Figure Eight at SemEval-2019 Task 3: Ensemble of Transfer Learning Methods for Contextual Emotion Detection
Joan Xiao

This paper describes our transfer learning-based approach to contextual emotion detection as part of SemEval-2019 Task 3. We experiment with transfer learning using pre-trained language models (ULMFiT, OpenAI GPT, and BERT) and fine-tune them on this task. We also train a deep learning model from scratch using pre-trained word embeddings and BiLSTM architecture with attention mechanism. The ensembled model achieves competitive result, ranking ninth out of 165 teams. The result reveals that ULMFiT performs best due to its superior fine-tuning techniques. We propose improvements for future work.

GenSMT at SemEval-2019 Task 3: Contextual Emotion Detection in tweets using multi task generic approach
Dumitru Bogdan

In this paper, we describe our participation in SemEval-2019 Task 3: EmoContext - A Shared Task on Contextual Emotion Detection in Text. We propose a three layer model with a generic, multi-purpose approach that without any task specific optimizations achieve competitive results (f1 score of 0.7096) in the EmoContext task. We describe our development strategy in detail along with an exposition of our results.

GWU NLP Lab at SemEval-2019 Task 3 : EmoContext: Effectiveness ofContextual Information in Models for Emotion Detection inSentence-level at Multi-genre Corpus
Shabnam Tafreshi | Mona Diab

In this paper we present an emotion classifier models that submitted to the SemEval-2019 Task 3 : EmoContext. Our approach is a Gated Recurrent Neural Network (GRU) model with attention layer is bootstrapped with contextual information and trained with a multigenre corpus, which is combination of several popular emotional data sets. We utilize different word embeddings to empirically select the most suited embedding to represent our features. Our aim is to build a robust emotion classifier that can generalize emotion detection, which is to learn emotion cues in a noisy training environment. To fulfill this aim we train our model with a multigenre emotion corpus, this way we leverage from having more training set. We achieved overall %56.05 f1-score and placed 144. Given our aim and noisy training environment, the results are anticipated.

IIT Gandhinagar at SemEval-2019 Task 3: Contextual Emotion Detection Using Deep Learning
Arik Pamnani | Rajat Goel | Jayesh Choudhari | Mayank Singh

Recent advancements in Internet and Mobile infrastructure have resulted in the development of faster and efficient platforms of communication. These platforms include speech, facial and text-based conversational mediums. Majority of these are text-based messaging platforms. Development of Chatbots that automatically understand latent emotions in the textual message is a challenging task. In this paper, we present an automatic emotion detection system that aims to detect the emotion of a person textually conversing with a chatbot. We explore deep learning techniques such as CNN and LSTM based neural networks and outperformed the baseline score by 14%. The trained model and code are kept in public domain.

KGPChamps at SemEval-2019 Task 3: A deep learning approach to detect emotions in the dialog utterances.
Jasabanta Patro | Nitin Choudhary | Kalpit Chittora | Animesh Mukherjee

This paper describes our approach to solve Semeval task 3: EmoContext; where, given a textual dialogue i.e. a user utterance along with two turns of context, we have to classify the emotion associated with the utterance as one of the following emotion classes: Happy, Sad, Angry or Others. To solve this problem, we experiment with different deep learning models ranging from simple bidirectional LSTM (Long and short term memory) model to comparatively complex attention model. We also experiment with word embedding conceptnet along with word embedding generated from bi-directional LSTM taking input characters. We fine-tune different parameters and hyper-parameters associated with each of our models and report the value of evaluating measure i.e. micro precision along with class wise precision, recall and F1-score of each system. We report the bidirectional LSTM model, along with the input word embedding as the concatenation of word embedding generated from bidirectional LSTM for word characters and conceptnet embedding, as the best performing model with a highest micro-F1 score of 0.7261. We also report class wise precision, recall, and f1-score of best performing model along with other models that we have experimented with.

KSU at SemEval-2019 Task 3: Hybrid Features for Emotion Recognition in Textual Conversation
Nourah Alswaidan | Mohamed El Bachir Menai

We proposed a model to address emotion recognition in textual conversation based on using automatically extracted features and human engineered features. The proposed model utilizes a fast gated-recurrent-unit backed by CuDNN, and a convolutional neural network to automatically extract features. The human engineered features take the frequency-inverse document frequency of semantic meaning and mood tags extracted from SinticNet.

LIRMM-Advanse at SemEval-2019 Task 3: Attentive Conversation Modeling for Emotion Detection and Classification
Waleed Ragheb | Jérôme Azé | Sandra Bringay | Maximilien Servajean

This paper addresses the problem of modeling textual conversations and detecting emotions. Our proposed model makes use of 1) deep transfer learning rather than the classical shallow methods of word embedding; 2) self-attention mechanisms to focus on the most important parts of the texts and 3) turn-based conversational modeling for classifying the emotions. The approach does not rely on any hand-crafted features or lexicons. Our model was evaluated on the data provided by the SemEval-2019 shared task on contextual emotion detection in text. The model shows very competitive results.

MILAB at SemEval-2019 Task 3: Multi-View Turn-by-Turn Model for Context-Aware Sentiment Analysis
Yoonhyung Lee | Yanghoon Kim | Kyomin Jung

This paper describes our system for SemEval-2019 Task 3: EmoContext, which aims to predict the emotion of the third utterance considering two preceding utterances in a dialogue. To address this challenge of predicting the emotion considering its context, we propose a Multi-View Turn-by-Turn (MVTT) model. Firstly, MVTT model generates vectors from each utterance using two encoders: word-level Bi-GRU encoder (WLE) and character-level CNN encoder (CLE). Then, MVTT grasps contextual information by combining the vectors and predict the emotion with the contextual information. We conduct experiments on the effect of vector encoding and vector combination. Our final MVTT model achieved 0.7634 microaveraged F1 score.

MoonGrad at SemEval-2019 Task 3: Ensemble BiRNNs for Contextual Emotion Detection in Dialogues
Chandrakant Bothe | Stefan Wermter

When reading “I don’t want to talk to you any more”, we might interpret this as either an angry or a sad emotion in the absence of context. Often, the utterances are shorter, and given a short utterance like “Me too!”, it is difficult to interpret the emotion without context. The lack of prosodic or visual information makes it a challenging problem to detect such emotions only with text. However, using contextual information in the dialogue is gaining importance to provide a context-aware recognition of linguistic features such as emotion, dialogue act, sentiment etc. The SemEval 2019 Task 3 EmoContext competition provides a dataset of three-turn dialogues labeled with the three emotion classes, i.e. Happy, Sad and Angry, and in addition with Others as none of the aforementioned emotion classes. We develop an ensemble of the recurrent neural model with character- and word-level features as an input to solve this problem. The system performs quite well, achieving a microaveraged F1 score (F1μ) of 0.7212 for the three emotion classes.

NELEC at SemEval-2019 Task 3: Think Twice Before Going Deep
Parag Agrawal | Anshuman Suri

Existing Machine Learning techniques yield close to human performance on text-based classification tasks. However, the presence of multi-modal noise in chat data such as emoticons, slang, spelling mistakes, code-mixed data, etc. makes existing deep-learning solutions perform poorly. The inability of deep-learning systems to robustly capture these covariates puts a cap on their performance. We propose NELEC: Neural and Lexical Combiner, a system which elegantly combines textual and deep-learning based methods for sentiment classification. We evaluate our system as part of the third task of ‘Contextual Emotion Detection in Text’ as part of SemEval-2019. Our system performs significantly better than the baseline, as well as our deep-learning model benchmarks. It achieved a micro-averaged F1 score of 0.7765, ranking 3rd on the test-set leader-board. Our code is available at

NL-FIIT at SemEval-2019 Task 3: Emotion Detection From Conversational Triplets Using Hierarchical Encoders
Michal Farkas | Peter Lacko

In this paper, we present our system submission for the EmoContext, the third task of the SemEval 2019 workshop. Our solution is a hierarchical recurrent neural network with ELMo embeddings and regularization through dropout and Gaussian noise. We have mainly experimented with two main model architectures: simple and hierarchical LSTM network. We have also examined ensembling of the models and various variants of an ensemble. We have achieved microF1 score of 0.7481, which is significantly higher than baseline and currently the 19th best submission.

NTUA-ISLab at SemEval-2019 Task 3: Determining emotions in contextual conversations with deep learning
Rolandos Alexandros Potamias | Georgios Siolas

Sentiment analysis (SA) in texts is a well-studied Natural Language Processing task, which in nowadays gains popularity due to the explosion of social media, and the subsequent accumulation of huge amounts of related data. However, capturing emotional states and the sentiment polarity of written excerpts requires knowledge on the events triggering them. Towards this goal, we present a computational end-to-end context-aware SA methodology, which was competed in the context of the SemEval-2019 / EmoContext task (Task 3). The proposed system is founded on the combination of two neural architectures, a deep recurrent neural network, structured by an attentive Bidirectional LSTM, and a deep dense network (DNN). The system achieved 0.745 micro f1-score, and ranked 26/165 (top 20%) teams among the official task submissions.

ntuer at SemEval-2019 Task 3: Emotion Classification with Word and Sentence Representations in RCNN
Peixiang Zhong | Chunyan Miao

In this paper we present our model on the task of emotion detection in textual conversations in SemEval-2019. Our model extends the Recurrent Convolutional Neural Network (RCNN) by using external fine-tuned word representations and DeepMoji sentence representations. We also explored several other competitive pre-trained word and sentence representations including ELMo, BERT and InferSent but found inferior performance. In addition, we conducted extensive sensitivity analysis, which empirically shows that our model is relatively robust to hyper-parameters. Our model requires no handcrafted features or emotion lexicons but achieved good performance with a micro-F1 score of 0.7463.

PKUSE at SemEval-2019 Task 3: Emotion Detection with Emotion-Oriented Neural Attention Network
Luyao Ma | Long Zhang | Wei Ye | Wenhui Hu

This paper presents the system in SemEval-2019 Task 3, “EmoContext: Contextual Emotion Detection in Text”. We propose a deep learning architecture with bidirectional LSTM networks, augmented with an emotion-oriented attention network that is capable of extracting emotion information from an utterance. Experimental results show that our model outperforms its variants and the baseline. Overall, this system has achieved 75.57% for the microaveraged F1 score.

Podlab at SemEval-2019 Task 3: The Importance of Being Shallow
Andrew Nguyen | Tobin South | Nigel Bean | Jonathan Tuke | Lewis Mitchell

This paper describes our linear SVM system for emotion classification from conversational dialogue, entered in SemEval2019 Task 3. We used off-the-shelf tools coupled with feature engineering and parameter tuning to create a simple, interpretable, yet high-performing, classification model. Our system achieves a micro F1 score of 0.7357, which is 92% of the top score for the competition, demonstrating that “shallow” classification approaches can perform well when coupled with detailed fea- ture selection and statistical analysis.

SCIA at SemEval-2019 Task 3: Sentiment Analysis in Textual Conversations Using Deep Learning
Zinedine Rebiai | Simon Andersen | Antoine Debrenne | Victor Lafargue

In this paper we present our submission for SemEval-2019 Task 3: EmoContext. The task consisted of classifying a textual dialogue into one of four emotion classes: happy, sad, angry or others. Our approach tried to improve on multiple aspects, preprocessing with an emphasis on spell-checking and ensembling with four different models: Bi-directional contextual LSTM (BC-LSTM), categorical Bi-LSTM (CAT-LSTM), binary convolutional Bi-LSTM (BIN-LSTM) and Gated Recurrent Unit (GRU). On the leader-board, we submitted two systems that obtained a micro F1 score (F1μ) of 0.711 and 0.712. After the competition, we merged our two systems with ensembling, which achieved a F1μ of 0.7324 on the test dataset.

Sentim at SemEval-2019 Task 3: Convolutional Neural Networks For Sentiment in Conversations
Jacob Anderson

In this work convolutional neural networks were used in order to determine the sentiment in a conversational setting. This paper’s contributions include a method for handling any sized input and a method for breaking down the conversation into separate parts for easier processing. Finally, clustering was shown to improve results and that such a model for handling sentiment in conversations is both fast and accurate.

SINAI at SemEval-2019 Task 3: Using affective features for emotion classification in textual conversations
Flor Miriam Plaza-del-Arco | M. Dolores Molina-González | Maite Martin | L. Alfonso Ureña-López

Detecting emotions in textual conversation is a challenging problem in absence of nonverbal cues typically associated with emotion, like fa- cial expression or voice modulations. How- ever, more and more users are using message platforms such as WhatsApp or Telegram. For this reason, it is important to develop systems capable of understanding human emotions in textual conversations. In this paper, we carried out different systems to analyze the emotions of textual dialogue from SemEval-2019 Task 3: EmoContext for English language. Our main contribution is the integration of emotional and sentimental features in the classification using the SVM algorithm.

SNU IDS at SemEval-2019 Task 3: Addressing Training-Test Class Distribution Mismatch in Conversational Classification
Sanghwan Bae | Jihun Choi | Sang-goo Lee

We present several techniques to tackle the mismatch in class distributions between training and test data in the Contextual Emotion Detection task of SemEval 2019, by extending the existing methods for class imbalance problem. Reducing the distance between the distribution of prediction and ground truth, they consistently show positive effects on the performance. Also we propose a novel neural architecture which utilizes representation of overall context as well as of each utterance. The combination of the methods and the models achieved micro F1 score of about 0.766 on the final evaluation.

SSN_NLP at SemEval-2019 Task 3: Contextual Emotion Identification from Textual Conversation using Seq2Seq Deep Neural Network
Senthil Kumar B. | Thenmozhi D. | Aravindan Chandrabose | Srinethe Sharavanan

Emotion identification is a process of identifying the emotions automatically from text, speech or images. Emotion identification from textual conversations is a challenging problem due to absence of gestures, vocal intonation and facial expressions. It enables conversational agents, chat bots and messengers to detect and report the emotions to the user instantly for a healthy conversation by avoiding emotional cues and miscommunications. We have adopted a Seq2Seq deep neural network to identify the emotions present in the text sequences. Several layers namely embedding layer, encoding-decoding layer, softmax layer and a loss layer are used to map the sequences from textual conversations to the emotions namely Angry, Happy, Sad and Others. We have evaluated our approach on the EmoContext@SemEval2019 dataset and we have obtained the micro-averaged F1 scores as 0.595 and 0.6568 for the pre-evaluation dataset and final evaluation test set respectively. Our approach improved the base line score by 7% for final evaluation test set.

SWAP at SemEval-2019 Task 3: Emotion detection in conversations through Tweets, CNN and LSTM deep neural networks
Marco Polignano | Marco de Gemmis | Giovanni Semeraro

Emotion detection from user-generated contents is growing in importance in the area of natural language processing. The approach we proposed for the EmoContext task is based on the combination of a CNN and an LSTM using a concatenation of word embeddings. A stack of convolutional neural networks (CNN) is used for capturing the hierarchical hidden relations among embedding features. Meanwhile, a long short-term memory network (LSTM) is used for capturing information shared among words of the sentence. Each conversation has been formalized as a list of word embeddings, in particular during experimental runs pre-trained Glove and Google word embeddings have been evaluated. Surface lexical features have been also considered, but they have been demonstrated to be not usefully for the classification in this specific task. The final system configuration achieved a micro F1 score of 0.7089. The python code of the system is fully available at

SymantoResearch at SemEval-2019 Task 3: Combined Neural Models for Emotion Classification in Human-Chatbot Conversations
Angelo Basile | Marc Franco-Salvador | Neha Pawar | Sanja Štajner | Mara Chinea Rios | Yassine Benajiba

In this paper, we present our participation to the EmoContext shared task on detecting emotions in English textual conversations between a human and a chatbot. We propose four neural systems and combine them to further improve the results. We show that our neural ensemble systems can successfully distinguish three emotions (SAD, HAPPY, and ANGRY) and separate them from the rest (OTHERS) in a highly-imbalanced scenario. Our best system achieved a 0.77 F1-score and was ranked fourth out of 165 submissions.

TDBot at SemEval-2019 Task 3: Context Aware Emotion Detection Using A Conditioned Classification Approach
Sourabh Maity

With the system description it is shown how to use the context information while detecting the emotion in a dialogue. Some guidelines about how to handle emojis was also laid out. While developing this system I realized the importance of pre-processing in conversational text data, or in general NLP related tasks; it can not be over emphasized.

THU_NGN at SemEval-2019 Task 3: Dialog Emotion Classification using Attentional LSTM-CNN
Suyu Ge | Tao Qi | Chuhan Wu | Yongfeng Huang

With the development of the Internet, dialog systems are widely used in online platforms to provide personalized services for their users. It is important to understand the emotions through conversations to improve the quality of dialog systems. To facilitate the researches on dialog emotion recognition, the SemEval-2019 Task 3 named EmoContext is proposed. This task aims to classify the emotions of user utterance along with two short turns of dialogues into four categories. In this paper, we propose an attentional LSTM-CNN model to participate in this shared task. We use a combination of convolutional neural networks and long-short term neural networks to capture both local and long-distance contextual information in conversations. In addition, we apply attention mechanism to recognize and attend to important words within conversations. Besides, we propose to use ensemble strategies by combing the variants of our model with different pre-trained word embeddings via weighted voting. Our model achieved 0.7542 micro-F1 score in the final test data, ranking 15ˆth out of 165 teams.

THU-HCSI at SemEval-2019 Task 3: Hierarchical Ensemble Classification of Contextual Emotion in Conversation
Xihao Liang | Ye Ma | Mingxing Xu

In this paper, we describe our hierarchical ensemble system designed for the SemEval-2019 task3, EmoContext. In our system, three sets of classifiers are trained for different sub-targets and the predicted labels of these base classifiers are combined through three steps of voting to make the final prediction. Effective details for developing base classifiers are highlighted.

TokyoTech_NLP at SemEval-2019 Task 3: Emotion-related Symbols in Emotion Detection
Zhishen Yang | Sam Vijlbrief | Naoaki Okazaki

This paper presents our contextual emotion detection system in approaching the SemEval2019 shared task 3: EmoContext: Contextual Emotion Detection in Text. This system cooperates with an emotion detection neural network method (Poria et al., 2017), emoji2vec (Eisner et al., 2016) embedding, word2vec embedding (Mikolov et al., 2013), and our proposed emoticon and emoji preprocessing method. The experimental results demonstrate the usefulness of our emoticon and emoji prepossessing method, and representations of emoticons and emoji contribute model’s emotion detection.

UAIC at SemEval-2019 Task 3: Extracting Much from Little
Cristian Simionescu | Ingrid Stoleru | Diana Lucaci | Gheorghe Balan | Iulian Bute | Adrian Iftene

In this paper, we present a system description for implementing a sentiment analysis agent capable of interpreting the state of an interlocutor engaged in short three message conversations. We present the results and observations of our work and which parts could be further improved in the future.

YUN-HPCC at SemEval-2019 Task 3: Multi-Step Ensemble Neural Network for Sentiment Analysis in Textual Conversation
Dawei Li | Jin Wang | Xuejie Zhang

This paper describes our approach to the sentiment analysis of Twitter textual conversations based on deep learning. We analyze the syntax, abbreviations, and informal-writing of Twitter; and perform perfect data preprocessing on the data to convert them to normative text. We apply a multi-step ensemble strategy to solve the problem of extremely unbalanced data in the training set. This is achieved by taking the GloVe and Elmo word vectors as input into a combination model with four different deep neural networks. The experimental results from the development dataset demonstrate that the proposed model exhibits a strong generalization ability. For evaluation on the best dataset, we integrated the results using the stacking ensemble learning approach and achieved competitive results. According to the final official review, the results of our model ranked 10th out of 165 teams.

KDEHatEval at SemEval-2019 Task 5: A Neural Network Model for Detecting Hate Speech in Twitter
Umme Aymun Siddiqua | Abu Nowshed Chy | Masaki Aono

In the age of emerging volume of microblog platforms, especially twitter, hate speech propagation is now of great concern. However, due to the brevity of tweets and informal user generated contents, detecting and analyzing hate speech on twitter is a formidable task. In this paper, we present our approach for detecting hate speech in tweets defined in the SemEval-2019 Task 5. Our team KDEHatEval employs different neural network models including multi-kernel convolution (MKC), nested LSTMs (NLSTMs), and multi-layer perceptron (MLP) in a unified architecture. Moreover, we utilize the state-of-the-art pre-trained sentence embedding models including DeepMoji, InferSent, and BERT for effective tweet representation. We analyze the performance of our method and demonstrate the contribution of each component of our architecture.

ABARUAH at SemEval-2019 Task 5 : Bi-directional LSTM for Hate Speech Detection
Arup Baruah | Ferdous Barbhuiya | Kuntal Dey

In this paper, we present the results obtained using bi-directional long short-term memory (BiLSTM) with and without attention and Logistic Regression (LR) models for SemEval-2019 Task 5 titled ”HatEval: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter”. This paper presents the results obtained for Subtask A for English language. The results of the BiLSTM and LR models are compared for two different types of preprocessing. One with no stemming performed and no stopwords removed. The other with stemming performed and stopwords removed. The BiLSTM model without attention performed the best for the first test, while the LR model with character n-grams performed the best for the second test. The BiLSTM model obtained an F1 score of 0.51 on the test set and obtained an official ranking of 8/71.

Amobee at SemEval-2019 Tasks 5 and 6: Multiple Choice CNN Over Contextual Embedding
Alon Rozental | Dadi Biton

This article describes Amobee’s participation in “HatEval: Multilingual detection of hate speech against immigrants and women in Twitter” (task 5) and “OffensEval: Identifying and Categorizing Offensive Language in Social Media” (task 6). The goal of task 5 was to detect hate speech targeted to women and immigrants. The goal of task 6 was to identify and categorized offensive language in social media, and identify offense target. We present a novel type of convolutional neural network called “Multiple Choice CNN” (MC- CNN) that we used over our newly developed contextual embedding, Rozental et al. (2019). For both tasks we used this architecture and achieved 4th place out of 69 participants with an F1 score of 0.53 in task 5, in task 6 achieved 2nd place (out of 75) in Sub-task B - automatic categorization of offense types (our model reached places 18/2/7 out of 103/75/65 for sub-tasks A, B and C respectively in task 6).

CIC at SemEval-2019 Task 5: Simple Yet Very Efficient Approach to Hate Speech Detection, Aggressive Behavior Detection, and Target Classification in Twitter
Iqra Ameer | Muhammad Hammad Fahim Siddiqui | Grigori Sidorov | Alexander Gelbukh

In recent years, the use of social media has in-creased incredibly. Social media permits Inter-net users a friendly platform to express their views and opinions. Along with these nice and distinct communication chances, it also allows bad things like usage of hate speech. Online automatic hate speech detection in various aspects is a significant scientific problem. This paper presents the Instituto Politécnico Nacional (Mexico) approach for the Semeval 2019 Task-5 [Hateval 2019] (Basile et al., 2019) competition for Multilingual Detection of Hate Speech on Twitter. The goal of this paper is to detect (A) Hate speech against immigrants and women, (B) Aggressive behavior and target classification, both for English and Spanish. In the proposed approach, we used a bag of words model with preprocessing (stem-ming and stop words removal). We submitted two different systems with names: (i) CIC-1 and (ii) CIC-2 for Hateval 2019 shared task. We used TF values in the first system and TF-IDF for the second system. The first system, CIC-1 got 2nd rank in subtask B for both English and Spanish languages with EMR score of 0.568 for English and 0.675 for Spanish. The second system, CIC-2 was ranked 4th in sub-task A and 1st in subtask B for Spanish language with a macro-F1 score of 0.727 and EMR score of 0.705 respectively.

CiTIUS-COLE at SemEval-2019 Task 5: Combining Linguistic Features to Identify Hate Speech Against Immigrants and Women on Multilingual Tweets
Sattam Almatarneh | Pablo Gamallo | Francisco J. Ribadas Pena

This article describes the strategy submitted by the CiTIUS-COLE team to SemEval 2019 Task 5, a task which consists of binary classi- fication where the system predicting whether a tweet in English or in Spanish is hateful against women or immigrants or not. The proposed strategy relies on combining linguis- tic features to improve the classifier’s perfor- mance. More precisely, the method combines textual and lexical features, embedding words with the bag of words in Term Frequency- Inverse Document Frequency (TF-IDF) repre- sentation. The system performance reaches about 81% F1 when it is applied to the training dataset, but its F1 drops to 36% on the official test dataset for the English and 64% for the Spanish language concerning the hate speech class

Grunn2019 at SemEval-2019 Task 5: Shared Task on Multilingual Detection of Hate
Mike Zhang | Roy David | Leon Graumans | Gerben Timmerman

Hate speech occurs more often than ever and polarizes society. To help counter this polarization, SemEval 2019 organizes a shared task called the Multilingual Detection of Hate. The first task (A) is to decide whether a given tweet contains hate against immigrants or women, in a multilingual perspective, for English and Spanish. In the second task (B), the system is also asked to classify the following sub-tasks: hateful tweets as aggressive or not aggressive, and to identify the target harassed as individual or generic. We evaluate multiple models, and finally combine them in an ensemble setting. This ensemble setting is built of five and three submodels for the English and Spanish task respectively. In the current setup it shows that using a bigger ensemble for English tweets performs mediocre, while a slightly smaller ensemble does work well for detecting hate speech in Spanish tweets. Our results on the test set for English show 0.378 macro F1 on task A and 0.553 macro F1 on task B. For Spanish the results are significantly higher, 0.701 macro F1 on task A and 0.734 macro F1 for task B.

GSI-UPM at SemEval-2019 Task 5: Semantic Similarity and Word Embeddings for Multilingual Detection of Hate Speech Against Immigrants and Women on Twitter
Diego Benito | Oscar Araque | Carlos A. Iglesias

This paper describes the GSI-UPM system for SemEval-2019 Task 5, which tackles multilingual detection of hate speech on Twitter. The main contribution of the paper is the use of a method based on word embeddings and semantic similarity combined with traditional paradigms, such as n-grams, TF-IDF and POS. This combination of several features is fine-tuned through ablation tests, demonstrating the usefulness of different features. While our approach outperforms baseline classifiers on different sub-tasks, the best of our submitted runs reached the 5th position on the Spanish sub-task A.

HATEMINER at SemEval-2019 Task 5: Hate speech detection against Immigrants and Women in Twitter using a Multinomial Naive Bayes Classifier
Nikhil Chakravartula

This paper describes our participation in the SemEval 2019 Task 5 - Multilingual Detection of Hate. This task aims to identify hate speech against two specific targets, immigrants and women. We compare and contrast the performance of different word and sentence level embeddings on the state-of-the-art classification algorithms. Our final submission is a Multinomial binarized Naive Bayes model for both the subtasks in the English version.

HATERecognizer at SemEval-2019 Task 5: Using Features and Neural Networks to Face Hate Recognition
Victor Nina-Alcocer

This paper presents a detailed description of our participation in task 5 on SemEval-2019. This task consists of classifying English and Spanish tweets that contain hate towards women or immigrants. We carried out several experiments; for a finer-grained study of the task, we analyzed different features and designing architectures of neural networks. Additionally, to face the lack of hate content in tweets, we include data augmentation as a technique to in- crease hate content in our datasets.

GL at SemEval-2019 Task 5: Identifying hateful tweets with a deep learning approach.
Gretel Liz De la Peña

This paper describes the system we developed for SemEval 2019 on Multilingual detection of hate speech against immigrants and women in Twitter (HatEval - Task 5). We use an approach based on an Attention-based Long Short-Term Memory Recurrent Neural Network. In particular, we build a Bidirectional LSTM to extract information from the word embeddings over the sentence, then apply attention over the hidden states to estimate the importance of each word and finally feed this context vector to another LSTM model to get a representation. Then, the output obtained with this model is used to get the prediction of each of the sub-tasks.

INF-HatEval at SemEval-2019 Task 5: Convolutional Neural Networks for Hate Speech Detection Against Women and Immigrants on Twitter
Alison Ribeiro | Nádia Silva

In this paper, we describe our approach to detect hate speech against women and immigrants on Twitter in a multilingual context, English and Spanish. This challenge was proposed by the SemEval-2019 Task 5, where participants should develop models for hate speech detection, a two-class classification where systems have to predict whether a tweet in English or in Spanish with a given target (women or immigrants) is hateful or not hateful (Task A), and whether the hate speech is directed at a specific person or a group of individuals (Task B). For this, we implemented a Convolutional Neural Networks (CNN) using pre-trained word embeddings (GloVe and FastText) with 300 dimensions. Our proposed model obtained in Task A 0.488 and 0.696 F1-score for English and Spanish, respectively. For Task B, the CNN obtained 0.297 and 0.430 EMR for English and Spanish, respectively.

JCTDHS at SemEval-2019 Task 5: Detection of Hate Speech in Tweets using Deep Learning Methods, Character N-gram Features, and Preprocessing Methods
Yaakov HaCohen-Kerner | Elyashiv Shayovitz | Shalom Rochman | Eli Cahn | Gal Didi | Ziv Ben-David

In this paper, we describe our submissions to SemEval-2019 contest. We tackled subtask A - “a binary classification where systems have to predict whether a tweet with a given target (women or immigrants) is hateful or not hateful”, a part of task 5 “Multilingual detection of hate speech against immigrants and women in Twitter (hatEval)”. Our system JCTDHS (Jerusalem College of Technology Detects Hate Speech) was developed for tweets written in English. We applied various supervised ML methods, various combinations of n-gram features using the TF-IDF scheme and. In addition, we applied various combinations of eight basic preprocessing methods. Our best submission was a special bidirectional RNN, which was ranked at the 11th position out of 68 submissions.

Know-Center at SemEval-2019 Task 5: Multilingual Hate Speech Detection on Twitter using CNNs
Kevin Winter | Roman Kern

This paper presents the Know-Center system submitted for task 5 of the SemEval-2019 workshop. Given a Twitter message in either English or Spanish, the task is to first detect whether it contains hateful speech and second, to determine the target and level of aggression used. For this purpose our system utilizes word embeddings and a neural network architecture, consisting of both dilated and traditional convolution layers. We achieved average F1-scores of 0.57 and 0.74 for English and Spanish respectively.

LT3 at SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (hatEval)
Nina Bauwelinck | Gilles Jacobs | Véronique Hoste | Els Lefever

This paper describes our contribution to the SemEval-2019 Task 5 on the detection of hate speech against immigrants and women in Twitter (hatEval). We considered a supervised classification-based approach to detect hate speech in English tweets, which combines a variety of standard lexical and syntactic features with specific features for capturing offensive language. Our experimental results show good classification performance on the training data, but a considerable drop in recall on the held-out test set.

ltl.uni-due at SemEval-2019 Task 5: Simple but Effective Lexico-Semantic Features for Detecting Hate Speech in Twitter
Huangpan Zhang | Michael Wojatzki | Tobias Horsmann | Torsten Zesch

In this paper, we present our contribution to SemEval 2019 Task 5 Multilingual Detection of Hate, specifically in the Subtask A (English and Spanish). We compare different configurations of shallow and deep learning approaches on the English data and use the system that performs best in both sub-tasks. The resulting SVM-based system with lexicosemantic features (n-grams and embeddings) is ranked 23rd out of 69 on the English data and beats the baseline system. On the Spanish data our system is ranked 25th out of 39.

MineriaUNAM at SemEval-2019 Task 5: Detecting Hate Speech in Twitter using Multiple Features in a Combinatorial Framework
Luis Enrique Argota Vega | Jorge Carlos Reyes-Magaña | Helena Gómez-Adorno | Gemma Bel-Enguix

This paper presents our approach to the Task 5 of Semeval-2019, which aims at detecting hate speech against immigrants and women in Twitter. The task consists of two sub-tasks, in Spanish and English: (A) detection of hate speech and (B) classification of hateful tweets as aggressive or not, and identification of the target harassed as individual or group. We used linguistically motivated features and several types of n-grams (words, characters, functional words, punctuation symbols, POS, among others). For task A, we trained a Support Vector Machine using a combinatorial framework, whereas for task B we followed a multi-labeled approach using the Random Forest classifier. Our approach achieved the highest F1-score in sub-task A for the Spanish language.

MITRE at SemEval-2019 Task 5: Transfer Learning for Multilingual Hate Speech Detection
Abigail Gertner | John Henderson | Elizabeth Merkhofer | Amy Marsh | Ben Wellner | Guido Zarrella

This paper describes MITRE’s participation in SemEval-2019 Task 5, HatEval: Multilingual detection of hate speech against immigrants and women in Twitter. The techniques explored range from simple bag-of-ngrams classifiers to neural architectures with varied attention mechanisms. We describe several styles of transfer learning from auxiliary tasks, including a novel method for adapting pre-trained BERT models to Twitter data. Logistic regression ties the systems together into an ensemble submitted for evaluation. The resulting system was used to produce predictions for all four HatEval subtasks, achieving the best mean rank of all teams that participated in all four conditions.

Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter at SemEval-2019 Task 5: Frequency Analysis Interpolation for Hate in Speech Detection
Òscar Garibo i Orts

This document describes a text change of representation approach to the task of Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter, as part of SemEval-2019 1 . The task is divided in two sub-tasks. Sub-task A consists in classifying tweets as being hateful or not hateful, whereas sub-task B requires fine tuning the classification by classifying the hateful tweets as being directed to single individuals or generic, if the tweet is aggressive or not. Our approach consists of a change of the space of representation of text into statistical descriptors which characterize the text. In addition, dimensional reduction is performed to 6 characteristics per class in order to make the method suitable for a Big Data environment. Frequency Analysis Interpolation (FAI) is the approach we use to achieve rank 5th in Spanish language and 9th in English language in sub-task B in both cases.

STUFIIT at SemEval-2019 Task 5: Multilingual Hate Speech Detection on Twitter with MUSE and ELMo Embeddings
Michal Bojkovský | Matúš Pikuliak

We present a number of models used for hate speech detection for Semeval 2019 Task-5: Hateval. We evaluate the viability of multilingual learning for this task. We also experiment with adversarial learning as a means of creating a multilingual model. Ultimately our multilingual models have had worse results than their monolignual counterparts. We find that the choice of word representations (word embeddings) is very crucial for deep learning as a simple switch between MUSE and ELMo embeddings has shown a 3-4% increase in accuracy. This also shows the importance of context when dealing with online content.

Saagie at Semeval-2019 Task 5: From Universal Text Embeddings and Classical Features to Domain-specific Text Classification
Miriam Benballa | Sebastien Collet | Romain Picot-Clemente

This paper describes our contribution to SemEval 2019 Task 5: Hateval. We propose to investigate how domain-specific text classification task can benefit from pretrained state of the art language models and how they can be combined with classical handcrafted features. For this purpose, we propose an approach based on a feature-level Meta-Embedding to let the model choose which features to keep and how to use them.

SINAI at SemEval-2019 Task 5: Ensemble learning to detect hate speech against inmigrants and women in English and Spanish tweets
Flor Miriam Plaza-del-Arco | M. Dolores Molina-González | Maite Martin | L. Alfonso Ureña-López

Misogyny and xenophobia are some of the most important social problems. With the in- crease in the use of social media, this feeling ofhatred towards women and immigrants can be more easily expressed, therefore it can cause harmful effects on social media users. For this reason, it is important to develop systems ca- pable of detecting hateful comments automatically. In this paper, we describe our system to analyze the hate speech in English and Spanish tweets against Immigrants and Women as part of our participation in SemEval-2019 Task 5: hatEval. Our main contribution is the integration of three individual algorithms of predic- tion in a model based on Vote ensemble classifier.

SINAI-DL at SemEval-2019 Task 5: Recurrent networks and data augmentation by paraphrasing
Arturo Montejo-Ráez | Salud María Jiménez-Zafra | Miguel A. García-Cumbreras | Manuel Carlos Díaz-Galiano

This paper describes the participation of the SINAI-DL team at Task 5 in SemEval 2019, called HatEval. We have applied some classic neural network layers, like word embeddings and LSTM, to build a neural classifier for both proposed tasks. Due to the small amount of training data provided compared to what is expected for an adequate learning stage in deep architectures, we explore the use of paraphrasing tools as source for data augmentation. Our results show that this method is promising, as some improvement has been found over non-augmented training sets.

sthruggle at SemEval-2019 Task 5: An Ensemble Approach to Hate Speech Detection
Aria Nourbakhsh | Frida Vermeer | Gijs Wiltvank | Rob van der Goot

In this paper, we present our approach to detection of hate speech against women and immigrants in tweets for our participation in the SemEval-2019 Task 5. We trained an SVM and an RF classifier using character bi- and trigram features and a BiLSTM pre-initialized with external word embeddings. We combined the predictions of the SVM, RF and BiLSTM in two different ensemble models. The first was a majority vote of the binary values, and the second used the average of the confidence scores. For development, we got the highest accuracy (75%) by the final ensemble model with majority voting. For testing, all models scored substantially lower and the scores between the classifiers varied more. We believe that these large differences between the higher accuracies in the development phase and the lower accuracies we obtained in the testing phase have partly to do with differences between the training, development and testing data.

The binary trio at SemEval-2019 Task 5: Multitarget Hate Speech Detection in Tweets
Patricia Chiril | Farah Benamara Zitoune | Véronique Moriceau | Abhishek Kumar

The massive growth of user-generated web content through blogs, online forums and most notably, social media networks, led to a large spreading of hatred or abusive messages which have to be moderated. This paper proposes a supervised approach to hate speech detection towards immigrants and women in English tweets. Several models have been developed ranging from feature-engineering approaches to neural ones.

The Titans at SemEval-2019 Task 5: Detection of hate speech against immigrants and women in Twitter
Avishek Garain | Arpan Basu

This system paper is a description of the system submitted to ”SemEval-2019 Task 5” Task B for the English language, where we had to primarily detect hate speech and then detect aggressive behaviour and its target audience in Twitter. There were two specific target audiences, immigrants and women. The language of the tweets was English. We were required to first detect whether a tweet is containing hate speech. Thereafter we were required to find whether the tweet was showing aggressive behaviour, and then we had to find whether the targeted audience was an individual or a group of people.

TuEval at SemEval-2019 Task 5: LSTM Approach to Hate Speech Detection in English and Spanish
Mihai Manolescu | Denise Löfflad | Adham Nasser Mohamed Saber | Masoumeh Moradipour Tari

The detection of hate speech, especially in online platforms and forums, is quickly becoming a hot topic as anti-hate speech legislation begins to be applied to public discourse online. The HatEval shared task was created with this in mind; participants were expected to develop a model capable of determining whether or not input (in this case, Twitter datasets in English and Spanish) could be considered hate speech (designated as Task A), if they were aggressive, and whether the tweet was targeting an individual, or speaking generally (Task B). We approached this task by creating an LSTM model with an embedding layer. We found that our model performed considerably better on English language input when compared to Spanish language input. In English, we achieved an F1-Score of 0.466 for Task A and 0.462 for Task B; In Spanish, we achieved scores of 0.617 and 0.612 on Task A and Task B, respectively.

Tw-StAR at SemEval-2019 Task 5: N-gram embeddings for Hate Speech Detection in Multilingual Tweets
Hala Mulki | Chedi Bechikh Ali | Hatem Haddad | Ismail Babaoğlu

In this paper, we describe our contribution in SemEval-2019: subtask A of task 5 “Multilingual detection of hate speech against immigrants and women in Twitter (HatEval)”. We developed two hate speech detection model variants through Tw-StAR framework. While the first model adopted one-hot encoding ngrams to train an NB classifier, the second generated and learned n-gram embeddings within a feedforward neural network. For both models, specific terms, selected via MWT patterns, were tagged in the input data. With two feature types employed, we could investigate the ability of n-gram embeddings to rival one-hot n-grams. Our results showed that in English, n-gram embeddings outperformed one-hot ngrams. However, representing Spanish tweets by one-hot n-grams yielded a slightly better performance compared to that of n-gram embeddings. The official ranking indicated that Tw-StAR ranked 9th for English and 20th for Spanish.

UA at SemEval-2019 Task 5: Setting A Strong Linear Baseline for Hate Speech Detection
Carlos Perelló | David Tomás | Alberto Garcia-Garcia | Jose Garcia-Rodriguez | Jose Camacho-Collados

This paper describes the system developed at the University of Alicante (UA) for the SemEval 2019 Task 5: Shared Task on Multilingual Detection of Hate. The purpose of this work is to build a strong baseline for hate speech detection, using a traditional machine learning approach with standard textual features, which could serve in a near future as a reference to compare with deep learning systems. We participated in both task A (Hate Speech Detection against Immigrants and Women) and task B (Aggressive behavior and Target Classification). Despite its simplicity, our system obtained a remarkable F1-score of 72.5 (sixth highest) and an accuracy of 73.6 (second highest) in Spanish (task A), outperforming more complex neural models from a total of 40 participant systems.

UNBNLP at SemEval-2019 Task 5 and 6: Using Language Models to Detect Hate Speech and Offensive Language
Ali Hakimi Parizi | Milton King | Paul Cook

In this paper we apply a range of approaches to language modeling – including word-level n-gram and neural language models, and character-level neural language models – to the problem of detecting hate speech and offensive language. Our findings indicate that language models are able to capture knowledge of whether text is hateful or offensive. However, our findings also indicate that more conventional approaches to text classification often perform similarly or better.

UTFPR at SemEval-2019 Task 5: Hate Speech Identification with Recurrent Neural Networks
Gustavo Henrique Paetzold | Marcos Zampieri | Shervin Malmasi

In this paper we revisit the problem of automatically identifying hate speech in posts from social media. We approach the task using a system based on minimalistic compositional Recurrent Neural Networks (RNN). We tested our approach on the SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (HatEval) shared task dataset. The dataset made available by the HatEval organizers contained English and Spanish posts retrieved from Twitter annotated with respect to the presence of hateful content and its target. In this paper we present the results obtained by our system in comparison to the other entries in the shared task. Our system achieved competitive performance ranking 7th in sub-task A out of 62 systems in the English track.

Vista.ue at SemEval-2019 Task 5: Single Multilingual Hate Speech Detection Model
Kashyap Raiyani | Teresa Gonçalves | Paulo Quaresma | Vitor Nogueira

This paper shares insight from participating in SemEval-2019 Task 5. The main propose of this system-description paper is to facilitate the reader with replicability and to provide insightful analysis of the developed system. Here in Vista.ue, we proposed a single multilingual hate speech detection model. This model was ranked 46/70 for English Task A and 31/43 for English Task B. Vista.ue was able to rank 38/41 for Spanish Task A and 22/25 for Spanish Task B.

YNU NLP at SemEval-2019 Task 5: Attention and Capsule Ensemble for Identifying Hate Speech
Bin Wang | Haiyan Ding

This paper describes the system submitted to SemEval 2019 Task 5: Multilingual detection of hate speech against immigrants and women in Twitter (hatEval). Its main purpose is to conduct hate speech detection on Twitter, which mainly includes two specific different targets, immigrants and women. We participate in both subtask A and subtask B for English. In order to address this task, we develope an ensemble of an attention-LSTM model based on HAN and an BiGRU-capsule model. Both models use fastText pre-trained embeddings, and we use this model in both subtasks. In comparison to other participating teams, our system is ranked 16th in the Sub-task A for English, and 12th in the Sub-task B for English.

YNU_DYX at SemEval-2019 Task 5: A Stacked BiGRU Model Based on Capsule Network in Detection of Hate
Yunxia Ding | Xiaobing Zhou | Xuejie Zhang

This paper describes our system designed for SemEval 2019 Task 5 “Shared Task on Multilingual Detection of Hate”.We only participate in subtask-A in English. To address this task, we present a stacked BiGRU model based on a capsule network system. In or- der to convert the tweets into corresponding vector representations and input them into the neural network, we use the fastText tools to get word representations. Then, the sentence representation is enriched by stacked Bidirectional Gated Recurrent Units (BiGRUs) and used as the input of capsule network. Our system achieves an average F1-score of 0.546 and ranks 3rd in the subtask-A in English.

Amrita School of Engineering - CSE at SemEval-2019 Task 6: Manipulating Attention with Temporal Convolutional Neural Network for Offense Identification and Classification
Murali Sridharan | Swapna TR

With the proliferation and ubiquity of smart gadgets and smart devices, across the world, data generated by them has been growing at exponential rates; in particular social media platforms like Facebook, Twitter and Instagram have been generating voluminous data on a daily basis. According to Twitter’s usage statistics, about 500 million tweets are generated each day. While the tweets reflect the users’ opinions on several events across the world, there are tweets which are offensive in nature that need to be tagged under the hateful conduct policy of Twitter. Offensive tweets have to be identified, captured and processed further, for a variety of reasons, which include i) identifying offensive tweets in order to prevent violent/abusive behavior in Twitter (or any social media for that matter), ii) creating and maintaining a history of offensive tweets for individual users (would be helpful in creating meta-data for user profile), iii) inferring the sentiment of the users on particular event/issue/topic . We have employed neural network models which manipulate attention with Temporal Convolutional Neural Network for the three shared sub-tasks i) ATT-TCN (ATTention based Temporal Convolutional Neural Network) employed for shared sub-task A that yielded a best macro-F1 score of 0.46, ii) SAE-ATT-TCN(Self Attentive Embedding-ATTention based Temporal Convolutional Neural Network) employed for shared sub-task B and sub-task C that yielded best macro-F1 score of 0.61 and 0.51 respectively. Among the two variants ATT-TCN and SAE-ATT-TCN, the latter performed better.

bhanodaig at SemEval-2019 Task 6: Categorizing Offensive Language in social media
Ritesh Kumar | Guggilla Bhanodai | Rajendra Pamula | Maheswara Reddy Chennuru

This paper describes the work that our team bhanodaig did at Indian Institute of Technology (ISM) towards OffensEval i.e. identifying and categorizing offensive language in social media. Out of three sub-tasks, we have participated in sub-task B: automatic categorization of offensive types. We perform the task of categorizing offensive language, whether the tweet is targeted insult or untargeted. We use Linear Support Vector Machine for classification. The official ranking metric is macro-averaged F1. Our system gets the score 0.5282 with accuracy 0.8792. However, as new entrant to the field, our scores are encouraging enough to work for better results in future.

BNU-HKBU UIC NLP Team 2 at SemEval-2019 Task 6: Detecting Offensive Language Using BERT model
Zhenghao Wu | Hao Zheng | Jianming Wang | Weifeng Su | Jefferson Fong

In this study we deal with the problem of identifying and categorizing offensive language in social media. Our group, BNU-HKBU UIC NLP Team2, use supervised classification along with multiple version of data generated by different ways of pre-processing the data. We then use the state-of-the-art model Bidirectional Encoder Representations from Transformers, or BERT (Devlin et al, 2018), to capture linguistic, syntactic and semantic features. Long range dependencies between each part of a sentence can be captured by BERT’s bidirectional encoder representations. Our results show 85.12% accuracy and 80.57% F1 scores in Subtask A (offensive language identification), 87.92% accuracy and 50% F1 scores in Subtask B (categorization of offense types), and 69.95% accuracy and 50.47% F1 score in Subtask C (offense target identification). Analysis of the results shows that distinguishing between targeted and untargeted offensive language is not a simple task. More work needs to be done on the unbalance data problem in Subtasks B and C. Some future work is also discussed.

CAMsterdam at SemEval-2019 Task 6: Neural and graph-based feature extraction for the identification of offensive tweets
Guy Aglionby | Chris Davis | Pushkar Mishra | Andrew Caines | Helen Yannakoudakis | Marek Rei | Ekaterina Shutova | Paula Buttery

We describe the CAMsterdam team entry to the SemEval-2019 Shared Task 6 on offensive language identification in Twitter data. Our proposed model learns to extract textual features using a multi-layer recurrent network, and then performs text classification using gradient-boosted decision trees (GBDT). A self-attention architecture enables the model to focus on the most relevant areas in the text. In order to enrich input representations, we use node2vec to learn globally optimised embeddings for hashtags, which are then given as additional features to the GBDT classifier. Our best model obtains 78.79% macro F1-score on detecting offensive language (subtask A), 66.32% on categorising offence types (targeted/untargeted; subtask B), and 55.36% on identifying the target of offence (subtask C).

CN-HIT-MI.T at SemEval-2019 Task 6: Offensive Language Identification Based on BiLSTM with Double Attention
Yaojie Zhang | Bing Xu | Tiejun Zhao

Offensive language has become pervasive in social media. In Offensive Language Identification tasks, it may be difficult to predict accurately only according to the surface words. So we try to dig deeper semantic information of text. This paper presents use an attention-based two layers bidirectional longshort memory neural network (BiLSTM) for semantic feature extraction. Additionally, a residual connection mechanism is used to synthesize two different deep features, and an emoji attention mechanism is used to extract semantic information of emojis in text. We participated in three sub-tasks of SemEval 2019 Task 6 as CN-HIT-MI.T team. Our macro-averaged F1-score in sub-task A is 0.768, ranking 28/103. We got 0.638 in sub-task B, ranking 30/75. In sub-task C, we got 0.549, ranking 22/65. We also tried some other methods of not submitting results.

ConvAI at SemEval-2019 Task 6: Offensive Language Identification and Categorization with Perspective and BERT
John Pavlopoulos | Nithum Thain | Lucas Dixon | Ion Androutsopoulos

This paper presents the application of two strong baseline systems for toxicity detection and evaluates their performance in identifying and categorizing offensive language in social media. PERSPECTIVE is an API, that serves multiple machine learning models for the improvement of conversations online, as well as a toxicity detection system, trained on a wide variety of comments from platforms across the Internet. BERT is a recently popular language representation model, fine tuned per task and achieving state of the art performance in multiple NLP tasks. PERSPECTIVE performed better than BERT in detecting toxicity, but BERT was much better in categorizing the offensive type. Both baselines were ranked surprisingly high in the SEMEVAL-2019 OFFENSEVAL competition, PERSPECTIVE in detecting an offensive post (12th) and BERT in categorizing it (11th). The main contribution of this paper is the assessment of two strong baselines for the identification (PERSPECTIVE) and the categorization (BERT) of offensive language with little or no additional training data.

DA-LD-Hildesheim at SemEval-2019 Task 6: Tracking Offensive Content with Deep Learning using Shallow Representation
Sandip Modha | Prasenjit Majumder | Daksh Patel

This paper presents the participation of team DA-LD-Hildesheim of Information Retrieval and Language Processing lab at DA-IICT, India in Semeval-19 OffenEval track. The aim of this shared task is to identify offensive content at fined-grained level granularity. The task is divided into three sub-tasks. The system is required to check whether social media posts contain any offensive or profane content or not, targeted or untargeted towards any entity and classifying targeted posts into the individual, group or other categories. Social media posts suffer from data sparsity problem, Therefore, the distributed word representation technique is chosen over the Bag-of-Words for the text representation. Since limited labeled data was available for the training, pre-trained word vectors are used and fine-tuned on this classification task. Various deep learning models based on LSTM, Bidirectional LSTM, CNN, and Stacked CNN are used for the classification. It has been observed that labeled data was highly affected with class imbalance and our technique to handle the class-balance was not effective, in fact performance was degraded in some of the runs. Macro F1 score is used as a primary evaluation metric for the performance. Our System achieves Macro F1 score = 0.7833 in sub-task A, 0.6456 in the sub-task B and 0.5533 in the sub-task C.

DeepAnalyzer at SemEval-2019 Task 6: A deep learning-based ensemble method for identifying offensive tweets
Gretel Liz De la Peña | Paolo Rosso

This paper describes the system we developed for SemEval 2019 on Identifying and Categorizing Offensive Language in Social Media (OffensEval - Task 6). The task focuses on offensive language in tweets. It is organized into three sub-tasks for offensive language identification; automatic categorization of offense types and offense target identification. The approach for the first subtask is a deep learning-based ensemble method which uses a Bidirectional LSTM Recurrent Neural Network and a Convolutional Neural Network. Additionally we use the information from part-of-speech tagging of tweets for target identification and combine previous results for categorization of offense types.

NLP at SemEval-2019 Task 6: Detecting Offensive language using Neural Networks
Prashant Kapil | Asif Ekbal | Dipankar Das

In this paper we built several deep learning architectures to participate in shared task OffensEval: Identifying and categorizing Offensive language in Social media by semEval-2019. The dataset was annotated with three level annotation schemes and task was to detect between offensive and not offensive, categorization and target identification in offensive contents. Deep learning models with POS information as feature were also leveraged for classification. The three best models that performed best on individual sub tasks are stacking of CNN-Bi-LSTM with Attention, BiLSTM with POS information added with word features and Bi-LSTM for third task. Our models achieved a Macro F1 score of 0.7594, 0.5378 and 0.4588 in Task(A,B,C) respectively with rank of 33rd, 54th and 52nd out of 103, 75 and 65 submissions.The three best models that performed best on individual sub task are using Neural Networks.

Duluth at SemEval-2019 Task 6: Lexical Approaches to Identify and Categorize Offensive Tweets
Ted Pedersen

This paper describes the Duluth systems that participated in SemEval–2019 Task 6, Identifying and Categorizing Offensive Language in Social Media (OffensEval). For the most part these systems took traditional Machine Learning approaches that built classifiers from lexical features found in manually labeled training data. However, our most successful system for classifying a tweet as offensive (or not) was a rule-based black–list approach, and we also experimented with combining the training data from two different but related SemEval tasks. Our best systems in each of the three OffensEval tasks placed in the middle of the comparative evaluation, ranking 57th of 103 in task A, 39th of 75 in task B, and 44th of 65 in task C.

Emad at SemEval-2019 Task 6: Offensive Language Identification using Traditional Machine Learning and Deep Learning approaches
Emad Kebriaei | Samaneh Karimi | Nazanin Sabri | Azadeh Shakery

In this paper, the used methods and the results obtained by our team, entitled Emad, on the OffensEval 2019 shared task organized at SemEval 2019 are presented. The OffensEval shared task includes three sub-tasks namely Offensive language identification, Automatic categorization of offense types and Offense target identification. We participated in sub-task A and tried various methods including traditional machine learning methods, deep learning methods and also a combination of the first two sets of methods. We also proposed a data augmentation method using word embedding to improve the performance of our methods. The results show that the augmentation approach outperforms other methods in terms of macro-f1.

Embeddia at SemEval-2019 Task 6: Detecting Hate with Neural Network and Transfer Learning Approaches
Andraž Pelicon | Matej Martinc | Petra Kralj Novak

SemEval 2019 Task 6 was OffensEval: Identifying and Categorizing Offensive Language in Social Media. The task was further divided into three sub-tasks: offensive language identification, automatic categorization of offense types, and offense target identification. In this paper, we present the approaches used by the Embeddia team, who qualified as fourth, eighteenth and fifth on the tree sub-tasks. A different model was trained for each sub-task. For the first sub-task, we used a BERT model fine-tuned on the OLID dataset, while for the second and third tasks we developed a custom neural network architecture which combines bag-of-words features and automatically generated sequence-based features. Our results show that combining automatically and manually crafted features fed into a neural architecture outperform transfer learning approach on more unbalanced datasets.

Fermi at SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media using Sentence Embeddings
Vijayasaradhi Indurthi | Bakhtiyar Syed | Manish Shrivastava | Manish Gupta | Vasudeva Varma

This paper describes our system (Fermi) for Task 6: OffensEval: Identifying and Categorizing Offensive Language in Social Media of SemEval-2019. We participated in all the three sub-tasks within Task 6. We evaluate multiple sentence embeddings in conjunction with various supervised machine learning algorithms and evaluate the performance of simple yet effective embedding-ML combination algorithms. Our team Fermi’s model achieved an F1-score of 64.40%, 62.00% and 62.60% for sub-task A, B and C respectively on the official leaderboard. Our model for sub-task C which uses pre-trained ELMo embeddings for transforming the input and uses SVM (RBF kernel) for training, scored third position on the official leaderboard. Through the paper we provide a detailed description of the approach, as well as the results obtained for the task.

Ghmerti at SemEval-2019 Task 6: A Deep Word- and Character-based Approach to Offensive Language Identification
Ehsan Doostmohammadi | Hossein Sameti | Ali Saffar

This paper presents the models submitted by Ghmerti team for subtasks A and B of the OffensEval shared task at SemEval 2019. OffensEval addresses the problem of identifying and categorizing offensive language in social media in three subtasks; whether or not a content is offensive (subtask A), whether it is targeted (subtask B) towards an individual, a group, or other entities (subtask C). The proposed approach includes character-level Convolutional Neural Network, word-level Recurrent Neural Network, and some preprocessing. The performance achieved by the proposed model is 77.93% macro-averaged F1-score.

HAD-Tübingen at SemEval-2019 Task 6: Deep Learning Analysis of Offensive Language on Twitter: Identification and Categorization
Himanshu Bansal | Daniel Nagel | Anita Soloveva

This paper describes the submissions of our team, HAD-Tübingen, for the SemEval 2019 - Task 6: “OffensEval: Identifying and Categorizing Offensive Language in Social Media”. We participated in all the three sub-tasks: Sub-task A - “Offensive language identification”, sub-task B - “Automatic categorization of offense types” and sub-task C - “Offense target identification”. As a baseline model we used a Long short-term memory recurrent neural network (LSTM) to identify and categorize offensive tweets. For all the tasks we experimented with external databases in a postprocessing step to enhance the results made by our model. The best macro-average F1 scores obtained for the sub-tasks A, B and C are 0.73, 0.52, and 0.37, respectively.

HHU at SemEval-2019 Task 6: Context Does Matter - Tackling Offensive Language Identification and Categorization with ELMo
Alexander Oberstrass | Julia Romberg | Anke Stoll | Stefan Conrad

We present our results for OffensEval: Identifying and Categorizing Offensive Language in Social Media (SemEval 2019 - Task 6). Our results show that context embeddings are important features for the three different sub-tasks in connection with classical machine and with deep learning. Our best model reached place 3 of 75 in sub-task B with a macro F1 of 0.719. Our approaches for sub-task A and C perform less well but could also deliver promising results.

Hope at SemEval-2019 Task 6: Mining social media language to discover offensive language
Gabriel Florentin Patras | Diana Florina Lungu | Daniela Gifu | Diana Trandabat

User’s content share through social media has reached huge proportions nowadays. However, along with the free expression of thoughts on social media, people risk getting exposed to various aggressive statements. In this paper, we present a system able to identify and classify offensive user-generated content.

INGEOTEC at SemEval-2019 Task 5 and Task 6: A Genetic Programming Approach for Text Classification
Mario Graff | Sabino Miranda-Jiménez | Eric Tellez | Daniela Alejandra Ochoa

This paper describes our participation in HatEval and OffensEval challenges for English and Spanish languages. We used several approaches, B4MSA, FastText, and EvoMSA. Best results were achieved with EvoMSA, which is a multilingual and domain-independent architecture that combines the prediction of different knowledge sources to solve text classification problems.

JCTICOL at SemEval-2019 Task 6: Classifying Offensive Language in Social Media using Deep Learning Methods, Word/Character N-gram Features, and Preprocessing Methods
Yaakov HaCohen-Kerner | Ziv Ben-David | Gal Didi | Eli Cahn | Shalom Rochman | Elyashiv Shayovitz

In this paper, we describe our submissions to SemEval-2019 task 6 contest. We tackled all three sub-tasks in this task “OffensEval - Identifying and Categorizing Offensive Language in Social Media”. In our system called JCTICOL (Jerusalem College of Technology Identifies and Categorizes Offensive Language), we applied various supervised ML methods. We applied various combinations of word/character n-gram features using the TF-IDF scheme. In addition, we applied various combinations of seven basic preprocessing methods. Our best submission, an RNN model was ranked at the 25th position out of 65 submissions for the most complex sub-task (C).

jhan014 at SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media
Jiahui Han | Shengtan Wu | Xinyu Liu

In this paper, we present two methods to identify and categorize the offensive language in Twitter. In the first method, we establish a probabilistic model to evaluate the sentence offensiveness level and target level according to different sub-tasks. In the second method, we develop a deep neural network consisting of bidirectional recurrent layers with Gated Recurrent Unit (GRU) cells and fully connected layers. In the comparison of two methods, we find both method has its own advantages and drawbacks while they have similar accuracy.

JTML at SemEval-2019 Task 6: Offensive Tweets Identification using Convolutional Neural Networks
Johnny Torres | Carmen Vaca

In this paper, we propose the use of a Convolutional Neural Network (CNN) to identify offensive tweets, as well as the type and target of the offense. We use an end-to-end model (i.e., no preprocessing) and fine-tune pre-trained embeddings (FastText) during training for learning words’ representation. We compare the proposed CNN model to a baseline model, such as Linear Regression, and several neural models. The results show that CNN outperforms other models, and stands as a simple but strong baseline in comparison to other systems submitted to the Shared Task.

JU_ETCE_17_21 at SemEval-2019 Task 6: Efficient Machine Learning and Neural Network Approaches for Identifying and Categorizing Offensive Language in Tweets
Preeti Mukherjee | Mainak Pal | Somnath Banerjee | Sudip Kumar Naskar

This paper describes our system submissions as part of our participation (team name: JU_ETCE_17_21) in the SemEval 2019 shared task 6: “OffensEval: Identifying and Catego- rizing Offensive Language in Social Media”. We participated in all the three sub-tasks: i) Sub-task A: offensive language identification, ii) Sub-task B: automatic categorization of of- fense types, and iii) Sub-task C: offense target identification. We employed machine learn- ing as well as deep learning approaches for the sub-tasks. We employed Convolutional Neural Network (CNN) and Recursive Neu- ral Network (RNN) Long Short-Term Memory (LSTM) with pre-trained word embeddings. We used both word2vec and Glove pre-trained word embeddings. We obtained the best F1- score using CNN based model for sub-task A, LSTM based model for sub-task B and Lo- gistic Regression based model for sub-task C. Our best submissions achieved 0.7844, 0.5459 and 0.48 F1-scores for sub-task A, sub-task B and sub-task C respectively.

KMI-Coling at SemEval-2019 Task 6: Exploring N-grams for Offensive Language detection
Priya Rani | Atul Kr. Ojha

In this paper, we present the system description of Offensive language detection tool which is developed by the KMI_Coling under the OffensEval Shared task. The OffensEval Shared Task was conducted in SemEval 2019 workshop. To develop the system, we have explored n-grams up to 8-gram and trained three different namely A, B and C systems for three different subtasks within the OffensEval task which achieves 79.76%, 87.91% and 44.37% accuracy respectively. The task was completed using the dataset provided to us by the OffensEval organisers was the part of OLID dataset. It consists of 13,240 tweets extracted from twitter and were annotated at three levels using crowdsourcing.

LaSTUS/TALN at SemEval-2019 Task 6: Identification and Categorization of Offensive Language in Social Media with Attention-based Bi-LSTM model
Lutfiye Seda Mut Altin | Àlex Bravo Serrano | Horacio Saggion

We present a bidirectional Long-Short Term Memory network for identifying offensive language in Twitter. Our system has been developed in the context of the SemEval 2019 Task 6 which comprises three different sub-tasks, namely A: Offensive Language Detection, B: Categorization of Offensive Language, C: Offensive Language Target Identification. We used a pre-trained Word Embeddings in tweet data, including information about emojis and hashtags. Our approach achieves good performance in the three sub-tasks.

LTL-UDE at SemEval-2019 Task 6: BERT and Two-Vote Classification for Categorizing Offensiveness
Piush Aggarwal | Tobias Horsmann | Michael Wojatzki | Torsten Zesch

We present results for Subtask A and C of SemEval 2019 Shared Task 6. In Subtask A, we experiment with an embedding representation of postings and use BERT to categorize postings. Our best result reaches the 10th place (out of 103). In Subtask C, we applied a two-vote classification approach with minority fallback, which is placed on the 19th rank (out of 65).

MIDAS at SemEval-2019 Task 6: Identifying Offensive Posts and Targeted Offense from Twitter
Debanjan Mahata | Haimin Zhang | Karan Uppal | Yaman Kumar | Rajiv Ratn Shah | Simra Shahid | Laiba Mehnaz | Sarthak Anand

In this paper we present our approach and the system description for Sub Task A and Sub Task B of SemEval 2019 Task 6: Identifying and Categorizing Offensive Language in Social Media. Sub Task A involves identifying if a given tweet is offensive and Sub Task B involves detecting if an offensive tweet is targeted towards someone (group or an individual). Our models for Sub Task A is based on an ensemble of Convolutional Neural Network and Bidirectional LSTM, whereas for Sub Task B, we rely on a set of heuristics derived from the training data. We provide detailed analysis of the results obtained using the trained models. Our team ranked 5th out of 103 participants in Sub Task A, achieving a macro F1 score of 0.807, and ranked 8th out of 75 participants achieving a macro F1 of 0.695.

Nikolov-Radivchev at SemEval-2019 Task 6: Offensive Tweet Classification with BERT and Ensembles
Alex Nikolov | Victor Radivchev

This paper examines different approaches and models towards offensive tweet classification which were used as a part of the OffensEval 2019 competition. It reviews Tweet preprocessing, techniques for overcoming unbalanced class distribution in the provided test data, and comparison of multiple attempted machine learning models.

NIT_Agartala_NLP_Team at SemEval-2019 Task 6: An Ensemble Approach to Identifying and Categorizing Offensive Language in Twitter Social Media Corpora
Steve Durairaj Swamy | Anupam Jamatia | Björn Gambäck | Amitava Das

The paper describes the systems submitted to OffensEval (SemEval 2019, Task 6) on ‘Identifying and Categorizing Offensive Language in Social Media’ by the ‘NIT_Agartala_NLP_Team’. A Twitter annotated dataset of 13,240 English tweets was provided by the task organizers to train the individual models, with the best results obtained using an ensemble model composed of six different classifiers. The ensemble model produced macro-averaged F1-scores of 0.7434, 0.7078 and 0.4853 on Subtasks A, B, and C, respectively. The paper highlights the overall low predictive nature of various linguistic features and surface level count features, as well as the limitations of a traditional machine learning approach when compared to a Deep Learning counterpart.

NLP@UIOWA at SemEval-2019 Task 6: Classifying the Crass using Multi-windowed CNNs
Jonathan Rusert | Padmini Srinivasan

This paper proposes a system for OffensEval (SemEval 2019 Task 6), which calls for a system to classify offensive language into several categories. Our system is a text based CNN, which learns only from the provided training data. Our system achieves 80 - 90% accuracy for the binary classification problems (offensive vs not offensive and targeted vs untargeted) and 63% accuracy for trinary classification (group vs individual vs other).

NLPR@SRPOL at SemEval-2019 Task 6 and Task 5: Linguistically enhanced deep learning offensive sentence classifier
Alessandro Seganti | Helena Sobol | Iryna Orlova | Hannam Kim | Jakub Staniszewski | Tymoteusz Krumholc | Krystian Koziel

The paper presents a system developed for the SemEval-2019 competition Task 5 hat- Eval Basile et al. (2019) (team name: LU Team) and Task 6 OffensEval Zampieri et al. (2019b) (team name: NLPR@SRPOL), where we achieved 2nd position in Subtask C. The system combines in an ensemble several models (LSTM, Transformer, OpenAI’s GPT, Random forest, SVM) with various embeddings (custom, ELMo, fastText, Universal Encoder) together with additional linguistic features (number of blacklisted words, special characters, etc.). The system works with a multi-tier blacklist and a large corpus of crawled data, annotated for general offensiveness. In the paper we do an extensive analysis of our results and show how the combination of features and embedding affect the performance of the models.

nlpUP at SemEval-2019 Task 6: A Deep Neural Language Model for Offensive Language Detection
Jelena Mitrović | Bastian Birkeneder | Michael Granitzer

This paper presents our submission for the SemEval shared task 6, sub-task A on the identification of offensive language. Our proposed model, C-BiGRU, combines a Convolutional Neural Network (CNN) with a bidirectional Recurrent Neural Network (RNN). We utilize word2vec to capture the semantic similarities between words. This composition allows us to extract long term dependencies in tweets and distinguish between offensive and non-offensive tweets. In addition, we evaluate our approach on a different dataset and show that our model is capable of detecting online aggressiveness in both English and German tweets. Our model achieved a macro F1-score of 79.40% on the SemEval dataset.

Pardeep at SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media using Deep Learning
Pardeep Singh | Satish Chand

The rise of social media has made information exchange faster and easier among the people. However, in recent times, the use of offensive language has seen an upsurge in social media. The main challenge for a service provider is to correctly identify such offensive posts and take necessary action to monitor and control their spread. In this work, we try to address this problem by using sophisticated deep learning techniques like LSTM, Bidirectional LSTM and Bidirectional GRU. Our proposed approach solves 3 different Sub-tasks provided in the SemEval-2019 task 6 which incorporates identification of offensive tweets as well as their categorization. We obtain significantly better results in the leader-board for Sub-task B and decent results for Sub-task A and Subtask C validating the fact that the proposed models can be used for automating the offensive post-detection task in social media.

SINAI at SemEval-2019 Task 6: Incorporating lexicon knowledge into SVM learning to identify and categorize offensive language in social media
Flor Miriam Plaza-del-Arco | M. Dolores Molina-González | Maite Martin | L. Alfonso Ureña-López

Offensive language has an impact across society. The use of social media has aggravated this issue among online users, causing suicides in the worst cases. For this reason, it is important to develop systems capable of identifying and detecting offensive language in text automatically. In this paper, we developed a system to classify offensive tweets as part of our participation in SemEval-2019 Task 6: OffensEval. Our main contribution is the integration of lexical features in the classification using the SVM algorithm.

SSN_NLP at SemEval-2019 Task 6: Offensive Language Identification in Social Media using Traditional and Deep Machine Learning Approaches
Thenmozhi D. | Senthil Kumar B. | Srinethe Sharavanan | Aravindan Chandrabose

Offensive language identification (OLI) in user generated text is automatic detection of any profanity, insult, obscenity, racism or vulgarity that degrades an individual or a group. It is helpful for hate speech detection, flame detection and cyber bullying. Due to immense growth of accessibility to social media, OLI helps to avoid abuse and hurts. In this paper, we present deep and traditional machine learning approaches for OLI. In deep learning approach, we have used bi-directional LSTM with different attention mechanisms to build the models and in traditional machine learning, TF-IDF weighting schemes with classifiers namely Multinomial Naive Bayes and Support Vector Machines with Stochastic Gradient Descent optimizer are used for model building. The approaches are evaluated on the OffensEval@SemEval2019 dataset and our team SSN_NLP submitted runs for three tasks of OffensEval shared task. The best runs of SSN_NLP obtained the F1 scores as 0.53, 0.48, 0.3 and the accuracies as 0.63, 0.84 and 0.42 for the tasks A, B and C respectively. Our approaches improved the base line F1 scores by 12%, 26% and 14% for Task A, B and C respectively.

Stop PropagHate at SemEval-2019 Tasks 5 and 6: Are abusive language classification results reproducible?
Paula Fortuna | Juan Soler-Company | Sérgio Nunes

This paper summarizes the participation of Stop PropagHate team at SemEval 2019. Our approach is based on replicating one of the most relevant works on the literature, using word embeddings and LSTM. After circumventing some of the problems of the original code, we found poor results when applying it to the HatEval contest (F1=0.45). We think this is due mainly to inconsistencies in the data of this contest. Finally, for the OffensEval the classifier performed well (F1=0.74), proving to have a better performance for offense detection than for hate speech.

TECHSSN at SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Tweets using Deep Neural Networks
Angel Suseelan | Rajalakshmi S | Logesh B | Harshini S | Geetika B | Dyaneswaran S | S Milton Rajendram | Mirnalinee T T

Task 6 of SemEval 2019 involves identifying and categorizing offensive language in social media. The systems developed by TECHSSN team uses multi-level classification techniques. We have developed two systems. In the first system, the first level of classification is done by a multi-branch 2D CNN classifier with Google’s pre-trained Word2Vec embedding and the second level of classification by string matching technique supported by offensive and bad words dictionary. The second system uses a multi-branch 1D CNN classifier with Glove pre-trained embedding layer for the first level of classification and string matching for the second level of classification. Input data with a probability of less than 0.70 in the first level are passed on to the second level. The misclassified examples are classified correctly in the second level.

The Titans at SemEval-2019 Task 6: Offensive Language Identification, Categorization and Target Identification
Avishek Garain | Arpan Basu

This system paper is a description of the system submitted to “SemEval-2019 Task 6”, where we had to detect offensive language in Twitter. There were two specific target audiences, immigrants and women. The language of the tweets was English. We were required to first detect whether a tweet contains offensive content, and then we had to find out whether the tweet was targeted against some individual, group or other entity. Finally we were required to classify the targeted audience.

TüKaSt at SemEval-2019 Task 6: Something Old, Something Neu(ral): Traditional and Neural Approaches to Offensive Text Classification
Madeeswaran Kannan | Lukas Stein

We describe our system (TüKaSt) submitted for Task 6: Offensive Language Classification, at SemEval 2019. We developed multiple SVM classifier models that used sentence-level dense vector representations of tweets enriched with sentiment information and term-weighting. Our best results achieved F1 scores of 0.734, 0.660 and 0.465 in the first, second and third sub-tasks respectively. We also describe a neural network model that was developed in parallel but not used during evaluation due to time constraints.

TUVD team at SemEval-2019 Task 6: Offense Target Identification
Elena Shushkevich | John Cardiff | Paolo Rosso

This article presents our approach for detecting a target of offensive messages in Twitter, including Individual, Group and Others classes. The model we have created is an ensemble of simpler models, including Logistic Regression, Naive Bayes, Support Vector Machine and the interpolation between Logistic Regression and Naive Bayes with 0.25 coefficient of interpolation. The model allows us to achieve 0.547 macro F1-score.

UBC-NLP at SemEval-2019 Task 6: Ensemble Learning of Offensive Content With Enhanced Training Data
Arun Rajendran | Chiyu Zhang | Muhammad Abdul-Mageed

We examine learning offensive content on Twitter with limited, imbalanced data. For the purpose, we investigate the utility of using various data enhancement methods with a host of classical ensemble classifiers. Among the 75 participating teams in SemEval-2019 sub-task B, our system ranks 6th (with 0.706 macro F1-score). For sub-task C, among the 65 participating teams, our system ranks 9th (with 0.587 macro F1-score).

UHH-LT at SemEval-2019 Task 6: Supervised vs. Unsupervised Transfer Learning for Offensive Language Detection
Gregor Wiedemann | Eugen Ruppert | Chris Biemann

We present a neural network based approach of transfer learning for offensive language detection. For our system, we compare two types of knowledge transfer: supervised and unsupervised pre-training. Supervised pre-training of our bidirectional GRU-3-CNN architecture is performed as multi-task learning of parallel training of five different tasks. The selected tasks are supervised classification problems from public NLP resources with some overlap to offensive language such as sentiment detection, emoji classification, and aggressive language classification. Unsupervised transfer learning is performed with a thematic clustering of 40M unlabeled tweets via LDA. Based on this dataset, pre-training is performed by predicting the main topic of a tweet. Results indicate that unsupervised transfer from large datasets performs slightly better than supervised training on small ‘near target category’ datasets. In the SemEval Task, our system ranks 14 out of 103 participants.

UM-IU@LING at SemEval-2019 Task 6: Identifying Offensive Tweets Using BERT and SVMs
Jian Zhu | Zuoyu Tian | Sandra Kübler

This paper describes the UM-IU@LING’s system for the SemEval 2019 Task 6: Offens-Eval. We take a mixed approach to identify and categorize hate speech in social media. In subtask A, we fine-tuned a BERT based classifier to detect abusive content in tweets, achieving a macro F1 score of 0.8136 on the test data, thus reaching the 3rd rank out of 103 submissions. In subtasks B and C, we used a linear SVM with selected character n-gram features. For subtask C, our system could identify the target of abuse with a macro F1 score of 0.5243, ranking it 27th out of 65 submissions.

USF at SemEval-2019 Task 6: Offensive Language Detection Using LSTM With Word Embeddings
Bharti Goel | Ravi Sharma

In this paper, we present a system description for the SemEval-2019 Task 6 submitted by our team. For the task, our system takes tweet as an input and determine if the tweet is offensive or non-offensive (Sub-task A). In case a tweet is offensive, our system identifies if a tweet is targeted (insult or threat) or non-targeted like swearing (Sub-task B). In targeted tweets, our system identifies the target as an individual or group (Sub-task C). We used data pre-processing techniques like splitting hashtags into words, removing special characters, stop-word removal, stemming, lemmatization, capitalization, and offensive word dictionary. Later, we used keras tokenizer and word embeddings for feature extraction. For classification, we used the LSTM (Long short-term memory) model of keras framework. Our accuracy scores for Sub-task A, B and C are 0.8128, 0.8167 and 0.3662 respectively. Our results indicate that fine-grained classification to identify offense target was difficult for the system. Lastly, in the future scope section, we will discuss the ways to improve system performance.

UTFPR at SemEval-2019 Task 6: Relying on Compositionality to Find Offense
Gustavo Henrique Paetzold

We present the UTFPR system for the OffensEval shared task of SemEval 2019: A character-to-word-to-sentence compositional RNN model trained exclusively over the training data provided by the organizers. We find that, although not very competitive for the task at hand, it offers a robust solution to the orthographic irregularity inherent to tweets.

UVA Wahoos at SemEval-2019 Task 6: Hate Speech Identification using Ensemble Machine Learning
Murugesan Ramakrishnan | Wlodek Zadrozny | Narges Tabari

With the growth in the usage of social media, it has become increasingly common for people to hide behind a mask and abuse others. We have attempted to detect such tweets and comments that are malicious in intent, which either targets an individual or a group. Our best classifier for identifying offensive tweets for SubTask A (Classifying offensive vs. nonoffensive) has an accuracy of 83.14% and a f1- score of 0.7565 on the actual test data. For SubTask B, to identify if an offensive tweet is targeted (If targeted towards an individual or a group), the classifier performs with an accuracy of 89.17% and f1-score of 0.5885. The paper talks about how we generated linguistic and semantic features to build an ensemble machine learning model. By training with more extracts from different sources (Facebook, and more tweets), the paper shows how the accuracy changes with additional training data.

YNU-HPCC at SemEval-2019 Task 6: Identifying and Categorising Offensive Language on Twitter
Chengjin Zhou | Jin Wang | Xuejie Zhang

This document describes the submission of team YNU-HPCC to SemEval-2019 for three Sub-tasks of Task 6: Sub-task A, Sub-task B, and Sub-task C. We have submitted four systems to identify and categorise offensive language. The first subsystem is an attention-based 2-layer bidirectional long short-term memory (BiLSTM). The second subsystem is a voting ensemble of four different deep learning architectures. The third subsystem is a stacking ensemble of four different deep learning architectures. Finally, the fourth subsystem is a bidirectional encoder representations from transformers (BERT) model. Among our models, in Sub-task A, our first subsystem performed the best, ranking 16th among 103 teams; in Sub-task B, the second subsystem performed the best, ranking 12th among 75 teams; in Sub-task C, the fourth subsystem performed best, ranking 4th among 65 teams.

YNUWB at SemEval-2019 Task 6: K-max pooling CNN with average meta-embedding for identifying offensive language
Bin Wang | Xiaobing Zhou | Xuejie Zhang

This paper describes the system submitted to SemEval 2019 Task 6: OffensEval 2019. The task aims to identify and categorize offensive language in social media, we only participate in Sub-task A, which aims to identify offensive language. In order to address this task, we propose a system based on a K-max pooling convolutional neural network model, and use an argument for averaging as a valid meta-embedding technique to get a metaembedding. Finally, we also use a cyclic learning rate policy to improve model performance. Our model achieves a Macro F1-score of 0.802 (ranked 9/103) in the Sub-task A.

Zeyad at SemEval-2019 Task 6: That’s Offensive! An All-Out Search For An Ensemble To Identify And Categorize Offense in Tweets.
Zeyad El-Zanaty

The objective of this paper is to provide a description for a classification system built for SemEval-2019 Task 6: OffensEval. This system classifies a tweet as either offensive or not offensive (Sub-task A) and further classifies offensive tweets into categories (Sub-tasks B - C). The system consists of two phases; a brute-force grid search to find the best learners amongst a given set and an ensemble of a subset of these best learners. The system achieved an F1-score of 0.728, ranking in subtask A, an F1-score score of 0.616 in subtask B and an F1-score of 0.509 in subtask C.

SemEval-2019 Task 4: Hyperpartisan News Detection
Johannes Kiesel | Maria Mestre | Rishabh Shukla | Emmanuel Vincent | Payam Adineh | David Corney | Benno Stein | Martin Potthast

Hyperpartisan news is news that takes an extreme left-wing or right-wing standpoint. If one is able to reliably compute this meta information, news articles may be automatically tagged, this way encouraging or discouraging readers to consume the text. It is an open question how successfully hyperpartisan news detection can be automated, and the goal of this SemEval task was to shed light on the state of the art. We developed new resources for this purpose, including a manually labeled dataset with 1,273 articles, and a second dataset with 754,000 articles, labeled via distant supervision. The interest of the research community in our task exceeded all our expectations: The datasets were downloaded about 1,000 times, 322 teams registered, of which 184 configured a virtual machine on our shared task cloud service TIRA, of which in turn 42 teams submitted a valid run. The best team achieved an accuracy of 0.822 on a balanced sample (yes : no hyperpartisan) drawn from the manually tagged corpus; an ensemble of the submitted systems increased the accuracy by 0.048.

Team Bertha von Suttner at SemEval-2019 Task 4: Hyperpartisan News Detection using ELMo Sentence Representation Convolutional Network
Ye Jiang | Johann Petrak | Xingyi Song | Kalina Bontcheva | Diana Maynard

This paper describes the participation of team “bertha-von-suttner” in the SemEval2019 task 4 Hyperpartisan News Detection task. Our system uses sentence representations from averaged word embeddings generated from the pre-trained ELMo model with Convolutional Neural Networks and Batch Normalization for predicting hyperpartisan news. The final predictions were generated from the averaged predictions of an ensemble of models. With this architecture, our system ranked in first place, based on accuracy, the official scoring metric.

SemEval-2019 Task 7: RumourEval, Determining Rumour Veracity and Support for Rumours
Genevieve Gorrell | Elena Kochkina | Maria Liakata | Ahmet Aker | Arkaitz Zubiaga | Kalina Bontcheva | Leon Derczynski

Since the first RumourEval shared task in 2017, interest in automated claim validation has greatly increased, as the danger of “fake news” has become a mainstream concern. However automated support for rumour verification remains in its infancy. It is therefore important that a shared task in this area continues to provide a focus for effort, which is likely to increase. Rumour verification is characterised by the need to consider evolving conversations and news updates to reach a verdict on a rumour’s veracity. As in RumourEval 2017 we provided a dataset of dubious posts and ensuing conversations in social media, annotated both for stance and veracity. The social media rumours stem from a variety of breaking news stories and the dataset is expanded to include Reddit as well as new Twitter posts. There were two concrete tasks; rumour stance prediction and rumour verification, which we present in detail along with results achieved by participants. We received 22 system submissions (a 70% increase from RumourEval 2017) many of which used state-of-the-art methodology to tackle the challenges involved.

eventAI at SemEval-2019 Task 7: Rumor Detection on Social Media by Exploiting Content, User Credibility and Propagation Information
Quanzhi Li | Qiong Zhang | Luo Si

This paper describes our system for SemEval 2019 RumorEval: Determining rumor veracity and support for rumors (SemEval 2019 Task 7). This track has two tasks: Task A is to determine a user’s stance towards the source rumor, and Task B is to detect the veracity of the rumor: true, false or unverified. For stance classification, a neural network model with language features is utilized. For rumor verification, our approach exploits information from different dimensions: rumor content, source credibility, user credibility, user stance, event propagation path, etc. We use an ensemble approach in both tasks, which includes neural network models as well as the traditional classification algorithms. Our system is ranked 1st place in the rumor verification task by both the macro F1 measure and the RMSE measure.

SemEval-2019 Task 8: Fact Checking in Community Question Answering Forums
Tsvetomila Mihaylova | Georgi Karadzhov | Pepa Atanasova | Ramy Baly | Mitra Mohtarami | Preslav Nakov

We present SemEval-2019 Task 8 on Fact Checking in Community Question Answering Forums, which features two subtasks. Subtask A is about deciding whether a question asks for factual information vs. an opinion/advice vs. just socializing. Subtask B asks to predict whether an answer to a factual question is true, false or not a proper answer. We received 17 official submissions for subtask A and 11 official submissions for Subtask B. For subtask A, all systems improved over the majority class baseline. For Subtask B, all systems were below a majority class baseline, but several systems were very close to it. The leaderboard and the data from the competition can be found at

AUTOHOME-ORCA at SemEval-2019 Task 8: Application of BERT for Fact-Checking in Community Forums
Zhengwei Lv | Duoxing Liu | Haifeng Sun | Xiao Liang | Tao Lei | Zhizhong Shi | Feng Zhu | Lei Yang

Fact checking is an important task for maintaining high quality posts and improving user experience in Community Question Answering forums. Therefore, the SemEval-2019 task 8 is aimed to identify factual question (subtask A) and detect true factual information from corresponding answers (subtask B). In order to address this task, we propose a system based on the BERT model with meta information of questions. For the subtask A, the outputs of fine-tuned BERT classification model are combined with the feature of length of questions to boost the performance. For the subtask B, the predictions of several variants of BERT model encoding the meta information are combined to create an ensemble model. Our system achieved competitive results with an accuracy of 0.82 in the subtask A and 0.83 in the subtask B. The experimental results validate the effectiveness of our system.

SemEval-2019 Task 9: Suggestion Mining from Online Reviews and Forums
Sapna Negi | Tobias Daudert | Paul Buitelaar

We present the pilot SemEval task on Suggestion Mining. The task consists of subtasks A and B, where we created labeled data from feedback forum and hotel reviews respectively. Subtask A provides training and test data from the same domain, while Subtask B evaluates the system on a test dataset from a different domain than the available training data. 33 teams participated in the shared task, with a total of 50 members. We summarize the problem definition, benchmark dataset preparation, and methods used by the participating teams, providing details of the methods used by the top ranked systems. The dataset is made freely available to help advance the research in suggestion mining, and reproduce the systems submitted under this task

m_y at SemEval-2019 Task 9: Exploring BERT for Suggestion Mining
Masahiro Yamamoto | Toshiyuki Sekiya

This paper presents our system to the SemEval-2019 Task 9, Suggestion Mining from Online Reviews and Forums. The goal of this task is to extract suggestions such as the expressions of tips, advice, and recommendations. We explore Bidirectional Encoder Representations from Transformers (BERT) focusing on target domain pre-training in Subtask A which provides training and test datasets in the same domain. In Subtask B, the cross domain suggestion mining task, we apply the idea of distant supervision. Our system obtained the third place in Subtask A and the fifth place in Subtask B, which demonstrates its efficacy of our approaches.

SemEval-2019 Task 10: Math Question Answering
Mark Hopkins | Ronan Le Bras | Cristian Petrescu-Prahova | Gabriel Stanovsky | Hannaneh Hajishirzi | Rik Koncel-Kedziorski

We report on the SemEval 2019 task on math question answering. We provided a question set derived from Math SAT practice exams, including 2778 training questions and 1082 test questions. For a significant subset of these questions, we also provided SMT-LIB logical form annotations and an interpreter that could solve these logical forms. Systems were evaluated based on the percentage of correctly answered questions. The top system correctly answered 45% of the test questions, a considerable improvement over the 17% random guessing baseline.

AiFu at SemEval-2019 Task 10: A Symbolic and Sub-symbolic Integrated System for SAT Math Question Answering
Yifan Liu | Keyu Ding | Yi Zhou

AiFu has won the first place in the SemEval-2019 Task 10 - ”Math Question Answering”competition. This paper is to describe how it works technically and to report and analyze some essential experimental results

SemEval-2019 Task 12: Toponym Resolution in Scientific Papers
Davy Weissenbacher | Arjun Magge | Karen O’Connor | Matthew Scotch | Graciela Gonzalez-Hernandez

We present the SemEval-2019 Task 12 which focuses on toponym resolution in scientific articles. Given an article from PubMed, the task consists of detecting mentions of names of places, or toponyms, and mapping the mentions to their corresponding entries in, a database of geospatial locations. We proposed three subtasks. In Subtask 1, we asked participants to detect all toponyms in an article. In Subtask 2, given toponym mentions as input, we asked participants to disambiguate them by linking them to entries in GeoNames. In Subtask 3, we asked participants to perform both the detection and the disambiguation steps for all toponyms. A total of 29 teams registered, and 8 teams submitted a system run. We summarize the corpus and the tools created for the challenge. They are freely available at We also analyze the methods, the results and the errors made by the competing systems with a focus on toponym disambiguation.

DM_NLP at SemEval-2018 Task 12: A Pipeline System for Toponym Resolution
Xiaobin Wang | Chunping Ma | Huafei Zheng | Chu Liu | Pengjun Xie | Linlin Li | Luo Si

This paper describes DM-NLP’s system for toponym resolution task at Semeval 2019. Our system was developed for toponym detection, disambiguation and end-to-end resolution which is a pipeline of the former two. For toponym detection, we utilized the state-of-the-art sequence labeling model, namely, BiLSTM-CRF model as backbone. A lot of strategies are adopted for further improvement, such as pre-training, model ensemble, model averaging and data augment. For toponym disambiguation, we adopted the widely used searching and ranking framework. For ranking, we proposed several effective features for measuring the consistency between the detected toponym and toponyms in GeoNames. Eventually, our system achieved the best performance among all the submitted results in each sub task.

Brenda Starr at SemEval-2019 Task 4: Hyperpartisan News Detection
Olga Papadopoulou | Giorgos Kordopatis-Zilos | Markos Zampoglou | Symeon Papadopoulos | Yiannis Kompatsiaris

In the effort to tackle the challenge of Hyperpartisan News Detection, i.e., the task of deciding whether a news article is biased towards one party, faction, cause, or person, we experimented with two systems: i) a standard supervised learning approach using superficial text and bag-of-words features from the article title and body, and ii) a deep learning system comprising a four-layer convolutional neural network and max-pooling layers after the embedding layer, feeding the consolidated features to a bi-directional recurrent neural network. We achieved an F-score of 0.712 with our best approach, which corresponds to the mid-range of performance levels in the leaderboard.

Cardiff University at SemEval-2019 Task 4: Linguistic Features for Hyperpartisan News Detection
Carla Pérez-Almendros | Luis Espinosa-Anke | Steven Schockaert

This paper summarizes our contribution to the Hyperpartisan News Detection task in SemEval 2019. We experiment with two different approaches: 1) an SVM classifier based on word vector averages and hand-crafted linguistic features, and 2) a BiLSTM-based neural text classifier trained on a filtered training set. Surprisingly, despite their different nature, both approaches achieve an accuracy of 0.74. The main focus of this paper is to further analyze the remarkable fact that a simple feature-based approach can perform on par with modern neural classifiers. We also highlight the effectiveness of our filtering strategy for training the neural network on a large but noisy training set.

Clark Kent at SemEval-2019 Task 4: Stylometric Insights into Hyperpartisan News Detection
Viresh Gupta | Baani Leen Kaur Jolly | Ramneek Kaur | Tanmoy Chakraborty

In this paper, we present a news bias prediction system, which we developed as part of a SemEval 2019 task. We developed an XGBoost based system which uses character and word level n-gram features represented using TF-IDF, count vector based correlation matrix, and predicts if an input news article is a hyperpartisan news article. Our model was able to achieve a precision of 68.3% on the test set provided by the contest organizers. We also run our model on the BuzzFeed corpus and find XGBoost with simple character level N-Gram embeddings to be performing well with an accuracy of around 96%.

Dick-Preston and Morbo at SemEval-2019 Task 4: Transfer Learning for Hyperpartisan News Detection
Tim Isbister | Fredrik Johansson

In a world of information operations, influence campaigns, and fake news, classification of news articles as following hyperpartisan argumentation or not is becoming increasingly important. We present a deep learning-based approach in which a pre-trained language model has been fine-tuned on domain-specific data and used for classification of news articles, as part of the SemEval-2019 task on hyperpartisan news detection. The suggested approach yields accuracy and F1-scores around 0.8 which places the best performing classifier among the top-5 systems in the competition.

Doris Martin at SemEval-2019 Task 4: Hyperpartisan News Detection with Generic Semi-supervised Features
Rodrigo Agerri

In this paper we describe our participation to the Hyperpartisan News Detection shared task at SemEval 2019. Motivated by the late arrival of Doris Martin, we test a previously developed document classification system which consists of a combination of clustering features implemented on top of some simple shallow local features. We show how leveraging distributional features obtained from large in-domain unlabeled data helps to easily and quickly develop a reasonably good performing system for detecting hyperpartisan news. The system and models generated for this task are publicly available.

Duluth at SemEval-2019 Task 4: The Pioquinto Manterola Hyperpartisan News Detector
Saptarshi Sengupta | Ted Pedersen

This paper describes the Pioquinto Manterola Hyperpartisan News Detector, which participated in SemEval-2019 Task 4. Hyperpartisan news is highly polarized and takes a very biased or one–sided view of a particular story. We developed two variants of our system, the more successful was a Logistic Regression classifier based on unigram features. This was our official entry in the task, and it placed 23rd of 42 participating teams. Our second variant was a Convolutional Neural Network that did not perform as well.

Fermi at SemEval-2019 Task 4: The sarah-jane-smith Hyperpartisan News Detector
Nikhil Chakravartula | Vijayasaradhi Indurthi | Bakhtiyar Syed

This paper describes our system (Fermi) for Task 4: Hyper-partisan News detection of SemEval-2019. We use simple text classification algorithms by transforming the input features to a reduced feature set. We aim to find the right number of features useful for efficient classification and explore multiple training models to evaluate the performance of these text classification algorithms. Our team - Fermi’s model achieved an accuracy of 59.10% and an F1 score of 69.5% on the official test data set. In this paper, we provide a detailed description of the approach as well as the results obtained in the task.

Harvey Mudd College at SemEval-2019 Task 4: The Carl Kolchak Hyperpartisan News Detector
Celena Chen | Celine Park | Jason Dwyer | Julie Medero

We use various natural processing and machine learning methods to perform the Hyperpartisan News Detection task. In particular, some of the features we look at are bag-of-words features, the title’s length, number of capitalized words in the title, and the sentiment of the sentences and the title. By adding these features, we see improvements in our evaluation metrics compared to the baseline values. We find that sentiment analysis helps improve our evaluation metrics. We do not see a benefit from feature selection. Overall, our system achieves an accuracy of 0.739, finishing 18th out of 42 submissions to the task. From our work, it is evident that both title features and sentiment of articles are meaningful to the hyperpartisanship of news articles.

Harvey Mudd College at SemEval-2019 Task 4: The Clint Buchanan Hyperpartisan News Detector
Mehdi Drissi | Pedro Sandoval Segura | Vivaswat Ojha | Julie Medero

We investigate the recently developed Bidi- rectional Encoder Representations from Transformers (BERT) model (Devlin et al. 2018) for the hyperpartisan news detection task. Using a subset of hand-labeled articles from SemEval as a validation set, we test the performance of different parameters for BERT models. We find that accuracy from two different BERT models using different proportions of the articles is consistently high, with our best-performing model on the validation set achieving 85% accuracy and the best-performing model on the test set achieving 77%. We further determined that our model exhibits strong consistency, labeling independent slices of the same article identically. Finally, we find that randomizing the order of word pieces dramatically reduces validation accuracy (to approximately 60%), but that shuffling groups of four or more word pieces maintains an accuracy of about 80%, indicating the model mainly gains value from local context.

Harvey Mudd College at SemEval-2019 Task 4: The D.X. Beaumont Hyperpartisan News Detector
Evan Amason | Jake Palanker | Mary Clare Shen | Julie Medero

We use the 600 hand-labelled articles from SemEval Task 4 to hand-tune a classifier with 3000 features for the Hyperpartisan News Detection task. Our final system uses features based on bag-of-words (BoW), analysis of the article title, language complexity, and simple sentiment analysis in a naive Bayes classifier. We trained our final system on the 600,000 articles labelled by publisher. Our final system has an accuracy of 0.653 on the hand-labeled test set. The most effective features are the Automated Readability Index and the presence of certain words in the title. This suggests that hyperpartisan writing uses a distinct writing style, especially in the title.

NLP@UIT at SemEval-2019 Task 4: The Paparazzo Hyperpartisan News Detector
Duc-Vu Nguyen | Thin Dang | Ngan Nguyen

This paper describes the system of NLP@UIT that participated in Task 4 of SemEval-2019. We developed a system that predicts whether an English news article follows a hyperpartisan argumentation. Paparazzo is the name of our system and is also the code name of our team in Task 4 of SemEval-2019. The Paparazzo system, in which we use tri-grams of words and hepta-grams of characters, officially ranks thirteen with an accuracy of 0.747. Another system of ours, which utilizes trigrams of words, tri-grams of characters, trigrams of part-of-speech, syntactic dependency sub-trees, and named-entity recognition tags, achieved an accuracy of 0.787 and is proposed after the deadline of Task 4.

Orwellian-times at SemEval-2019 Task 4: A Stylistic and Content-based Classifier
Jürgen Knauth

While fake news detection received quite a bit of attention in recent years, hyperpartisan news detection is still an underresearched topic. This paper presents our work towards building a classification system for hyperpartisan news detection in the context of the SemEval2019 shared task 4. We experiment with two different approaches - a more stylistic one, and a more content related one - achieving average results.

Rouletabille at SemEval-2019 Task 4: Neural Network Baseline for Identification of Hyperpartisan Publishers
Jose G. Moreno | Yoann Pitarch | Karen Pinel-Sauvagnat | Gilles Hubert

This paper describes the Rouletabille participation to the Hyperpartisan News Detection task. We propose the use of different text classification methods for this task. Preliminary experiments using a similar collection used in (Potthast et al., 2018) show that neural-based classification methods reach state-of-the art results. Our final submission is composed of a unique run that ranks among all runs at 3/49 position for the by-publisher test dataset and 43/96 for the by-article test dataset in terms of Accuracy.

Spider-Jerusalem at SemEval-2019 Task 4: Hyperpartisan News Detection
Amal Alabdulkarim | Tariq Alhindi

This paper describes our system for detecting hyperpartisan news articles, which was submitted for the shared task in SemEval 2019 on Hyperpartisan News Detection. We developed a Support Vector Machine (SVM) model that uses TF-IDF of tokens, Language Inquiry and Word Count (LIWC) features, and structural features such as number of paragraphs and hyperlink count in an article. The model was trained on 645 articles from two classes: mainstream and hyperpartisan. Our system was ranked seventeenth out of forty two participating teams in the binary classification task with an accuracy score of 0.742 on the blind test set (the accuracy of the top ranked system was 0.822). We provide a detailed description of our preprocessing steps, discussion of our experiments using different combinations of features, and analysis of our results and prediction errors.

Steve Martin at SemEval-2019 Task 4: Ensemble Learning Model for Detecting Hyperpartisan News
Youngjun Joo | Inchon Hwang

This paper describes our submission to task 4 in SemEval 2019, i.e., hyperpartisan news detection. Our model aims at detecting hyperpartisan news by incorporating the style-based features and the content-based features. We extract a broad number of feature sets and use as our learning algorithms the GBDT and the n-gram CNN model. Finally, we apply the weighted average for effective learning between the two models. Our model achieves an accuracy of 0.745 on the test set in subtask A.

TakeLab at SemEval-2019 Task 4: Hyperpartisan News Detection
Niko Palić | Juraj Vladika | Dominik Čubelić | Ivan Lovrenčić | Maja Buljan | Jan Šnajder

In this paper, we demonstrate the system built to solve the SemEval-2019 task 4: Hyperpartisan News Detection (Kiesel et al., 2019), the task of automatically determining whether an article is heavily biased towards one side of the political spectrum. Our system receives an article in its raw, textual form, analyzes it, and predicts with moderate accuracy whether the article is hyperpartisan. The learning model used was primarily trained on a manually prelabeled dataset containing news articles. The system relies on the previously constructed SVM model, available in the Python Scikit-Learn library. We ranked 6th in the competition of 42 teams with an accuracy of 79.1% (the winning team had 82.2%).

Team Fernando-Pessa at SemEval-2019 Task 4: Back to Basics in Hyperpartisan News Detection
André Cruz | Gil Rocha | Rui Sousa-Silva | Henrique Lopes Cardoso

This paper describes our submission to the SemEval 2019 Hyperpartisan News Detection task. Our system aims for a linguistics-based document classification from a minimal set of interpretable features, while maintaining good performance. To this goal, we follow a feature-based approach and perform several experiments with different machine learning classifiers. Additionally, we explore feature importances and distributions among the two classes. On the main task, our model achieved an accuracy of 71.7%, which was improved after the task’s end to 72.9%. We also participate on the meta-learning sub-task, for classifying documents with the binary classifications of all submitted systems as input, achieving an accuracy of 89.9%.

Team Harry Friberg at SemEval-2019 Task 4: Identifying Hyperpartisan News through Editorially Defined Metatopics
Nazanin Afsarmanesh | Jussi Karlgren | Peter Sumbler | Nina Viereckel

This report describes the starting point for a simple rule based hypothesis testing excercise on identifying hyperpartisan news items carried out by the Harry Friberg team from Gavagai. We used manually crafted metatopics, topics which often appear in hyperpartisan texts as rant conduits, together with tonality analysis to identify general characteristics of hyperpartisan news items. While the precision of the resulting effort is less than stellar— our contribution ranked 37th of the 42 successfully submitted experiments with overly high recall (95%) and low precision (54%)—we believe we have a model which allows us to continue exploring the underlying features of what the subgenre of hyperpartisan news items is characterised by.

Team Howard Beale at SemEval-2019 Task 4: Hyperpartisan News Detection with BERT
Osman Mutlu | Ozan Arkan Can | Erenay Dayanik

This paper describes our system for SemEval-2019 Task 4: Hyperpartisan News Detection (Kiesel et al., 2019). We use pretrained BERT (Devlin et al., 2018) architecture and investigate the effect of different fine tuning regimes on the final classification task. We show that additional pretraining on news domain improves the performance on the Hyperpartisan News Detection task. Our system ranked 8th out of 42 teams with 78.3% accuracy on the held-out test dataset.

Team Jack Ryder at SemEval-2019 Task 4: Using BERT Representations for Detecting Hyperpartisan News
Daniel Shaprin | Giovanni Da San Martino | Alberto Barrón-Cedeño | Preslav Nakov

We describe the system submitted by the Jack Ryder team to SemEval-2019 Task 4 on Hyperpartisan News Detection. The task asked participants to predict whether a given article is hyperpartisan, i.e., extreme-left or extreme-right. We proposed an approach based on BERT with fine-tuning, which was ranked 7th out 28 teams on the distantly supervised dataset, where all articles from a hyperpartisan/non-hyperpartisan news outlet are considered to be hyperpartisan/non-hyperpartisan. On a manually annotated test dataset, where human annotators double-checked the labels, we were ranked 29th out of 42 teams.

Team Kermit-the-frog at SemEval-2019 Task 4: Bias Detection Through Sentiment Analysis and Simple Linguistic Features
Talita Anthonio | Lennart Kloppenburg

In this paper we describe our participation in the SemEval 2019 shared task on hyperpartisan news detection. We present the system that we submitted for final evaluation and the three approaches that we used: sentiment, bias-laden words and filtered n-gram features. Our submitted model is a Linear SVM that solely relies on the negative sentiment of a document. We achieved an accuracy of 0.621 and a f1 score of 0.694 in the competition, revealing the predictive power of negative sentiment for this task. There was no major improvement by adding or substituting the features of the other two approaches that we tried.

Team Kit Kittredge at SemEval-2019 Task 4: LSTM Voting System
Rebekah Cramerus | Tatjana Scheffler

This paper describes the approach of team Kit Kittredge to SemEval-2019 Task 4: Hyperpartisan News Detection. The goal was binary classification of news articles into the categories of “biased” or “unbiased”. We had two software submissions: one a simple bag-of-words model, and the second an LSTM (Long Short Term Memory) neural network, which was trained on a subset of the original dataset selected by a voting system of other LSTMs. This method did not prove much more successful than the baseline, however, due to the models’ tendency to learn publisher-specific traits instead of general bias.

Team Ned Leeds at SemEval-2019 Task 4: Exploring Language Indicators of Hyperpartisan Reporting
Bozhidar Stevanoski | Sonja Gievska

This paper reports an experiment carried out to investigate the relevance of several syntactic, stylistic and pragmatic features on the task of distinguishing between mainstream and partisan news articles. The results of the evaluation of different feature sets and the extent to which various feature categories could affect the performance metrics are discussed and compared. Among different combinations of features and classifiers, Random Forest classifier using vector representations of the headline and the text of the report, with the inclusion of 8 readability scores and few stylistic features yielded best result, ranking our team at the 9th place at the SemEval 2019 Hyperpartisan News Detection challenge.

Team Peter Brinkmann at SemEval-2019 Task 4: Detecting Biased News Articles Using Convolutional Neural Networks
Michael Färber | Agon Qurdina | Lule Ahmedi

In this paper, we present an approach for classifying news articles as biased (i.e., hyperpartisan) or unbiased, based on a convolutional neural network. We experiment with various embedding methods (pretrained and trained on the training dataset) and variations of the convolutional neural network architecture and compare the results. When evaluating our best performing approach on the actual test data set of the SemEval 2019 Task 4, we obtained relatively low precision and accuracy values, while gaining the highest recall rate among all 42 participating teams.

Team Peter-Parker at SemEval-2019 Task 4: BERT-Based Method in Hyperpartisan News Detection
Zhiyuan Ning | Yuanzhen Lin | Ruichao Zhong

This paper describes the team peter-parker’s participation in Hyperpartisan News Detection task (SemEval-2019 Task 4), which requires to classify whether a given news article is bias or not. We decided to use JAVA to do the article parsing tool and the BERT-Based model to do the bias prediction. Furthermore, we will show experiment results with analysis.

Team QCRI-MIT at SemEval-2019 Task 4: Propaganda Analysis Meets Hyperpartisan News Detection
Abdelrhman Saleh | Ramy Baly | Alberto Barrón-Cedeño | Giovanni Da San Martino | Mitra Mohtarami | Preslav Nakov | James Glass

We describe our submission to SemEval-2019 Task 4 on Hyperpartisan News Detection. We rely on a variety of engineered features originally used to detect propaganda. This is based on the assumption that biased messages are propagandistic and promote a particular political cause or viewpoint. In particular, we trained a logistic regression model with features ranging from simple bag of words to vocabulary richness and text readability. Our system achieved 72.9% accuracy on the manually annotated testset, and 60.8% on the test data that was obtained with distant supervision. Additional experiments showed that significant performance gains can be achieved with better feature pre-processing.

Team Xenophilius Lovegood at SemEval-2019 Task 4: Hyperpartisanship Classification using Convolutional Neural Networks
Albin Zehe | Lena Hettinger | Stefan Ernst | Christian Hauptmann | Andreas Hotho

This paper describes our system for the SemEval 2019 Task 4 on hyperpartisan news detection. We build on an existing deep learning approach for sentence classification based on a Convolutional Neural Network. Modifying the original model with additional layers to increase its expressiveness and finally building an ensemble of multiple versions of the model, we obtain an accuracy of 67.52% and an F1 score of 73.78% on the main test dataset. We also report on additional experiments incorporating handcrafted features into the CNN and using it as a feature extractor for a linear SVM.

Team yeon-zi at SemEval-2019 Task 4: Hyperpartisan News Detection by De-noising Weakly-labeled Data
Nayeon Lee | Zihan Liu | Pascale Fung

This paper describes our system that has been submitted to SemEval-2019 Task 4: Hyperpartisan News Detection. We focus on removing the noise inherent in the hyperpartisanship dataset from both data-level and model-level by leveraging semi-supervised pseudo-labels and the state-of-the-art BERT model. Our model achieves 75.8% accuracy in the final by-article dataset without ensemble learning.

The Sally Smedley Hyperpartisan News Detector at SemEval-2019 Task 4
Kazuaki Hanawa | Shota Sasaki | Hiroki Ouchi | Jun Suzuki | Kentaro Inui

This paper describes our system submitted to the formal run of SemEval-2019 Task 4: Hyperpartisan news detection. Our system is based on a linear classifier using several features, i.e., 1) embedding features based on the pre-trained BERT embeddings, 2) article length features, and 3) embedding features of informative phrases extracted from by-publisher dataset. Our system achieved 80.9% accuracy on the test set for the formal run and got the 3rd place out of 42 teams.

Tintin at SemEval-2019 Task 4: Detecting Hyperpartisan News Article with only Simple Tokens
Yves Bestgen

Tintin, the system proposed by the CECL for the Hyperpartisan News Detection task of SemEval 2019, is exclusively based on the tokens that make up the documents and a standard supervised learning procedure. It obtained very contrasting results: poor on the main task, but much more effective at distinguishing documents published by hyperpartisan media outlets from unbiased ones, as it ranked first. An analysis of the most important features highlighted the positive aspects, but also some potential limitations of the approach.

Tom Jumbo-Grumbo at SemEval-2019 Task 4: Hyperpartisan News Detection with GloVe vectors and SVM
Chia-Lun Yeh | Babak Loni | Anne Schuth

In this paper, we describe our attempt to learn bias from news articles. From our experiments, it seems that although there is a correlation between publisher bias and article bias, it is challenging to learn bias directly from the publisher labels. On the other hand, using few manually-labeled samples can increase the accuracy metric from around 60% to near 80%. Our system is computationally inexpensive and uses several standard document representations in NLP to train an SVM or LR classifier. The system ranked 4th in the SemEval-2019 task. The code is released for reproducibility.

UBC-NLP at SemEval-2019 Task 4: Hyperpartisan News Detection With Attention-Based Bi-LSTMs
Chiyu Zhang | Arun Rajendran | Muhammad Abdul-Mageed

We present our deep learning models submitted to the SemEval-2019 Task 4 competition focused at Hyperpartisan News Detection. We acquire best results with a Bi-LSTM network equipped with a self-attention mechanism. Among 33 participating teams, our submitted system ranks top 7 (65.3% accuracy) on the ‘labels-by-publisher’ sub-task and top 24 out of 44 teams (68.3% accuracy) on the ‘labels-by-article’ sub-task (65.3% accuracy). We also report a model that scores higher than the 8th ranking system (78.5% accuracy) on the ‘labels-by-article’ sub-task.

Vernon-fenwick at SemEval-2019 Task 4: Hyperpartisan News Detection using Lexical and Semantic Features
Vertika Srivastava | Ankita Gupta | Divya Prakash | Sudeep Kumar Sahoo | Rohit R.R | Yeon Hyang Kim

In this paper, we present our submission for SemEval-2019 Task 4: Hyperpartisan News Detection. Hyperpartisan news articles are sharply polarized and extremely biased (onesided). It shows blind beliefs, opinions and unreasonable adherence to a party, idea, faction or a person. Through this task, we aim to develop an automated system that can be used to detect hyperpartisan news and serve as a prescreening technique for fake news detection. The proposed system jointly uses a rich set of handcrafted textual and semantic features. Our system achieved 2nd rank on the primary metric (82.0% accuracy) and 1st rank on the secondary metric (82.1% F1-score), among all participating teams. Comparison with the best performing system on the leaderboard shows that our system is behind by only 0.2% absolute difference in accuracy.

AndrejJan at SemEval-2019 Task 7: A Fusion Approach for Exploring the Key Factors pertaining to Rumour Analysis
Andrej Janchevski | Sonja Gievska

The viral spread of false, unverified and misleading information on the Internet has attracted a heightened attention of an interdisciplinary research community on the phenomenon. This paper contributes to the research efforts of automatically determining the veracity of rumourous tweets and classifying their replies according to stance. Our research objective was to investigate the interplay between a number of phenomenological and contextual features of rumours, in particular, we explore the extent to which network structural characteristics, metadata and user profiles could complement the linguistic analysis of the written content for the task at hand. The current findings strongly demonstrate that supplementary sources of information play significant role in classifying the veracity and the stance of Twitter interactions deemed to be rumourous.

BLCU_NLP at SemEval-2019 Task 7: An Inference Chain-based GPT Model for Rumour Evaluation
Ruoyao Yang | Wanying Xie | Chunhua Liu | Dong Yu

Researchers have been paying increasing attention to rumour evaluation due to the rapid spread of unsubstantiated rumours on social media platforms, including SemEval 2019 task 7. However, labelled data for learning rumour veracity is scarce, and labels in rumour stance data are highly disproportionate, making it challenging for a model to perform supervised-learning adequately. We propose an inference chain-based system, which fully utilizes conversation structure-based knowledge in the limited data and expand the training data in minority categories to alleviate class imbalance. Our approach obtains 12.6% improvement upon the baseline system for subtask A, ranks 1st among 21 systems in subtask A, and ranks 4th among 12 systems in subtask B.

BUT-FIT at SemEval-2019 Task 7: Determining the Rumour Stance with Pre-Trained Deep Bidirectional Transformers
Martin Fajcik | Pavel Smrz | Lukas Burget

This paper describes our system submitted to SemEval 2019 Task 7: RumourEval 2019: Determining Rumour Veracity and Support for Rumours, Subtask A (Gorrell et al., 2019). The challenge focused on classifying whether posts from Twitter and Reddit support, deny, query, or comment a hidden rumour, truthfulness of which is the topic of an underlying discussion thread. We formulate the problem as a stance classification, determining the rumour stance of a post with respect to the previous thread post and the source thread post. The recent BERT architecture was employed to build an end-to-end system which has reached the F1 score of 61.67 % on the provided test data. Without any hand-crafted feature, the system finished at the 2nd place in the competition, only 0.2 % behind the winner.

CLEARumor at SemEval-2019 Task 7: ConvoLving ELMo Against Rumors
Ipek Baris | Lukas Schmelzeisen | Steffen Staab

This paper describes our submission to SemEval-2019 Task 7: RumourEval: Determining Rumor Veracity and Support for Rumors. We participated in both subtasks. The goal of subtask A is to classify the type of interaction between a rumorous social media post and a reply post as support, query, deny, or comment. The goal of subtask B is to predict the veracity of a given rumor. For subtask A, we implement a CNN-based neural architecture using ELMo embeddings of post text combined with auxiliary features and achieve a F1-score of 44.6%. For subtask B, we employ a MLP neural network leveraging our estimates for subtask A and achieve a F1-score of 30.1% (second place in the competition). We provide results and analysis of our system performance and present ablation experiments.

Columbia at SemEval-2019 Task 7: Multi-task Learning for Stance Classification and Rumour Verification
Zhuoran Liu | Shivali Goel | Mukund Yelahanka Raghuprasad | Smaranda Muresan

The paper presents Columbia team’s participation in the SemEval 2019 Shared Task 7: RumourEval 2019. Detecting rumour on social networks has been a focus of research in recent years. Previous work suffered from data sparsity, which potentially limited the application of more sophisticated neural architecture to this task. We mitigate this problem by proposing a multi-task learning approach together with language model fine-tuning. Our attention-based model allows different tasks to leverage different level of information. Our system ranked 6th overall with an F1-score of 36.25 on stance classification and F1 of 22.44 on rumour verification.

GWU NLP at SemEval-2019 Task 7: Hybrid Pipeline for Rumour Veracity and Stance Classification on Social Media
Sardar Hamidian | Mona Diab

Social media plays a crucial role as the main resource news for information seekers online. However, the unmoderated feature of social media platforms lead to the emergence and spread of untrustworthy contents which harm individuals or even societies. Most of the current automated approaches for automatically determining the veracity of a rumor are not generalizable for novel emerging topics. This paper describes our hybrid system comprising rules and a machine learning model which makes use of replied tweets to identify the veracity of the source tweet. The proposed system in this paper achieved 0.435 F-Macro in stance classification, and 0.262 F-macro and 0.801 RMSE in rumor verification tasks in Task7 of SemEval 2019.

SINAI-DL at SemEval-2019 Task 7: Data Augmentation and Temporal Expressions
Miguel A. García-Cumbreras | Salud María Jiménez-Zafra | Arturo Montejo-Ráez | Manuel Carlos Díaz-Galiano | Estela Saquete

This paper describes the participation of the SINAI-DL team at RumourEval (Task 7 in SemEval 2019, subtask A: SDQC). SDQC addresses the challenge of rumour stance classification as an indirect way of identifying potential rumours. Given a tweet with several replies, our system classifies each reply into either supporting, denying, questioning or commenting on the underlying rumours. We have applied data augmentation, temporal expressions labelling and transfer learning with a four-layer neural classifier. We achieve an accuracy of 0.715 with the official run over reply tweets.

UPV-28-UNITO at SemEval-2019 Task 7: Exploiting Post’s Nesting and Syntax Information for Rumor Stance Classification
Bilal Ghanem | Alessandra Teresa Cignarella | Cristina Bosco | Paolo Rosso | Francisco Manuel Rangel Pardo

In the present paper we describe the UPV-28-UNITO system’s submission to the RumorEval 2019 shared task. The approach we applied for addressing both the subtasks of the contest exploits both classical machine learning algorithms and word embeddings, and it is based on diverse groups of features: stylistic, lexical, emotional, sentiment, meta-structural and Twitter-based. A novel set of features that take advantage of the syntactic information in texts is moreover introduced in the paper.

BLCU_NLP at SemEval-2019 Task 8: A Contextual Knowledge-enhanced GPT Model for Fact Checking
Wanying Xie | Mengxi Que | Ruoyao Yang | Chunhua Liu | Dong Yu

Since the resources of Community Question Answering are abundant and information sharing becomes universal, it will be increasingly difficult to find factual information for questioners in massive messages. SemEval 2019 task 8 is focusing on these issues. We participate in the task and use Generative Pre-trained Transformer (OpenAI GPT) as our system. Our innovations are data extension, feature extraction, and input transformation. For contextual knowledge enhancement, we extend the training set of subtask A, use several features to improve the results of our system and adapt the input formats to be more suitable for this task. We demonstrate the effectiveness of our approaches, which achieves 81.95% of subtask A and 61.08% of subtask B in accuracy on the SemEval 2019 task 8.

CodeForTheChange at SemEval-2019 Task 8: Skip-Thoughts for Fact Checking in Community Question Answering
Adithya Avvaru | Anupam Pandey

The strengths of the scalable gradient tree boosting algorithm, XGBoost and distributed sentence encoder, Skip-Thought Vectors are not explored yet by the cQA research community. We tried to apply and combine these two effective methods for finding factual nature of the questions and answers. The work also include experimentation with other popular classifier models like AdaBoost Classifier, DecisionTree Classifier, RandomForest Classifier, ExtraTrees Classifier, XGBoost Classifier and Multi-layer Neural Network. In this paper, we present the features used, approaches followed for feature engineering, models experimented with and finally the results.

ColumbiaNLP at SemEval-2019 Task 8: The Answer is Language Model Fine-tuning
Tuhin Chakrabarty | Smaranda Muresan

Community Question Answering forums are very popular nowadays, as they represent effective means for communities to share information around particular topics. But the information shared on these forums are often not authentic. This paper presents the ColumbiaNLP submission for the SemEval-2019 Task 8: Fact-Checking in Community Question Answering Forums. We show how fine-tuning a language model on a large unannotated corpus of old threads from Qatar Living forum helps us to classify question types (factual, opinion, socializing) and to judge the factuality of answers on the shared task labeled data from the same forum. Our system finished 4th and 2nd on Subtask A (question type classification) and B (answer factuality prediction), respectively, based on the official metric of accuracy.

DOMLIN at SemEval-2019 Task 8: Automated Fact Checking exploiting Ratings in Community Question Answering Forums
Dominik Stammbach | Stalin Varanasi | Guenter Neumann

In the following, we describe our system developed for the Semeval2019 Task 8. We fine-tuned a BERT checkpoint on the qatar living forum dump and used this checkpoint to train a number of models. Our hand-in for subtask A consists of a fine-tuned classifier from this BERT checkpoint. For subtask B, we first have a classifier deciding whether a comment is factual or non-factual. If it is factual, we retrieve intra-forum evidence and using this evidence, have a classifier deciding the comment’s veracity. We trained this classifier on ratings which we crawled from

DUTH at SemEval-2019 Task 8: Part-Of-Speech Features for Question Classification
Anastasios Bairaktaris | Symeon Symeonidis | Avi Arampatzis

This report describes the methods employed by the Democritus University of Thrace (DUTH) team for participating in SemEval-2019 Task 8: Fact Checking in Community Question Answering Forums. Our team dealt only with Subtask A: Question Classification. Our approach was based on shallow natural language processing (NLP) pre-processing techniques to reduce the noise in data, feature selection methods, and supervised machine learning algorithms such as NearestCentroid, Perceptron, and LinearSVC. To determine the essential features, we were aided by exploratory data analysis and visualizations. In order to improve classification accuracy, we developed a customized list of stopwords, retaining some opinion- and fact-denoting common function words which would have been removed by standard stoplisting. Furthermore, we examined the usefulness of part-of-speech (POS) categories for the task; by trying to remove nouns and adjectives, we found some evidence that verbs are a valuable POS category for the opinion question class.

Fermi at SemEval-2019 Task 8: An elementary but effective approach to Question Discernment in Community QA Forums
Bakhtiyar Syed | Vijayasaradhi Indurthi | Manish Shrivastava | Manish Gupta | Vasudeva Varma

Online Community Question Answering Forums (cQA) have gained massive popularity within recent years. The rise in users for such forums have led to the increase in the need for automated evaluation for question comprehension and fact evaluation of the answers provided by various participants in the forum. Our team, Fermi, participated in sub-task A of Task 8 at SemEval 2019 - which tackles the first problem in the pipeline of factual evaluation in cQA forums, i.e., deciding whether a posed question asks for a factual information, an opinion/advice or is just socializing. This information is highly useful in segregating factual questions from non-factual ones which highly helps in organizing the questions into useful categories and trims down the problem space for the next task in the pipeline for fact evaluation among the available answers. Our system uses the embeddings obtained from Universal Sentence Encoder combined with XGBoost for the classification sub-task A. We also evaluate other combinations of embeddings and off-the-shelf machine learning algorithms to demonstrate the efficacy of the various representations and their combinations. Our results across the evaluation test set gave an accuracy of 84% and received the first position in the final standings judged by the organizers.

SolomonLab at SemEval-2019 Task 8: Question Factuality and Answer Veracity Prediction in Community Forums
Ankita Gupta | Sudeep Kumar Sahoo | Divya Prakash | Rohit R.R | Vertika Srivastava | Yeon Hyang Kim

We describe our system for SemEval-2019, Task 8 on “Fact-Checking in Community Question Answering Forums (cQA)”. cQA forums are very prevalent nowadays, as they provide an effective means for communities to share knowledge. Unfortunately, this shared information is not always factual and fact-verified. In this task, we aim to identify factual questions posted on cQA and verify the veracity of answers to these questions. Our approach relies on data augmentation and aggregates cues from several dimensions such as semantics, linguistics, syntax, writing style and evidence obtained from trusted external sources. In subtask A, our submission is ranked 3rd, with an accuracy of 83.14%. Our current best solution stands 1st on the leaderboard with 88% accuracy. In subtask B, our present solution is ranked 2nd, with 58.33% MAP score.

TMLab SRPOL at SemEval-2019 Task 8: Fact Checking in Community Question Answering Forums
Piotr Niewiński | Aleksander Wawer | Maria Pszona | Maria Janicka

The article describes our submission to SemEval 2019 Task 8 on Fact-Checking in Community Forums. The systems under discussion participated in Subtask A: decide whether a question asks for factual information, opinion/advice or is just socializing. Our primary submission was ranked as the second one among all participants in the official evaluation phase. The article presents our primary solution: Deeply Regularized Residual Neural Network (DRR NN) with Universal Sentence Encoder embeddings. This is followed by a description of two contrastive solutions based on ensemble methods.

TueFact at SemEval 2019 Task 8: Fact checking in community question answering forums: context matters
Réka Juhász | Franziska Barbara Linnenschmidt | Teslin Roys

The SemEval 2019 Task 8 on Fact-Checking in community question answering forums aimed to classify questions into categories and verify the correctness of answers given on the QatarLiving public forum. The task was divided into two subtasks: the first classifying the question, the second the answers. The TueFact system described in this paper used different approaches for the two subtasks. Subtask A makes use of word vectors based on a bag-of-word-ngram model using up to trigrams. Predictions are done using multi-class logistic regression. The official SemEval result lists an accuracy of 0.60. Subtask B uses vectorized character n-grams up to trigrams instead. Predictions are done using a LSTM model and achieved an accuracy of 0.53 on the final SemEval Task 8 evaluation set.

YNU-HPCC at SemEval-2019 Task 8: Using A LSTM-Attention Model for Fact-Checking in Community Forums
Peng Liu | Jin Wang | Xuejie Zhang

We propose a system that uses a long short-term memory with attention mechanism (LSTM-Attention) model to complete the task. The LSTM-Attention model uses two LSTM to extract the features of the question and answer pair. Then, each of the features is sequentially composed using the attention mechanism, concatenating the two vectors into one. Finally, the concatenated vector is used as input for the MLP and the MLP’s output layer uses the softmax function to classify the provided answers into three categories. This model is capable of extracting the features of the question and answer pair well. The results show that the proposed system outperforms the baseline algorithm.

DBMS-KU at SemEval-2019 Task 9: Exploring Machine Learning Approaches in Classifying Text as Suggestion or Non-Suggestion
Tirana Fatyanosa | Al Hafiz Akbar Maulana Siagian | Masayoshi Aritsugi

This paper describes the participation of DBMS-KU team in the SemEval 2019 Task 9, that is, suggestion mining from online reviews and forums. To deal with this task, we explore several machine learning approaches, i.e., Random Forest (RF), Logistic Regression (LR), Multinomial Naive Bayes (MNB), Linear Support Vector Classification (LSVC), Sublinear Support Vector Classification (SSVC), Convolutional Neural Network (CNN), and Variable Length Chromosome Genetic Algorithm-Naive Bayes (VLCGA-NB). Our system obtains reasonable results of F1-Score 0.47 and 0.37 on the evaluation data in Subtask A and Subtask B, respectively. In particular, our obtained results outperform the baseline in Subtask A. Interestingly, the results seem to show that our system could perform well in classifying Non-suggestion class.

DS at SemEval-2019 Task 9: From Suggestion Mining with neural networks to adversarial cross-domain classification
Tobias Cabanski

Suggestion Mining is the task of classifying sentences into suggestions or non-suggestions. SemEval-2019 Task 9 sets the task to mine suggestions from online texts. For each of the two subtasks, the classification has to be applied on a different domain. Subtask A addresses the domain of posts in suggestion online forums and comes with a set of training examples, that is used for supervised training. A combination of LSTM and CNN networks is constructed to create a model which uses BERT word embeddings as input features. For subtask B, the domain of hotel reviews is regarded. In contrast to subtask A, no labeled data for supervised training is provided, so that additional unlabeled data is taken to apply a cross-domain classification. This is done by using adversarial training of the three model parts label classifier, domain classifier and the shared feature representation. For subtask A, the developed model archives a F1-score of 0.7273, which is in the top ten of the leader board. The F1-score for subtask B is 0.8187 and is ranked in the top five of the submissions for that task.

Hybrid RNN at SemEval-2019 Task 9: Blending Information Sources for Domain-Independent Suggestion Mining
Aysu Ezen-Can | Ethem F. Can

Social media has an increasing amount of information that both customers and companies can benefit from. These social media posts can include Tweets or be in the form of vocalization of complements and complaints (e.g., reviews) of a product or service. Researchers have been actively mining this invaluable information source to automatically generate insights. Mining sentiments of customer reviews is an example that has gained momentum due to its potential to gather information that customers are not happy about. Instead of reading millions of reviews, companies prefer sentiment analysis to obtain feedback and to improve their products or services. In this work, we aim to identify information that companies can act on, or other customers can utilize for making their own experience better. This is different from identifying if reviews of a product or service is negative, positive, or neutral. To that end, we classify sentences of a given review as suggestion or not suggestion so that readers of the reviews do not have to go through thousands of reviews but instead can focus on actionable items and applicable suggestions. To identify suggestions within reviews, we employ a hybrid approach that utilizes a recurrent neural network (RNN) along with rule-based features to build a domain-independent suggestion mining model. In this way, a model trained on electronics reviews is used to extract suggestions from hotel reviews.

INRIA at SemEval-2019 Task 9: Suggestion Mining Using SVM with Handcrafted Features
Ilia Markov | Eric Villemonte de la Clergerie

We present the INRIA approach to the suggestion mining task at SemEval 2019. The task consists of two subtasks: suggestion mining under single-domain (Subtask A) and cross-domain (Subtask B) settings. We used the Support Vector Machines algorithm trained on handcrafted features, function words, sentiment features, digits, and verbs for Subtask A, and handcrafted features for Subtask B. Our best run archived a F1-score of 51.18% on Subtask A, and ranked in the top ten of the submissions for Subtask B with 73.30% F1-score.

Lijunyi at SemEval-2019 Task 9: An attention-based LSTM and ensemble of different models for suggestion mining from online reviews and forums
Junyi Li

In this paper, we describe a suggestion mining system that participated in SemEval 2019 Task 9, SubTask A - Suggestion Mining from Online Reviews and Forums. Given some suggestions from online reviews and forums that can be classified into suggestion and non-suggestion classes. In this task, we combine the attention mechanism with the LSTM model, which is the final system we submitted. The final submission achieves 14th place in Task 9, SubTask A with the accuracy of 0.6776. After the challenge, we train a series of neural network models such as convolutional neural network(CNN), TextCNN, long short-term memory(LSTM) and C-LSTM. Finally, we make an ensemble on the predictions of these models and get a better result.

MIDAS at SemEval-2019 Task 9: Suggestion Mining from Online Reviews using ULMFit
Sarthak Anand | Debanjan Mahata | Kartik Aggarwal | Laiba Mehnaz | Simra Shahid | Haimin Zhang | Yaman Kumar | Rajiv Shah | Karan Uppal

In this paper we present our approach to tackle the Suggestion Mining from Online Reviews and Forums Sub-Task A. Given a review, we are asked to predict whether the review consists of a suggestion or not. Our model is based on Universal Language Model Fine-tuning for Text Classification. We apply various pre-processing techniques before training the language and the classification model. We further provide analysis of the model. Our team ranked 10th out of 34 participants, achieving an F1 score of 0.7011.

NL-FIIT at SemEval-2019 Task 9: Neural Model Ensemble for Suggestion Mining
Samuel Pecar | Marian Simko | Maria Bielikova

In this paper, we present neural model architecture submitted to the SemEval-2019 Task 9 competition: “Suggestion Mining from Online Reviews and Forums”. We participated in both subtasks for domain specific and also cross-domain suggestion mining. We proposed a recurrent neural network architecture that employs Bi-LSTM layers and also self-attention mechanism. Our architecture tries to encode words via word representation using ELMo and ensembles multiple models to achieve better results. We highlight importance of pre-processing of user-generated samples and its contribution to overall results. We performed experiments with different setups of our proposed model involving weighting of prediction classes for loss function. Our best model achieved in official test evaluation score of 0.6816 for subtask A and 0.6850 for subtask B. In official results, we achieved 12th and 10th place in subtasks A and B, respectively.

NTUA-ISLab at SemEval-2019 Task 9: Mining Suggestions in the wild
Rolandos Alexandros Potamias | Alexandros Neofytou | Georgios Siolas

As online customer forums and product comparison sites increase their societal influence, users are actively expressing their opinions and posting their recommendations on their fellow customers online. However, systems capable of recognizing suggestions still lack in stability. Suggestion Mining, a novel and challenging field of Natural Language Processing, is increasingly gaining attention, aiming to track user advice on online forums. In this paper, a carefully designed methodology to identify customer-to-company and customer-to-customer suggestions is presented. The methodology implements a rule-based classifier using heuristic, lexical and syntactic patterns. The approach ranked at 5th and 1st position, achieving an f1-score of 0.749 and 0.858 for SemEval-2019/Suggestion Mining sub-tasks A and B, respectively. In addition, we were able to improve performance results by combining the rule-based classifier with a recurrent convolutional neural network, that exhibits an f1-score of 0.79 for subtask A.

OleNet at SemEval-2019 Task 9: BERT based Multi-Perspective Models for Suggestion Mining
Jiaxiang Liu | Shuohuan Wang | Yu Sun

This paper describes our system partici- pated in Task 9 of SemEval-2019: the task is focused on suggestion mining and it aims to classify given sentences into sug- gestion and non-suggestion classes in do- main specific and cross domain training setting respectively. We propose a multi- perspective architecture for learning rep- resentations by using different classical models including Convolutional Neural Networks (CNN), Gated Recurrent Units (GRU), Feed Forward Attention (FFA), etc. To leverage the semantics distributed in large amount of unsupervised data, we also have adopted the pre-trained Bidi- rectional Encoder Representations from Transformers (BERT) model as an en- coder to produce sentence and word rep- resentations. The proposed architecture is applied for both sub-tasks, and achieved f1-score of 0.7812 for subtask A, and 0.8579 for subtask B. We won the first and second place for the two tasks respec- tively in the final competition.

SSN-SPARKS at SemEval-2019 Task 9: Mining Suggestions from Online Reviews using Deep Learning Techniques on Augmented Data
Rajalakshmi S | Angel Suseelan | S Milton Rajendram | Mirnalinee T T

This paper describes the work on mining the suggestions from online reviews and forums. Opinion mining detects whether the comments are positive, negative or neutral, while suggestion mining explores the review content for the possible tips or advice. The system developed by SSN-SPARKS team in SemEval-2019 for task 9 (suggestion mining) uses a rule-based approach for feature selection, SMOTE technique for data augmentation and deep learning technique (Convolutional Neural Network) for classification. We have compared the results with Random Forest classifier (RF) and MultiLayer Perceptron (MLP) model. Results show that the CNN model performs better than other models for both the subtasks.

Suggestion Miner at SemEval-2019 Task 9: Suggestion Detection in Online Forum using Word Graph
Usman Ahmed | Humera Liaquat | Luqman Ahmed | Syed Jawad Hussain

This paper describes the suggestion miner system that participates in SemEval 2019 Task 9 - SubTask A - Suggestion Mining from Online Reviews and Forums. The system participated in the subtasks A. This paper discusses the results of our system in the development, evaluation and post evaluation. Each class in the dataset is represented as directed unweighted graphs. Then, the comparison is carried out with each class graph which results in a vector. This vector is used as features by a machine learning algorithm. The model is evaluated on hold on strategy. The organizers randomly split (8500 instances) training set (provided to the participant in training their system) and testing set (833 instances). The test set is reserved to evaluate the performance of participants systems. During the evaluation, our system ranked 31 in the Coda Lab result of the subtask A (binary class problem). The binary class system achieves evaluation value 0.34, precision 0.87, recall 0.73 and F measure 0.78.

Team Taurus at SemEval-2019 Task 9: Expert-informed pattern recognition for suggestion mining
Nelleke Oostdijk | Hans van Halteren

This paper presents our submissions to SemEval-2019 Task9, Suggestion Mining. Our system is one in a series of systems in which we compare an approach using expert-defined rules with a comparable one using machine learning. We target tasks with a syntactic or semantic component that might be better described by a human understanding the task than by a machine learner only able to count features. For Semeval-2019 Task 9, the expert rules clearly outperformed our machine learning model when training and testing on equally balanced testsets.

ThisIsCompetition at SemEval-2019 Task 9: BERT is unstable for out-of-domain samples
Cheoneum Park | Juae Kim | Hyeon-gu Lee | Reinald Kim Amplayo | Harksoo Kim | Jungyun Seo | Changki Lee

This paper describes our system, Joint Encoders for Stable Suggestion Inference (JESSI), for the SemEval 2019 Task 9: Suggestion Mining from Online Reviews and Forums. JESSI is a combination of two sentence encoders: (a) one using multiple pre-trained word embeddings learned from log-bilinear regression (GloVe) and translation (CoVe) models, and (b) one on top of word encodings from a pre-trained deep bidirectional transformer (BERT). We include a domain adversarial training module when training for out-of-domain samples. Our experiments show that while BERT performs exceptionally well for in-domain samples, several runs of the model show that it is unstable for out-of-domain samples. The problem is mitigated tremendously by (1) combining BERT with a non-BERT encoder, and (2) using an RNN-based classifier on top of BERT. Our final models obtained second place with 77.78% F-Score on Subtask A (i.e. in-domain) and achieved an F-Score of 79.59% on Subtask B (i.e. out-of-domain), even without using any additional external data.

WUT at SemEval-2019 Task 9: Domain-Adversarial Neural Networks for Domain Adaptation in Suggestion Mining
Mateusz Klimaszewski | Piotr Andruszkiewicz

We present a system for cross-domain suggestion mining, prepared for the SemEval-2019 Task 9: Suggestion Mining from Online Reviews and Forums (Subtask B). Our submitted solution for this text classification problem explores the idea of treating different suggestions’ sources as one of the settings of Transfer Learning - Domain Adaptation. Our experiments show that without any labeled target domain examples during training time, we are capable of proposing a system, reaching up to 0.778 in terms of F1 score on test dataset, based on Target Preserving Domain Adversarial Neural Networks.

Yimmon at SemEval-2019 Task 9: Suggestion Mining with Hybrid Augmented Approaches
Yimeng Zhuang

Suggestion mining task aims to extract tips, advice, and recommendations from unstructured text. The task includes many challenges, such as class imbalance, figurative expressions, context dependency, and long and complex sentences. This paper gives a detailed system description of our submission in SemEval 2019 Task 9 Subtask A. We transfer Self-Attention Network (SAN), a successful model in machine reading comprehension field, into this task. Our model concentrates on modeling long-term dependency which is indispensable to parse long and complex sentences. Besides, we also adopt techniques, such as contextualized embedding, back-translation, and auxiliary loss, to augment the system. Our model achieves a performance of F1=76.3, and rank 4th among 34 participating systems. Further ablation study shows that the techniques used in our system are beneficial to the performance.

YNU_DYX at SemEval-2019 Task 9: A Stacked BiLSTM for Suggestion Mining Classification
Yunxia Ding | Xiaobing Zhou | Xuejie Zhang

In this paper we describe a deep-learning system that competed as SemEval 2019 Task 9-SubTask A: Suggestion Mining from Online Reviews and Forums. We use Word2Vec to learn the distributed representations from sentences. This system is composed of a Stacked Bidirectional Long-Short Memory Network (SBiLSTM) for enriching word representations before and after the sequence relationship with context. We perform an ensemble to improve the effectiveness of our model. Our official submission results achieve an F1-score 0.5659.

YNU-HPCC at SemEval-2019 Task 9: Using a BERT and CNN-BiLSTM-GRU Model for Suggestion Mining
Ping Yue | Jin Wang | Xuejie Zhang

Consumer opinions towards commercial entities are generally expressed through online reviews, blogs, and discussion forums. These opinions largely express positive and negative sentiments towards a given entity,but also tend to contain suggestions for improving the entity. In this task, we extract suggestions from given the unstructured text, compared to the traditional opinion mining systems. Such suggestion mining is more applicability and extends capabilities.

Zoho at SemEval-2019 Task 9: Semi-supervised Domain Adaptation using Tri-training for Suggestion Mining
Sai Prasanna | Sri Ananda Seelan

This paper describes our submission for the SemEval-2019 Suggestion Mining task. A simple Convolutional Neural Network (CNN) classifier with contextual word representations from a pre-trained language model was used for sentence classification. The model is trained using tri-training, a semi-supervised bootstrapping mechanism for labelling unseen data. Tri-training proved to be an effective technique to accommodate domain shift for cross-domain suggestion mining (Subtask B) where there is no hand labelled training data. For in-domain evaluation (Subtask A), we use the same technique to augment the training set. Our system ranks thirteenth in Subtask A with an F1-score of 68.07 and third in Subtask B with an F1-score of 81.94.

ZQM at SemEval-2019 Task9: A Single Layer CNN Based on Pre-trained Model for Suggestion Mining
Qimin Zhou | Zhengxin Zhang | Hao Wu | Linmao Wang

This paper describes our system that competed at SemEval 2019 Task 9 - SubTask A: ”Sug- gestion Mining from Online Reviews and Forums”. Our system fuses the convolutional neural network and the latest BERT model to conduct suggestion mining. In our system, the input of convolutional neural network is the embedding vectors which are drawn from the pre-trained BERT model. And to enhance the effectiveness of the whole system, the pre-trained BERT model is fine-tuned by provided datasets before the procedure of embedding vectors extraction. Empirical results show the effectiveness of our model which obtained 9th position out of 34 teams with F1 score equals to 0.715.

ProblemSolver at SemEval-2019 Task 10: Sequence-to-Sequence Learning and Expression Trees
Xuefeng Luo | Alina Baranova | Jonas Biegert

This paper describes our participation in SemEval-2019 shared task “Math Question Answering”, where the aim is to create a program that could solve the Math SAT questions automatically as accurately as possible. We went with a dual-pronged approach, building a Sequence-to-Sequence Neural Network pre-trained with augmented data that could answer all categories of questions and a Tree system, which can only answer a certain type of questions. The systems did not perform well on the entire test data given in the task, but did decently on the questions they were actually capable of answering. The Sequence-to-Sequence Neural Network model managed to get slightly better than our baseline of guessing “A” for every question, while the Tree system additionally improved the results.

RGCL-WLV at SemEval-2019 Task 12: Toponym Detection
Alistair Plum | Tharindu Ranasinghe | Pablo Calleja | Constantin Orăsan | Ruslan Mitkov

This article describes the system submitted by the RGCL-WLV team to the SemEval 2019 Task 12: Toponym resolution in scientific papers. The system detects toponyms using a bootstrapped machine learning (ML) approach which classifies names identified using gazetteers extracted from the GeoNames geographical database. The paper evaluates the performance of several ML classifiers, as well as how the gazetteers influence the accuracy of the system. Several runs were submitted. The highest precision achieved for one of the submissions was 89%, albeit it at a relatively low recall of 49%.

THU_NGN at SemEval-2019 Task 12: Toponym Detection and Disambiguation on Scientific Papers
Tao Qi | Suyu Ge | Chuhan Wu | Yubo Chen | Yongfeng Huang

First name: Tao Last name: Qi Email: Affiliation: Department of Electronic Engineering, Tsinghua University First name: Suyu Last name: Ge Email: Affiliation: Department of Electronic Engineering, Tsinghua University First name: Chuhan Last name: Wu Email: Affiliation: Department of Electronic Engineering, Tsinghua University First name: Yubo Last name: Chen Email: Affiliation: Department of Electronic Engineering, Tsinghua University First name: Yongfeng Last name: Huang Email: Affiliation: Department of Electronic Engineering, Tsinghua University Toponym resolution is an important and challenging task in the neural language processing field, and has wide applications such as emergency response and social media geographical event analysis. Toponym resolution can be roughly divided into two independent steps, i.e., toponym detection and toponym disambiguation. In order to facilitate the study on toponym resolution, the SemEval 2019 task 12 is proposed, which contains three subtasks, i.e., toponym detection, toponym disambiguation and toponym resolution. In this paper, we introduce our system that participated in the SemEval 2019 task 12. For toponym detection, in our approach we use TagLM as the basic model, and explore the use of various features in this task, such as word embeddings extracted from pre-trained language models, POS tags and lexical features extracted from dictionaries. For toponym disambiguation, we propose a heuristics rule-based method using toponym frequency and population. Our systems achieved 83.03% strict macro F1, 74.50 strict micro F1, 85.92 overlap macro F1 and 78.47 overlap micro F1 in toponym detection subtask.

UNH at SemEval-2019 Task 12: Toponym Resolution in Scientific Papers
Matthew Magnusson | Laura Dietz

The SemEval-2019 Task 12 is toponym resolution in scientific papers. We focus on Subtask 1: Toponym Detection which is the identification of spans of text for place names mentioned in a document. We propose two methods: 1) sliding window convolutional neural network using ELMo embeddings (cnn-elmo), and 2) sliding window multi-Layer perceptron using ELMo embeddings (mlp-elmo). We also submit Bi-lateral LSTM with Conditional Random Fields (bi-LSTM) as a strong baseline given its state-of-art performance in Named Entity Recognition (NER) task. Our best performing model is cnn-elmo with a F1 of 0.844 which was below bi-LSTM F1 of 0.862 when evaluated on overlap macro detection. Eight teams participated in this subtask with a total of 21 submissions.

UniMelb at SemEval-2019 Task 12: Multi-model combination for toponym resolution
Haonan Li | Minghan Wang | Timothy Baldwin | Martin Tomko | Maria Vasardani

This paper describes our submission to SemEval-2019 Task 12 on toponym resolution over scientific articles. We train separate NER models for toponym detection over text extracted from tables vs. text from the body of the paper, and train another auxiliary model to eliminate misdetected toponyms. For toponym disambiguation, we use an SVM classifier with hand-engineered features. The best setting achieved a strict micro-F1 score of 80.92% and overlap micro-F1 score of 86.88% in the toponym detection subtask, ranking 2nd out of 8 teams on F1 score. For toponym disambiguation and end-to-end resolution, we officially ranked 2nd and 3rd, respectively.

University of Arizona at SemEval-2019 Task 12: Deep-Affix Named Entity Recognition of Geolocation Entities
Vikas Yadav | Egoitz Laparra | Ti-Tai Wang | Mihai Surdeanu | Steven Bethard

We present the Named Entity Recognition (NER) and disambiguation model used by the University of Arizona team (UArizona) for the SemEval 2019 task 12. We achieved fourth place on tasks 1 and 3. We implemented a deep-affix based LSTM-CRF NER model for task 1, which utilizes only character, word, pre- fix and suffix information for the identification of geolocation entities. Despite using just the training data provided by task organizers and not using any lexicon features, we achieved 78.85% strict micro F-score on task 1. We used the unsupervised population heuristics for task 3 and achieved 52.99% strict micro-F1 score in this task.