Lexical Semantic Change detection, i.e., the task of identifying words that change meaning over time, is a very active research area, with applications in NLP, lexicography, and linguistics. Evaluation is currently the most pressing problem in Lexical Semantic Change detection, as no gold standards are available to the community, which hinders progress. We present the results of the first shared task that addresses this gap by providing researchers with an evaluation framework and manually annotated, high-quality datasets for English, German, Latin, and Swedish. 33 teams submitted 186 systems, which were evaluated on two subtasks.
Lexical entailment (LE) is a fundamental asymmetric lexico-semantic relation, supporting the hierarchies in lexical resources (e.g., WordNet, ConceptNet) and applications like natural language inference and taxonomy induction. Multilingual and cross-lingual NLP applications warrant models for LE detection that go beyond language boundaries. As part of SemEval 2020, we carried out a shared task (Task 2) on multilingual and cross-lingual LE. The shared task spans three dimensions: (1) monolingual vs. cross-lingual LE, (2) binary vs. graded LE, and (3) a set of 6 diverse languages (and 15 corresponding language pairs). We offered two different evaluation tracks: (a) Dist: for unsupervised, fully distributional models that capture LE solely on the basis of unannotated corpora, and (b) Any: for externally informed models, allowed to leverage any resources, including lexico-semantic networks (e.g., WordNet or BabelNet). In the Any track, we received runs that push the state of the art across all languages and language pairs, for both binary LE detection and graded LE prediction.
This paper presents the Graded Word Similarity in Context (GWSC) task which asked participants to predict the effects of context on human perception of similarity in English, Croatian, Slovene and Finnish. We received 15 submissions and 11 system description papers. A new dataset (CoSimLex) was created for evaluation in this task: it contains pairs of words, each annotated within two different contexts. Systems beat the baselines by significant margins, but few did well in more than one language or subtask. Almost every system employed a Transformer model, but with many variations in the details: WordNet sense embeddings, translation of contexts, TF-IDF weightings, and the automatic creation of datasets for fine-tuning were all used to good effect.
This paper describes DiaSense, a system developed for Task 1 ‘Unsupervised Lexical Semantic Change Detection’ of SemEval 2020. In DiaSense, contextualized word embeddings are used to model word sense changes. This allows for the calculation of metrics which mimic human intuitions about the semantic relatedness between individual use pairs of a target word for the assessment of lexical semantic change. DiaSense is able to detect lexical semantic change in English, German, Latin and Swedish (accuracy = 0.728). Moreover, DiaSense differentiates between weak and strong change.
This paper describes the system submitted by our team (BabelEnconding) to SemEval-2020 Task 3: Predicting the Graded Effect of Context in Word Similarity. We propose an approach that relies on translation and multilingual language models in order to compute the contextual similarity between pairs of words. Our hypothesis is that evidence from additional languages can leverage the correlation with the human generated scores. BabelEnconding was applied to both subtasks and ranked among the top-3 in six out of eight task/language combinations and was the highest scoring system three times.
This paper describes the approaches used by the Discovery Team to solve SemEval-2020 Task 1 - Unsupervised Lexical Semantic Change Detection. The proposed method is based on clustering of BERT contextual embeddings, followed by a comparison of cluster distributions across time. The best results were obtained by an ensemble of this method and static Word2Vec embeddings. According to the official results, our approach proved the best for Latin in Subtask 2.
This paper describes the system proposed by the Random team for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. We focus our approach on the detection problem. Given the semantics of words captured by temporal word embeddings in different time periods, we investigate the use of unsupervised methods to detect when the target word has gained or lost senses. To this end, we define a new algorithm based on Gaussian Mixture Models to cluster the target similarities computed over the two periods. We compare the proposed approach with a number of similarity-based thresholds. We found that, although the performance of the detection methods varies across the word embedding algorithms, the combination of Gaussian Mixture with Temporal Referencing resulted in our best system.
We present the results of our system for SemEval-2020 Task 1 that exploits a commonly used lexical semantic change detection model based on Skip-Gram with Negative Sampling. Our system focuses on Vector Initialization (VI) alignment, compares VI to the currently top-ranking models for Subtask 2 and demonstrates that these can be outperformed if we optimize VI dimensionality. We demonstrate that differences in performance can largely be attributed to model-specific sources of noise, and we reveal a strong relationship between dimensionality and frequency-induced noise in VI alignment. Our results suggest that lexical semantic change models integrating vector space alignment should pay more attention to the role of the dimensionality parameter.
In this paper, we present our contribution to SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection, where we systematically combine existing models for unsupervised capturing of lexical semantic change across time in text corpora of German, English, Latin and Swedish. In particular, we analyze the score distribution of existing models. Then we define a general threshold, adjust it independently to each of the models and measure the models’ score reliability. Finally, using both the threshold and score reliability, we aggregate the models for the two subtasks: binary classification and ranking.
This paper describes the model proposed and submitted by our RIJP team to SemEval 2020 Task1: Unsupervised Lexical Semantic Change Detection. In the model, words are represented by Gaussian distributions. For Subtask 1, the model achieved average scores of 0.51 and 0.70 in the evaluation and post-evaluation processes, respectively. The higher score in the post-evaluation process than that in the evaluation process was achieved owing to appropriate parameter tuning. The results indicate that the proposed Gaussian-based embedding model is able to express semantic shifts while having a low computational
This paper describes SChME (Semantic Change Detection with Model Ensemble), a method used in SemEval-2020 Task 1 on unsupervised detection of lexical semantic change. SChME uses a model ensemble combining signals from distributional models (word embeddings) and word frequency, where each model casts a vote indicating the probability that a word suffered semantic change according to that feature. More specifically, we use the cosine distance of word vectors, a neighborhood-based metric we named Mapped Neighborhood Distance (MAP), and a word frequency differential metric as input signals to our model. Additionally, we explore alignment-based methods to investigate the importance of the landmarks used in this process. Our results show evidence that the number of landmarks used for alignment has a direct impact on the predictive performance of the model. Moreover, we show that languages that suffer less semantic change tend to benefit from using a large number of landmarks, whereas languages with more semantic change benefit from a more careful choice of landmark number for alignment.
We (Team Skurt) propose a simple method to detect lexical semantic change by clustering contextualized embeddings produced by XLM-R, using K-Means++. The basic idea is that contextualized embeddings that encode the same sense are located in close proximity in the embedding space. Our approach is both simple and generic, yet performs relatively well in both subtasks of SemEval-2020 Task 1. We hypothesize that the main shortcoming of our method lies in the simplicity of the clustering method used.
This paper describes the UCD system entered for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. We propose a novel method based on distance between temporally referenced nodes in a semantic network constructed from a combination of the time specific corpora. We argue for the value of semantic networks as objects for transparent exploratory analysis and visualisation of lexical semantic change, and present an implementation of a web application for the purpose of searching and visualising semantic networks. The results of the change measure used for this task were not among the best performing systems, but further calibration of the distance metric and backoff approaches may improve this method.
We apply contextualised word embeddings to lexical semantic change detection in the SemEval-2020 Shared Task 1. This paper focuses on Subtask 2, ranking words by the degree of their semantic drift over time. We analyse the performance of two contextualising architectures (BERT and ELMo) and three change detection algorithms. We find that the most effective algorithms rely on the cosine similarity between averaged token embeddings and the pairwise distances between token embeddings. They outperform strong baselines by a large margin (in the post-evaluation phase, we have the best Subtask 2 submission for SemEval-2020 Task 1), but interestingly, the choice of a particular algorithm depends on the distribution of gold scores in the test set.
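The two kinds of change score mentioned above can be illustrated with a minimal sketch. It assumes each period is given as a list of token embeddings for one target word; the function names and the toy two-dimensional vectors are illustrative assumptions, not the authors' implementation or data.

```python
import math
from itertools import product


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise average of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def averaged_embedding_distance(period1, period2):
    """1 - cosine similarity between the averaged token embeddings."""
    return 1.0 - cosine_similarity(mean_vector(period1), mean_vector(period2))


def average_pairwise_distance(period1, period2):
    """Mean cosine distance over all cross-period pairs of token embeddings."""
    distances = [1.0 - cosine_similarity(u, v) for u, v in product(period1, period2)]
    return sum(distances) / len(distances)


# Hypothetical toy embeddings: one word whose usages stay similar, one that shifts.
stable_t1 = [[1.0, 0.0], [0.9, 0.1]]
stable_t2 = [[1.0, 0.1], [0.8, 0.0]]
shifted_t1 = [[1.0, 0.0], [0.9, 0.1]]
shifted_t2 = [[0.0, 1.0], [0.1, 0.9]]

for name, (t1, t2) in {"stable": (stable_t1, stable_t2),
                       "shifted": (shifted_t1, shifted_t2)}.items():
    print(name, averaged_embedding_distance(t1, t2), average_pairwise_distance(t1, t2))
```

Ranking target words by either score, from largest to smallest, gives a Subtask 2 style ordering; which score works better is an empirical question the abstract ties to the gold score distribution.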
In this paper we present a novel rule-based, language independent method for determining lexical entailment relations using semantic representations built from Wiktionary definitions. Combined with a simple WordNet-based method our system achieves top scores on the English and Italian datasets of the SemEval-2020 task “Predicting Multilingual and Cross-lingual (graded) Lexical Entailment” (Glavaš et al., 2020). A detailed error analysis of our output uncovers future directions for improving both the semantic parsing method and the inference process on semantic graphs.
This paper presents the team BRUMS submission to SemEval-2020 Task 3: Graded Word Similarity in Context. The system utilises state-of-the-art contextualised word embeddings, which have some task-specific adaptations, including stacked embeddings and average embeddings. Overall, the approach achieves good evaluation scores across all the languages, while maintaining simplicity. Following the final rankings, our approach is ranked within the top 5 solutions of each language while preserving the 1st position of Finnish subtask 2.
This paper presents our systems to solve Task 3 of SemEval-2020, which aims to predict the effect that context has on human perception of similarity of words. The task consists of two subtasks in English, Croatian, Finnish, and Slovenian: (1) predicting the change of similarity and (2) predicting the human scores of similarity, both of them for a pair of words within two different contexts. We tackled the problem by developing two systems. The first one uses a centroid approach and word vectors. The second one uses the ELMo language model, which is trained for each pair of words with the given context. Our approach achieved the highest score in subtask 2 for the English language.
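The two subtask outputs can be sketched as follows, assuming that a contextual embedding is already available for each word of the pair in each of the two contexts. The function name, the sign convention (second context minus first) and the toy vectors are assumptions made for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def context_effect(w1_ctx1, w2_ctx1, w1_ctx2, w2_ctx2):
    """Return (similarity in context 1, similarity in context 2, change).

    The two similarities correspond to a subtask 2 style prediction and the
    difference to a subtask 1 style prediction. The direction of the
    difference is an assumption of this sketch.
    """
    sim1 = cosine_similarity(w1_ctx1, w2_ctx1)
    sim2 = cosine_similarity(w1_ctx2, w2_ctx2)
    return sim1, sim2, sim2 - sim1


# Hypothetical contextual embeddings for one word pair in two contexts.
sim1, sim2, change = context_effect(
    [1.0, 0.0], [0.0, 1.0],   # context 1: the two words are used very differently
    [1.0, 0.0], [1.0, 0.0],   # context 2: the two words are used alike
)
print(sim1, sim2, change)
```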
We present the MULTISEM systems submitted to SemEval 2020 Task 3: Graded Word Similarity in Context (GWSC). We experiment with injecting semantic knowledge into pre-trained BERT models through fine-tuning on lexical semantic tasks related to GWSC. We use existing semantically annotated datasets, and propose to approximate similarity through automatically generated lexical substitutes in context. We participate in both GWSC subtasks and address two languages, English and Finnish. Our best English models occupy the third and fourth positions in the ranking for the two subtasks. Performance is lower for the Finnish models which are mid-ranked in the respective subtasks, highlighting the important role of data availability for fine-tuning.
CoSimLex is a dataset that can be used to evaluate the ability of context-dependent word embeddings for modeling subtle, graded changes of meaning, as perceived by humans during reading. At SemEval-2020, task 3, subtask 1 is about “predicting the (graded) effect of context in word similarity”, using CoSimLex to quantify such a change of similarity for a pair of words, from one context to another. Here, a meaning shift is composed of two aspects, a) discrete changes observed between different word senses, and b) more subtle changes of meaning representation that are not captured in those discrete changes. Therefore, this SemEval task was designed to allow the evaluation of systems that can deal with a mix of both situations of semantic shift, as they occur in the human perception of meaning. The described system was developed to improve the BERT baseline provided with the task, by reducing distortions in the BERT semantic space, compared to the human semantic space. To this end, complementarity between 768- and 1024-dimensional BERT embeddings, and average word sense vectors were used. With this system, after some fine-tuning, the baseline performance of 0.705 (uncentered Pearson correlation with human semantic shift data from 27 annotators) was enhanced by more than 6%, to 0.7645. We hope that this work can make a contribution to further our understanding of the semantic vector space of human perception, as it can be modeled with context-dependent word embeddings in natural language processing systems.
SemEval-2020 Task 1 is devoted to detection of changes in word meaning over time. The first subtask raises a question if a particular word has acquired or lost any of its senses during the given time period. The second subtask requires estimating the change in frequencies of the word senses. We have submitted two solutions for both subtasks. The first solution performs word sense induction (WSI) first, then makes the decision based on the induced word senses. We extend the existing WSI method based on clustering of lexical substitutes generated with neural language models and adapt it to the task. The second solution exploits a well-known approach to semantic change detection, that includes building word2vec SGNS vectors, aligning them with Orthogonal Procrustes and calculating cosine distance between resulting vectors. While WSI-based solution performs better in Subtask 1, which requires binary decisions, the second solution outperforms it in Subtask 2 and obtains the 3rd best result in this subtask.
This paper describes the winning contribution to SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection (Subtask 2) handed in by team UG Student Intern. We present an ensemble model that makes predictions based on context-free and context-dependent word representations. The key findings are that (1) context-free word representations are a powerful and robust baseline, (2) a sentence classification objective can be used to obtain useful context-dependent word representations, and (3) combining those representations increases performance on some datasets while decreasing performance on others.
This paper describes the system Clustering on Manifolds of Contextualized Embeddings (CMCE) submitted to the SemEval-2020 Task 1 on Unsupervised Lexical Semantic Change Detection. Subtask 1 asks to identify whether or not a word gained/lost a sense across two time periods. Subtask 2 is about computing a ranking of words according to the amount of change their senses underwent. Our system uses contextualized word embeddings from MBERT, whose dimensionality we reduce with an autoencoder and the UMAP algorithm, to be able to use a wider array of clustering algorithms that can automatically determine the number of clusters. We use Hierarchical Density Based Clustering (HDBSCAN) and compare it to Gaussian Mixture Models (GMMs) and other clustering algorithms. Remarkably, with only 10 dimensional MBERT embeddings (reduced from the original size of 768), our submitted model performs best on subtask 1 for English and ranks third in subtask 2 for English. In addition to describing our system, we discuss our hyperparameter configurations and examine why our system lags behind for the other languages involved in the shared task (German, Swedish, Latin). Our code is available at https://github.com/DavidRother/semeval2020-task1
We present a system for the task of unsupervised lexical change detection: given a target word and two corpora spanning different periods of time, automatically detect whether the word has lost or gained senses from one corpus to another. Our system employs the temporal referencing method to obtain compatible representations of target words in different periods of time. This is done by concatenating corpora of different periods and performing a temporal referencing of target words, i.e., treating occurrences of target words in different periods as two independent tokens. Afterwards, we train word embeddings on the joint corpus and compare the referenced vectors of each target word using cosine similarity. Our submission was ranked 7th among 34 teams for subtask 1, obtaining an average accuracy of 0.637, only 0.050 points behind the first ranked system.
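The corpus preparation step of temporal referencing can be sketched as below. The tag format (`word_t1`, `word_t2`), the function name and the toy sentences are assumptions for illustration; training embeddings on the joint corpus and comparing the two tagged vectors with cosine similarity are left out.

```python
def temporally_reference(tokens, targets, period_tag):
    """Append a period tag to every occurrence of a target word.

    Non-target words are left untouched, so they are shared across periods
    and tie the two sets of target tokens into one embedding space.
    """
    return [f"{tok}_{period_tag}" if tok in targets else tok for tok in tokens]


# Hypothetical tokenized corpora for two time periods.
corpus_t1 = [["the", "plane", "of", "the", "table"],
             ["a", "flat", "plane", "surface"]]
corpus_t2 = [["the", "plane", "landed", "late"],
             ["a", "plane", "ticket"]]
targets = {"plane"}

# Concatenate both periods into one joint corpus with tagged targets.
joint_corpus = (
    [temporally_reference(sentence, targets, "t1") for sentence in corpus_t1]
    + [temporally_reference(sentence, targets, "t2") for sentence in corpus_t2]
)
vocabulary = {token for sentence in joint_corpus for token in sentence}
print(sorted(vocabulary))
```

Any embedding model trained on `joint_corpus` then yields separate vectors for `plane_t1` and `plane_t2` in a single space, so no post-hoc alignment is needed before comparing them.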
This paper describes EmbLexChange, a system introduced by the “Life-Language” team for SemEval-2020 Task 1, on unsupervised detection of lexical-semantic changes. EmbLexChange is defined as the divergence between the embedding based profiles of word w (calculated with respect to a set of reference words) in the source and the target domains (source and target domains can be simply two time frames t_1 and t_2). The underlying assumption is that the lexical-semantic change of word w would affect its co-occurring words and subsequently alters the neighborhoods in the embedding spaces. We show that using a resampling framework for the selection of reference words (with conserved senses), we can more reliably detect lexical-semantic changes in English, German, Swedish, and Latin. EmbLexChange achieved second place in the binary detection of semantic changes in the SemEval-2020.
This paper presents a vector initialization approach for the SemEval2020 Task 1: Unsupervised Lexical Semantic Change Detection. Given two corpora belonging to different time periods and a set of target words, this task requires us to classify whether a word gained or lost a sense over time (subtask 1) and to rank them on the basis of the changes in their word senses (subtask 2). The proposed approach is based on using Vector Initialization method to align GloVe embeddings. The idea is to consecutively train GloVe embeddings for both corpora, while using the first model to initialize the second one. This paper is based on the hypothesis that GloVe embeddings are more suited for the Vector Initialization method than SGNS embeddings. It presents an intuitive reasoning behind this hypothesis, and also talks about the impact of various factors and hyperparameters on the performance of the proposed approach. Our model ranks 12th and 10th among 33 teams in the two subtasks. The implementation has been shared publicly.
Lexical semantic change detection (also known as semantic shift tracing) is a task of identifying words that have changed their meaning over time. Unsupervised semantic shift tracing, focal point of SemEval2020, is particularly challenging. Given the unsupervised setup, in this work, we propose to identify clusters among different occurrences of each target word, considering these as representatives of different word meanings. As such, disagreements in obtained clusters naturally allow to quantify the level of semantic shift per each target word in four target languages. To leverage this idea, clustering is performed on contextualized (BERT-based) embeddings of word occurrences. The obtained results show that our approach performs well both measured separately (per language) and overall, where we surpass all provided SemEval baselines.
This paper describes our TemporalTeller system for SemEval Task 1: Unsupervised Lexical Semantic Change Detection. We develop a unified framework for the common semantic change detection pipelines including preprocessing, learning word embeddings, calculating vector distances and determining threshold. We also propose Gamma Quantile Threshold to distinguish between changed and stable words. Based on our system, we conduct a comprehensive comparison among BERT, Skip-gram, Temporal Referencing and alignment-based methods. Evaluation results show that Skip-gram with Temporal Referencing achieves the best performance of 66.5% classification accuracy and 51.8% Spearman’s Ranking Correlation.
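The two evaluation figures quoted above, classification accuracy and Spearman's rank correlation, can be computed as in the following sketch. It is a plain standard-library illustration with hypothetical toy inputs, not the task's official scorer, and it does not cover the Gamma Quantile Threshold.

```python
import math


def accuracy(predicted, gold):
    """Fraction of binary change predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical system output and gold data for four target words.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
print(spearman([0.9, 0.1, 0.4, 0.7], [0.8, 0.2, 0.5, 0.3]))
```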
This paper describes our system for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. Target words of corpora from two different time periods are classified according to their semantic change. The languages covered are English, German, Latin, and Swedish. Our approach involves clustering ELMo embeddings using DBSCAN and K-means. For a more fine grained detection of semantic change we take the Jensen-Shannon Distance metric and rank the target words from strongest to weakest change. The results show that this is a valid approach for the classification subtask where we rank 13th out of 33 groups with an accuracy score of 61.2%. For the ranking subtask we score a Spearman’s rank-order correlation coefficient of 0.087 which places us on rank 29.
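The ranking step can be illustrated with a short sketch of the Jensen-Shannon distance between the cluster distributions of a target word in the two periods. It assumes cluster labels are integers from 0 to `n_clusters - 1` (so noise points, which DBSCAN labels as -1, would need separate handling); the function names and toy labels are hypothetical.

```python
import math
from collections import Counter


def cluster_distribution(labels, n_clusters):
    """Relative frequency of each cluster id 0..n_clusters-1 among the labels."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(k, 0) / total for k in range(n_clusters)]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2, so in [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


# Hypothetical cluster assignments of one target word's usages per period.
labels_t1 = [0, 0, 0, 1]
labels_t2 = [0, 1, 1, 1]
p = cluster_distribution(labels_t1, n_clusters=2)
q = cluster_distribution(labels_t2, n_clusters=2)
print(jensen_shannon_distance(p, q))
```

Sorting target words by this distance in descending order gives a ranking from strongest to weakest change.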
Much as the social landscape in which languages are spoken shifts, language too evolves to suit the needs of its users. Lexical semantic change analysis is a burgeoning field of semantic analysis which aims to trace changes in the meanings of words over time. This paper presents an approach to lexical semantic change detection based on Bayesian word sense induction suitable for novel word sense identification. This approach is used for a submission to SemEval-2020 Task 1, which shows the approach to be capable of the SemEval task. The same approach is also applied to a corpus gleaned from 15 years of Twitter data, the results of which are then used to identify words which may be instances of slang.
In this paper, we describe our method for detection of lexical semantic change, i.e., word sense changes over time. We examine semantic differences between specific words in two corpora, chosen from different time periods, for English, German, Latin, and Swedish. Our method was created for the SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. We ranked 1st in Sub-task 1: binary change detection, and 4th in Sub-task 2: ranked change detection. We present our method which is completely unsupervised and language independent. It consists of preparing a semantic vector space for each corpus, earlier and later; computing a linear transformation between earlier and later spaces, using Canonical Correlation Analysis and orthogonal transformation; and measuring the cosines between the transformed vector for the target word from the earlier corpus and the vector for the target word in the later corpus.
Lexical entailment recognition plays an important role in tasks like Question Answering and Machine Translation. As important branches of lexical entailment, predicting multilingual and cross-lingual lexical entailment (LE) are two subtasks of SemEval2020 Task2. In previous monolingual LE studies, researchers leverage external linguistic constraints to transform word embeddings for LE relation. In our system, we expand the number of external constraints in multiple languages to obtain more specialised multilingual word embeddings. For the cross-lingual subtask, we apply a bilingual word embeddings mapping method in the model. The mapping method takes specialised embeddings as inputs and is able to retain the embeddings’ LE features after operations. Our results for multilingual subtask are about 20% and 10% higher than the baseline in graded and binary prediction respectively.
We investigate the hypothesis that translations can be used to identify cross-lingual lexical entailment. We propose novel methods that leverage parallel corpora, word embeddings, and multilingual lexical resources. Our results demonstrate that the implementation of these ideas leads to improvements in predicting entailment.
This paper describes the system we built for SemEval-2020 task 3, which involves predicting the similarity scores for a pair of words within two different contexts. Our system is based on both BERT embeddings and WordNet. We simply use cosine similarity to find the closest synset of the target words. Our results show that using this simple approach greatly improves the system behavior. Our model is ranked 3rd in subtask-2 for SemEval-2020 task 3.
This article describes some unsupervised strategies submitted to SemEval 2020 Task 3, a task which consists of considering the effect of context to compute word similarity. More precisely, given two words in context, the system must predict the degree of similarity of those words considering the context in which they occur, and the system score is compared against human prediction. We compare one approach based on pre-trained BERT models with another strategy relying on static word embeddings and syntactic dependencies. The BERT-based method clearly outperformed the dependency-based strategy.
Word similarity is widely used in machine learning applications such as search engines and recommendation. Measuring the changing meaning of the same word between two different sentences is not only a way to handle complex features in word usage (such as sentence syntax and semantics), but also an important method for modeling word polysemy. In this paper, we present the methodology proposed by team Ferryman. Our system is based on the Bidirectional Encoder Representations from Transformers (BERT) model combined with term frequency-inverse document frequency (TF-IDF), applying the method on the provided dataset called CoSimLex, which covers four different languages: English, Croatian, Slovene, and Finnish. Our team Ferryman won first position for the English task and second position for Finnish in subtask 1.
In this paper, we present our system for SemEval-2020 task 3, Predicting the (Graded) Effect of Context in Word Similarity. Due to the unsupervised nature of the task, we concentrated on inquiring about the similarity measures induced by different layers of different pre-trained Transformer-based language models, which can be good approximations of the human sense of word similarity. Interestingly, our experiments reveal a language-independent characteristic: the middle to upper layers of Transformer-based language models can induce good approximate similarity measures. Finally, our system was ranked 1st on the Slovenian part of Subtask1 and 2nd on the Croatian part of both Subtask1 and Subtask2.
There is a growing research interest in studying word similarity. Without a doubt, two words that are similar in one context may be considered different in another context. Therefore, this paper investigates the effect of the context in word similarity. The SemEval-2020 workshop has provided a shared task (Task 3: Predicting the (Graded) Effect of Context in Word Similarity). In this task, the organizers provided unlabeled datasets for four languages, English, Croatian, Finnish and Slovenian. Our team, JUSTMasters, has participated in this competition in the two subtasks: A and B. Our approach has used a weighted average ensembling method for different pretrained embeddings techniques for each of the four languages. Our proposed model outperformed the baseline models in both subtasks and achieved the best result for subtask 2 in English and Finnish, with scores of 0.725 and 0.68 respectively. We have been ranked sixth for subtask 1, with scores for English, Croatian, Finnish, and Slovenian as follows: 0.738, 0.44, 0.546, 0.512.
Natural Language Processing (NLP) has been widely used in semantic analysis in recent years. Our paper mainly discusses a methodology to analyze the effect that context has on human perception of similar words, which is the third task of SemEval 2020. We apply several methods in calculating the distance between two embedding vectors generated by Bidirectional Encoder Representations from Transformers (BERT). Our team won 1st place in the Finnish language track of subtask 1 and second place in the English track of subtask 1.
In this paper, we present SemEval-2020 Task 4, Commonsense Validation and Explanation (ComVE), which includes three subtasks, aiming to evaluate whether a system can distinguish a natural language statement that makes sense to humans from one that does not, and provide the reasons. Specifically, in our first subtask, the participating systems are required to distinguish, between two natural language statements of similar wording, the one that makes sense from the one that does not. The second subtask additionally asks a system to select the key reason from three options why a given statement does not make sense. In the third subtask, a participating system needs to generate the reason automatically. 39 teams submitted their valid systems to at least one subtask. For Subtask A and Subtask B, top-performing teams have achieved results close to human performance. However, for Subtask C, there is still a considerable gap between system and human performance. The dataset used in our task can be found at https://github.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation.
We present a counterfactual recognition (CR) task, the shared Task 5 of SemEval-2020. Counterfactuals describe potential outcomes (consequents) produced by actions or circumstances that did not happen or cannot happen and are counter to the facts (antecedent). Counterfactual thinking is an important characteristic of the human cognitive system; it connects antecedents and consequents with causal relations. Our task provides a benchmark for counterfactual recognition in natural language with two subtasks. Subtask-1 aims to determine whether a given sentence is a counterfactual statement or not. Subtask-2 requires the participating systems to extract the antecedent and consequent in a given counterfactual statement. During the SemEval-2020 official evaluation period, we received 27 submissions to Subtask-1 and 11 to Subtask-2. Our data and baseline code are made publicly available at https://zenodo.org/record/3932442. The task website and leaderboard can be found at https://competitions.codalab.org/competitions/21691.
Research on definition extraction has been conducted for well over a decade, largely with significant constraints on the type of definitions considered. In this work, we present DeftEval, a SemEval shared task in which participants must extract definitions from free text using a term-definition pair corpus that reflects the complex reality of definitions in natural language. Definitions and glosses in free text often appear without explicit indicators, across sentence boundaries, or in an otherwise complex linguistic manner. DeftEval involved 3 distinct subtasks: 1) sentence classification, 2) sequence labeling, and 3) relation extraction.
This paper introduces our systems for the first two subtasks of SemEval Task4: Commonsense Validation and Explanation. To clarify the intention for judgment and inject contrastive information for selection, we propose the input reconstruction strategy with prompt templates. Specifically, we formalize the subtasks into the multiple-choice question answering format and construct the input with the prompt templates, then, the final prediction of question answering is considered as the result of subtasks. Experimental results show that our approaches achieve significant performance compared with the baseline systems. Our approaches secure the third rank on both official test sets of the first two subtasks with an accuracy of 96.4 and an accuracy of 94.3 respectively.
We describe our system for Task 5 of SemEval 2020: Modelling Causal Reasoning in Language: Detecting Counterfactuals. Although deep learning has achieved significant success in many fields, it has yet to bring today’s AI close to strong AI, as it lacks causation, which is a fundamental concept in human thinking and reasoning. In this task, we focus on detecting causation, especially counterfactuals, in text. We explore multiple pre-trained models to learn basic features and then fine-tune the models with counterfactual data and pseudo-labeled data. Our team, HIT-SCIR, won first place in Sub-task 1 (Detecting Counterfactual Statements) and ranked 4th in Sub-task 2 (Detecting Antecedent and Consequence). In this paper, we provide a detailed description of the approach, as well as the results obtained in this task.
We describe the system submitted to SemEval-2020 Task 6, Subtask 1. The aim of this subtask is to predict whether a given sentence contains a definition or not. Unsurprisingly, we found that strong results can be achieved by fine-tuning a pre-trained BERT language model. In this paper, we analyze the performance of this strategy. Among other findings, we show that results can be improved by using a two-step fine-tuning process, in which the BERT model is first fine-tuned on the full training set and then further specialized towards a target domain.
In this paper, we describe our mUlti-task learNIng for cOmmonsense reasoNing (UNION) system submitted for Task C of SemEval2020 Task 4, which is to generate a reason explaining why a given false statement is nonsensical. We found in early experiments that simple adaptations such as fine-tuning GPT2 often yield dull and non-informative generations (e.g. simple negations). In order to generate more meaningful explanations, we propose UNION, a unified end-to-end framework that utilizes several existing commonsense datasets so that a model can learn more dynamics under the scope of commonsense reasoning. In order to perform model selection efficiently, accurately, and promptly, we also propose a couple of auxiliary automatic evaluation metrics so that we can extensively compare the models from different perspectives. Our submitted system not only performs well on the proposed metrics but also outperforms its competitors with the highest achieved score of 2.10 for human evaluation while maintaining a BLEU score of 15.7. Our code is made publicly available.
We participated in all three subtasks. In subtasks A and B, our submissions are based on pretrained language representation models (namely ALBERT) and data augmentation. We experimented with solving the task for another language, Czech, by means of multilingual models and a machine-translated dataset, or translated model inputs. We show that with a strong machine translation system, our system can be used in another language with a small accuracy loss. In subtask C, our submission, which is based on a pretrained sequence-to-sequence model (BART), ranked 1st in the BLEU score ranking; however, we show that the correlation between BLEU and human evaluation, in which our submission ended up 4th, is low. We analyse the metrics used in the evaluation and propose an additional score based on the model from subtask B, which correlates well with our manual ranking, as well as a reranking method based on the same principle. We performed an error and dataset analysis for all subtasks and we present our findings.
This paper describes our system submitted to Task 4 of SemEval 2020: Commonsense Validation and Explanation (ComVE), which consists of three sub-tasks. The task is to directly validate whether a given sentence makes sense and to require the model to explain it. Based on the BERT architecture with a multi-task setting, we propose an effective and interpretable “Explain, Reason and Predict” (ERP) system to solve the three commonsense sub-tasks: (a) Validation, (b) Reasoning, and (c) Explanation. Inspired by cognitive studies of common sense, our system first generates a reason or understanding of the sentences and then chooses which statement makes sense, which is achieved by multi-task learning. During the post-evaluation period, our system reached 92.9% accuracy in subtask A (rank 11), 89.7% accuracy in subtask B (rank 9), and a BLEU score of 12.9 in subtask C (rank 8).
This paper describes our system for SemEval-2020 Task 4: Commonsense Validation and Explanation (Wang et al., 2020). We propose a novel Knowledge-enhanced Graph Attention Network (KEGAT) architecture for this task, leveraging heterogeneous knowledge from both a structured knowledge base (i.e. ConceptNet) and unstructured text to improve the ability of a machine in commonsense understanding. This model has a powerful commonsense inference capability, obtained through suitable commonsense incorporation methods and upgraded data augmentation techniques. In addition, an internal sharing mechanism is incorporated to prevent the model from reasoning either insufficiently or excessively about commonsense. As a result, this model performs quite well in both validation and explanation. For instance, it achieves state-of-the-art accuracy in the subtask called Commonsense Explanation (Multi-Choice). We officially name the system ECNU-SenseMaker. Code is publicly available at https://github.com/ECNU-ICA/ECNU-SenseMaker.
This paper describes the masked reasoner system that participated in SemEval-2020 Task 4: Commonsense Validation and Explanation. The system participated in subtask B. We propose a novel method to fine-tune RoBERTa by masking the most important word in the statement. We believe that the confidence of the system in recovering that word is positively correlated with the score the masked language model gives to the current statement-explanation pair. We evaluate the importance of each word using InferSent and perform the masked fine-tuning on RoBERTa. We then use the fine-tuned model to predict the most plausible explanation. Our system is fast to train and achieved 73.5% accuracy.
The ability to validate and explain common sense is very important for most models. Most obviously, it directly affects the rationality of the generated model output. The large amount and diversity of commonsense knowledge pose great challenges to this task. In addition, many commonsense expressions are obscure, so we need to understand the meaning carried by the vocabulary in order to judge correctly, which further raises the requirements on the accuracy of the model’s word representations. Current neural network models are often data-driven, while annotated data is often limited and requires a lot of manual labeling. To handle this challenge, we propose transfer learning. From our experiments, we draw three main conclusions: a) Neural language models are fully qualified for commonsense validation and explanation. We attribute this to the powerful word and sentence representation capabilities of language models. b) Inconsistency between the pre-training and fine-tuning tasks badly hurts performance. c) A larger corpus and more parameters enhance the common sense of the model. At the same time, the content of the corpus is equally important.
We describe the system submitted by the SWAGex team to the SemEval-2020 Commonsense Validation and Explanation Task. We apply multiple methods to the pre-trained language model BERT (Devlin et al., 2018) for tasks that require the system to recognize sentences that go against commonsense and to justify the reasoning behind this decision. Our best performing model is BERT trained on SWAG and fine-tuned for the task. We investigate the ability to transfer commonsense knowledge from SWAG to SemEval-2020 by training a model for the Explanation task with Next Event Prediction data.
The SemEval Task 4 Commonsense Validation and Explanation Challenge validates whether a system can differentiate natural language statements that make sense from those that do not. This work focuses on two subtasks, A and B, i.e., detecting statements that go against common sense and selecting, from the given options, explanations of why they are false. Intuitively, commonsense validation requires additional knowledge beyond the given statements. Therefore, we propose a system utilising pre-trained sentence transformer models based on the BERT, RoBERTa and DistilBERT architectures to embed the statements before classification. According to the results, these embeddings improve the performance of typical MLP and LSTM classifiers as downstream models for both subtasks compared to regular tokenised statements. These embedded statements are shown to contain additional information from external resources, which helps validate common sense in natural language.
This paper describes BUT-FIT’s submission at SemEval-2020 Task 5: Modelling Causal Reasoning in Language: Detecting Counterfactuals. The challenge focused on detecting whether a given statement contains a counterfactual (Subtask 1) and extracting both antecedent and consequent parts of the counterfactual from the text (Subtask 2). We experimented with various state-of-the-art language representation models (LRMs). We found RoBERTa LRM to perform the best in both subtasks. We achieved the first place in both exact match and F1 for Subtask 2 and ranked second for Subtask 1.
We consider detection of the span of antecedents and consequents in argumentative prose a structural, grammatical task. Our system comprises a set of stacked Bi-LSTMs trained on two complementary linguistic annotations. We explore the effectiveness of grammatical features (POS and clause type) through ablation. The reported experiments suggest that a multi-task learning approach using this external, grammatical knowledge is useful for detecting the extent of antecedents and consequents and performs nearly as well without the use of word embeddings.
In this paper, we describe an approach for modelling causal reasoning in natural language by detecting counterfactuals in text using multi-head self-attention weights. We use pre-trained transformer models to extract contextual embeddings and self-attention weights from the text. We show the use of convolutional layers to extract task-specific features from these self-attention weights. Further, we describe a fine-tuning approach with a common base model for knowledge sharing between the two closely related sub-tasks for counterfactual detection. We analyze and compare the performance of various transformer models in our experiments. Finally, we perform a qualitative analysis with the multi-head self-attention weights to interpret our models’ dynamics.
This paper describes our efforts in tackling Task 5 of SemEval-2020. The task involved detecting a class of textual expressions known as counterfactuals and separating them into their constituent elements. Our final submitted approaches were an ensemble of various fine-tuned transformer-based and CNN-based models for the first subtask and a transformer model with dependency tree information for the second subtask. We ranked 4th and 9th on the overall leaderboard. We also explored various other approaches that involved classical methods, other neural architectures and the incorporation of different linguistic features.
In this paper, we explore strategies to detect and evaluate counterfactual sentences. We describe our system for SemEval-2020 Task 5: Modeling Causal Reasoning in Language: Detecting Counterfactuals. We use a BERT base model for the classification task and build a hybrid BERT Multi-Layer Perceptron system to handle the sequence identification task. Our experiments show that while introducing syntactic and semantic features does little in improving the system in the classification task, using these types of features as cascaded linear inputs to fine-tune the sequence-delimiting ability of the model ensures it outperforms other similar-purpose complex systems like BiLSTM-CRF in the second task. Our system achieves an F1 score of 85.00% in Task 1 and 83.90% in Task 2.
We describe our contribution to two of the subtasks of SemEval 2020 Task 6, DeftEval: Extracting term-definition pairs in free text. The system for Subtask 1: Sentence Classification is based on a transformer architecture where we use transfer learning to fine-tune a pretrained model on the downstream task, and the one for Subtask 3: Relation Classification uses a Random Forest classifier with handcrafted dedicated features. Our systems respectively achieve 0.830 and 0.994 F1-scores on the official test set, and we believe that the insights derived from our study are potentially relevant to help advance the research on definition extraction.
This paper describes our approach to the “DeftEval: Extracting Definitions from Free Text in Textbooks” competition held as a part of SemEval 2020. The task was devoted to finding and labeling definitions in texts. DeftEval was split into three subtasks: sentence classification, sequence labeling and relation classification. Our solution ranked 5th in the first subtask and 23rd and 21st in the second and third subtasks respectively. We applied simultaneous multi-task learning with Transformer-based models for subtasks 1 and 3 and a single BERT-based model for named entity recognition.
This paper describes our system that participated in SemEval-2020 Task 4: Commonsense Validation and Explanation. For this task, it is clear that external knowledge, such as a knowledge graph, can help the model understand commonsense in natural language statements. However, how to select the right triples for a statement remains unsolved, so reducing the interference of irrelevant triples with model performance is a research focus. This paper adopts a modified K-BERT as the language encoder to enhance language representation through triples from knowledge graphs. Experiments show that our method is better than models without external knowledge and slightly better than the original K-BERT. We obtained an accuracy of 0.97 in subtask A, ranking 1/45, and an accuracy of 0.948, ranking 2/35.
In this paper, we describe our system for Task 4 of SemEval 2020, which involves differentiating between natural language statements that conform to common sense and those that do not. The organizers propose three subtasks: first, selecting, between two sentences, the one that is against common sense; second, identifying the most crucial reason why a statement does not make sense; third, generating novel reasons that explain why the statement is against common sense. Of the three subtasks, this paper describes our systems for subtask A and subtask B. We propose a model based on the transformer neural network architecture to address the subtasks. The novelty of the work lies in the architecture design, which handles the logical implication of contradicting statements and simultaneous information extraction from both sentences. We use parallel instances of transformers, which are responsible for a boost in performance. We achieved an accuracy of 94.8% in subtask A and 89% in subtask B on the test set.
In this paper, we investigate a commonsense inference task that unifies natural language understanding and commonsense reasoning. We describe our attempt at the SemEval-2020 Task 4 competition: the Commonsense Validation and Explanation (ComVE) challenge. We discuss several state-of-the-art deep learning architectures for this challenge. Our system uses prepared labeled textual datasets that were manually curated for three different natural language inference subtasks. The goal of the first subtask is to test whether a model can distinguish between natural language statements that make sense and those that do not. We compare the performance of several language models and fine-tuned classifiers. We then propose a method, inspired by question answering tasks, that treats the classification problem as a multiple-choice question task and boosts our experimental results (96.06%), which is significantly better than the baseline. For the second subtask, which is to select the reason why a statement does not make sense, we rank among the first six teams (93.7%) out of 27 participants with very competitive results. Our result for the last subtask, generating a reason against the nonsensical statement, shows considerable potential for future research: we applied the most powerful generative language model (GPT-2) and obtained a BLEU score of 6.1732, placing among the first four teams.
Introducing common sense to natural language understanding systems has received increasing research attention. To facilitate research on commonsense reasoning, SemEval-2020 Task 4, Commonsense Validation and Explanation (ComVE), was proposed. We participate in sub-task A and try various methods, including traditional machine learning methods, deep learning methods, and recent pre-trained language models. Finally, we concatenate the original output of BERT with the output vector of BERT’s hidden layer state to obtain richer semantic features, and obtain competitive results. Our model achieves an accuracy of 0.8510 on the final test data and ranks 25th among all the teams.
This paper describes the results of our team HR@JUST’s participation in SemEval-2020 Task 4, Commonsense Validation and Explanation (ComVE), during the post-evaluation period. The task consists of three sub-tasks; we participate in sub-task A. We adopted a state-of-the-art approach to this task, using the RoBERTa model with no Next Sentence Prediction (NSP), dynamic masking, larger training data, and a larger batch size. Our results placed 11th on the final test set leaderboard with an accuracy of 91.3%.
This paper presents our contributions to the SemEval-2020 Task 4 Commonsense Validation and Explanation (ComVE) and includes the experimental results of the two Subtasks B and C of the SemEval-2020 Task 4. Our systems rely on pre-trained language models, i.e., BERT (including its variants) and UniLM, and rank 10th and 7th among 27 and 17 systems on Subtasks B and C, respectively. We analyze the commonsense ability of the existing pretrained language models by testing them on the SemEval-2020 Task 4 ComVE dataset, specifically for Subtasks B and C, the explanation subtasks with multi-choice and sentence generation, respectively.
In this paper, we describe our team’s (JUSTers) effort in the Commonsense Validation and Explanation (ComVE) task, which is part of SemEval2020. We evaluate five pre-trained Transformer-based language models with various sizes against the three proposed subtasks. For the first two subtasks, the best accuracy levels achieved by our models are 92.90% and 92.30%, respectively, placing our team in the 12th and 9th places, respectively. As for the last subtask, our models reach 16.10 BLEU score and 1.94 human evaluation score placing our team in the 5th and 3rd places according to these two metrics, respectively. The latter is only 0.16 away from the 1st place human evaluation score.
This paper presents our strategies in SemEval 2020 Task 4: Commonsense Validation and Explanation. We propose a novel way to search for evidence and choose different large-scale pre-trained models as the backbones for the three subtasks. The results show that our evidence-searching approach improves model performance on the commonsense explanation task. Our team ranks 2nd in subtask C according to human evaluation score.
Using natural language understanding systems for commonsense comprehension is receiving increasing attention from researchers. Current multi-purpose state-of-the-art models struggle on commonsense validation and explanation tasks. We adopt one of the state-of-the-art models and propose a method to boost its performance on commonsense-related tasks.
This article describes the system submitted to SemEval 2020 Task 4: Commonsense Validation and Explanation. We participated only in subtask A, which mainly involves distinguishing whether a sentence makes sense. To solve this task, we mainly used a maximum ensemble of ALBERT-based models with different training sizes and depths. To demonstrate the validity of the model for the task, we also used several other neural network models for comparison. Our model achieved an accuracy score of 0.938 (ranked 10/41) in subtask A.
This paper introduces our system for commonsense validation and explanation. For the Sen-Making task, we use a novel architecture based on a pre-trained language model to pick out which of the two given statements is against common sense. For the Explanation task, we use a hint sentence mechanism that greatly improves performance. In addition, we propose subtask-level transfer learning to share information between subtasks.
In this paper, we explore solutions to a common sense making task in which a model must discern which of two sentences is against common sense. We used a pre-trained language model to calculate perplexity scores for the input in order to discern which sentence contained an unlikely sequence of tokens. Other approaches we tested were word vector distances, which were used to find semantic outliers within a sentence, and a Siamese network. By using the pre-trained language model to calculate perplexity scores based on the sequence of tokens in the input sentences, we achieved an accuracy of 75 percent.
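The perplexity-based selection described above can be illustrated with a small sketch. To keep the example self-contained, it swaps in a toy add-one-smoothed bigram model for the pre-trained language model used in the paper; the class `BigramLM`, the function `against_common_sense`, and the tiny corpus in the usage note are all hypothetical. Only the decision rule is the point: the sentence with the higher perplexity is flagged as the one against common sense.

```python
import math
from collections import Counter
from typing import List


class BigramLM:
    """Add-one-smoothed bigram model, a stand-in for a pre-trained LM scorer."""

    def __init__(self, corpus: List[str]) -> None:
        self.unigrams: Counter = Counter()
        self.bigrams: Counter = Counter()
        for sentence in corpus:
            tokens = ["<s>"] + sentence.lower().split()
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams) + 1  # +1 reserves mass for unseen words

    def perplexity(self, sentence: str) -> float:
        """exp of the average negative log-probability per predicted token."""
        tokens = ["<s>"] + sentence.lower().split()
        log_prob = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.unigrams[prev] + self.vocab_size)
            log_prob += math.log(p)
        n_predicted = max(len(tokens) - 1, 1)
        return math.exp(-log_prob / n_predicted)


def against_common_sense(lm: BigramLM, sentence_a: str, sentence_b: str) -> int:
    """Return 0 or 1: the index of the sentence with the higher perplexity."""
    return 0 if lm.perplexity(sentence_a) > lm.perplexity(sentence_b) else 1
```

For example, after fitting `BigramLM` on a few sentences such as "he put the milk in the fridge", the call `against_common_sense(lm, "he put the milk in the fridge", "he put the elephant in the fridge")` selects the second sentence, because it contains a bigram the model has never seen.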
This paper presents the work of the NLP@JUST team at the SemEval-2020 Task 4 competition on commonsense validation and explanation (ComVE). The team participates in sub-task A (Validation), which checks whether a text is against common sense or not. Several models were trained (i.e., BERT, XLNet, and RoBERTa); however, the main models used are RoBERTa-large and BERT with whole word masking. We also combined the results of both models to generate the final prediction using the average ensemble technique, which improves overall performance. The evaluation shows that the implemented model achieved an accuracy of 93.9%, published on the post-evaluation leaderboard.
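An average ensemble of the kind mentioned above can be sketched in a few lines. This is an illustrative reading, assuming each model outputs a probability distribution over the same classes; the function names and the example probabilities are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def average_ensemble(model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average class probabilities across models for one example.

    model_probs[m][c] is the probability model m assigns to class c.
    """
    if not model_probs:
        raise ValueError("need at least one model's probabilities")
    n_classes = len(model_probs[0])
    if any(len(probs) != n_classes for probs in model_probs):
        raise ValueError("all models must score the same number of classes")
    n_models = len(model_probs)
    return [sum(probs[c] for probs in model_probs) / n_models for c in range(n_classes)]


def ensemble_predict(model_probs: Sequence[Sequence[float]]) -> int:
    """Return the class index with the highest averaged probability."""
    averaged = average_ensemble(model_probs)
    return max(range(len(averaged)), key=averaged.__getitem__)
```

With two models that output `[0.6, 0.4]` and `[0.2, 0.8]` for the same example, the averaged distribution favours the second class, so `ensemble_predict` returns index 1 even though the first model alone would have chosen index 0.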
Common sense validation deals with testing whether a system can differentiate natural language statements that make sense from those that do not make sense. This paper describes our approach to this challenge. For common sense validation with multiple choice, we propose a stacking-based approach to classify the sentences that are more favourable in terms of common sense to the particular statement. We used a majority voting classifier over three models: Bidirectional Encoder Representations from Transformers (BERT), Micro Text Classification (Micro TC) and XLNet. For sentence generation, we used a Neural Machine Translation (NMT) model to generate explanatory sentences.
In this paper, we present our submission for SemEval 2020 Task 4 - Commonsense Validation and Explanation (ComVE). The objective of this task was to develop a system that can differentiate statements that make sense from the ones that don’t. ComVE comprises three subtasks that challenge and test a system’s capability in understanding commonsense knowledge from various dimensions. Commonsense reasoning is a challenging task in the domain of natural language understanding, and systems augmented with it can improve performance in various other tasks such as reading comprehension and inference. We have developed a system that leverages commonsense knowledge from pretrained language models, such as RoBERTa and GPT2, trained on huge corpora. Our proposed system validates the reasonability of a given statement against the backdrop of commonsense knowledge acquired by these models and generates a logical reason to support its decision. Our system ranked 2nd with a BLEU score of 19.3 in subtask C, which is by far the most challenging subtask as it required systems to generate the rationale behind the choice of an unreasonable statement. In subtasks A and B, we achieved 96% and 94% accuracy respectively, placing 4th in both subtasks.
Common sense for natural language processing methods has recently been attracting wide research interest. Automatically estimating whether a sentence makes sense or not is considered an essential question. Task 4 of the International Workshop SemEval 2020 provided three subtasks (A, B, and C) that challenge the participants to build systems for distinguishing commonsense statements from those that do not make sense. This paper describes TeamJUST’s approach to subtask A, which differentiates between two sentences in English and classifies them into two classes: common sense and uncommon sense statements. Our approach relies on ensembling four different state-of-the-art pre-trained models (BERT, ALBERT, RoBERTa, and XLNet). Our baseline model, which uses only the pre-trained BERT model, scored 89.1, while the TeamJUST model outperformed the baseline with an accuracy score of 96.2. We improved the results in the post-evaluation period to achieve our best result, which would have ranked 4th in the competition had we been able to submit our latest experiment.
In this paper, we present our submission for subtask A of the Common Sense Validation and Explanation (ComVE) shared task. We examine the ability of large-scale pre-trained language models to distinguish commonsense from non-commonsense statements. We also explore the utility of external resources that aim to supplement the world knowledge inherent in such language models, including commonsense knowledge graph embedding models, word concreteness ratings, and text-to-image generation models. We find that such resources provide insignificant gains to the performance of fine-tuned language models. We also provide a qualitative analysis of the limitations of the language model fine-tuned to this task.
Commonsense Validation and Explanation has been a difficult task for machines since the dawn of computing. Although trivial for humans, it is highly complex for machines because it requires inference over a pre-existing knowledge base. To address this problem, SemEval 2020 Task 4, “Commonsense Validation and Explanation (ComVE)”, aims to evaluate systems capable of multiple stages of ComVE. The challenge includes 3 tasks (A, B and C), each with its own requirements. Our team participated only in task A, which required selecting the statement that made the least sense. We chose a bidirectional transformer to solve the challenge; this paper presents the details of our method, runs and results.
This paper describes our submissions to the ComVE challenge, SemEval 2020 Task 4. This evaluation task consists of three sub-tasks that test commonsense comprehension by identifying sentences that do not make sense and explaining why they do not. In subtask A, we use RoBERTa to find which sentence does not make sense. In subtask B, besides using BERT, we also experiment with replacing the dataset with MNLI when selecting, from the provided options, the best explanation of why the given sentence does not make sense. In subtask C, we use the MNLI model from subtask B to evaluate the explanations generated by RoBERTa and GPT-2 by exploiting the contradiction between the sentence and its explanation. Our submitted system achieves 88.2%, 80.5%, and a BLEU score of 5.5 on the three subtasks, respectively.
This paper describes our system for subtask A of SemEval 2020 Shared Task 4. We propose a reinforcement learning model based on Multi-Task Learning (MTL) to enhance the prediction ability of commonsense validation. The experimental results demonstrate that our system outperforms the single-task text classification model. We combine MTL with the pre-trained ALBERT model to achieve an accuracy of 0.904, and our model ranks 16th among the 45 teams on the final leaderboard of the competition.
This paper describes the system and results of our team’s participation in SemEval-2020 Task 4: Commonsense Validation and Explanation (ComVE), which aims to distinguish meaningful natural language statements from unreasonable ones. The task contains three subtasks: Subtask A: Validation, Subtask B: Explanation (Multi-Choice), and Subtask C: Explanation (Generation). We participated only in Subtask A, which aims to determine which of two given natural language statements with similar wording is meaningful. To solve this problem, we propose a method that combines BERT with a Bidirectional Gated Recurrent Unit (Bi-GRU). Our model achieved an accuracy of 0.836 in Subtask A (ranked 27/45).
Counterfactuals describe events counter to facts and hence naturally involve common sense, knowledge, and reasoning. SemEval 2020 Task 5 focuses on this field. We participate in Subtask 1 and use BERT as our system. Our innovations are feature extraction and data augmentation. We extract and summarize features of counterfactual statements, augment the counterfactual examples in the training set with the help of these features, and experiment with two general methods of data augmentation. We demonstrate the effectiveness of our approach, which achieves an F1 of 0.95 on Subtask 1 while using only a subset of the given training set to fine-tune the BERT model; our official submission achieves an F1 of 0.802, which ranks us 16th in the competition.
We participate in the classification subtask of SemEval-2020 Task 5 (Detecting Counterfactuals), Subtask 1: Detecting counterfactual statements. This paper examines different approaches and models for classifying counterfactual statements. We choose the BERT model. However, the output of BERT is not a good summary of semantic information, so in order to obtain richer semantic features, we modify the upper-layer structure of BERT. Finally, our system achieves an accuracy of 88.90% and an F1 score of 86.30% by hard voting, which ranks 6th on the final leaderboard of Subtask 1.
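Hard voting, as named above, takes the label that most models predict for each example. The sketch below is a generic illustration rather than the authors' code; the function name `hard_vote` and the tie-breaking rule (the smallest label among the most frequent ones) are assumptions, since the abstract does not say how ties are resolved.

```python
from collections import Counter
from typing import List, Sequence


def hard_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-model label predictions by majority vote.

    predictions[m][i] is the label model m predicts for example i.
    Ties are broken by choosing the smallest label (an assumption).
    """
    voted: List[int] = []
    for labels in zip(*predictions):  # one tuple of votes per example
        counts = Counter(labels)
        top = max(counts.values())
        voted.append(min(label for label, count in counts.items() if count == top))
    return voted
```

With three models predicting `[1, 0, 1]`, `[1, 1, 0]` and `[0, 1, 1]` for three sentences, every sentence receives two votes for label 1, so the ensemble labels all three as counterfactual. Using an odd number of models avoids ties in the binary case.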
I present ETHAN: Experimental Testing of Hybrid AI Node, implemented entirely on free cloud computing infrastructure. The ultimate goal of this research is to create a modular, reusable hybrid neuro-symbolic architecture for Artificial Intelligence. As a test case, I model natural language comprehension of causal relations in an open-domain text corpus, combining a semi-supervised language model (Huggingface Transformers) with constituency and dependency parsers (Allen Institute for Artificial Intelligence).
The main purpose of this article is to report the effect of using different methods and models for counterfactual determination and detection of causal knowledge. Counterfactual reasoning is now widely used in various fields. In the realm of natural language processing (NLP), counterfactual reasoning has huge potential to improve the correctness of a sentence. In the shared Task 5 of SemEval 2020, detecting counterfactuals, we pre-process the officially provided dataset with case conversion, stemming, and abbreviation replacement. We use the last-5 bidirectional encoder representations from transformers (BERT) and a term frequency–inverse document frequency (TF-IDF) vectorizer for counterfactual detection. Meanwhile, multi-sample dropout and cross-validation are used to improve versatility and prevent problems such as poor generalization caused by overfitting. Finally, our team Ferryman ranked 8th in sub-task 1 of this competition.
ISCAS participated in two subtasks of SemEval 2020 Task 5: detecting counterfactual statements and detecting antecedent and consequence. This paper describes our system which is based on pretrained transformers. For the first subtask, we train several transformer-based classifiers for detecting counterfactual statements. For the second subtask, we formulate antecedent and consequence extraction as a query-based question answering problem. The two subsystems both achieved third place in the evaluation. Our system is openly released at https://github.com/casnlu/ISCASSemEval2020Task5.
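The query-based question answering formulation for the second subtask can be sketched as a data conversion step. This is only an illustration of the idea: the query wording, the `QAExample` class and the function `to_qa_examples` are hypothetical, and locating the answer with `str.find` is a simplification that assumes the gold span text occurs verbatim in the sentence.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical query wording; the abstract does not specify the actual queries.
QUERIES = {
    "antecedent": "What is the antecedent of the counterfactual?",
    "consequent": "What is the consequent of the counterfactual?",
}


@dataclass
class QAExample:
    query: str
    context: str
    answer_span: Optional[Tuple[int, int]]  # character offsets, end exclusive


def to_qa_examples(sentence: str, antecedent: Optional[str],
                   consequent: Optional[str]) -> List[QAExample]:
    """Turn one annotated sentence into one QA example per role."""
    examples: List[QAExample] = []
    for role, text in (("antecedent", antecedent), ("consequent", consequent)):
        span = None
        if text:
            start = sentence.find(text)
            if start >= 0:
                span = (start, start + len(text))
        # A missing role (span is None) becomes an unanswerable question.
        examples.append(QAExample(QUERIES[role], sentence, span))
    return examples
```

Each resulting `(query, context)` pair can then be passed to an extractive question answering model, which predicts the start and end of the answer span inside the sentence.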
This article describes the system submitted to SemEval 2020 Task 5: Modelling Causal Reasoning in Language: Detecting Counterfactuals. We participate only in subtask A, which is detecting counterfactual statements. To solve this subtask, we first address the data imbalance problem by processing the dataset with undersampling and oversampling methods. Second, we use the ALBERT model and the maximum ensemble method based on the ALBERT model. Our methods achieved an F1 score of 0.85 in subtask A.
In this article, we try to solve the problem of classifying counterfactual statements and extracting antecedents/consequences from raw data by using, on the one hand, Support Vector Machines (SVMs) and, on the other hand, Natural Language Understanding (NLU) infrastructures available on the market for conversational agents. Our experiments allowed us to test different pipelines of two known platforms (Snips NLU and Rasa NLU). The results show that a Rasa NLU pipeline, built with a well-preprocessed dataset and tuned algorithms, can accurately model the structure of a counterfactual event and thereby facilitate the identification and extraction of its components.
This paper presents the deep-learning model that was submitted to the SemEval-2020 Task 5 competition: “Detecting Counterfactuals”. We participated in both Subtask1 and Subtask2. The model proposed in this paper ranked 2nd in Subtask2 “Detecting antecedent and consequence”. Our model approaches the task as sequence labeling. The architecture is built on top of BERT, and a multi-head attention layer with label masking is used to benefit from the mutual information between nearby labels. For prediction, a multi-stage algorithm is used in which the model finalizes some predictions with higher certainty in each step and uses them in the following steps. Our results show that masking the labels is not only an efficient regularization method but also improves the accuracy of the model compared with alternatives like CRF. Label masking can be used as a regularization method in sequence labeling. It also improves the performance of the model by learning the specific patterns in the target variable.
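The resampling step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes plain random over- and undersampling; the abstract does not say which variants were used, and the function names `random_oversample` and `random_undersample` are hypothetical.

```python
import random

def _group_by_class(samples, labels):
    """Collect samples into a dict keyed by class label."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen items until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(items) for items in by_class.values())
    out = []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out.extend((x, y) for x in items + extra)
    rng.shuffle(out)
    return out

def random_undersample(samples, labels, seed=0):
    """Randomly drop items until every class matches the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(items) for items in by_class.values())
    out = []
    for y, items in by_class.items():
        out.extend((x, y) for x in rng.sample(items, target))
    rng.shuffle(out)
    return out
```

Oversampling keeps all original examples at the cost of duplicates, while undersampling discards majority-class data. Which trade-off suits a given dataset is an empirical question.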
This paper describes the system and results of our team’s participation in SemEval-2020 Task5: Modelling Causal Reasoning in Language: Detecting Counterfactuals, which aims to simulate counterfactual semantics and reasoning in natural language. This task contains two subtasks: Subtask1–Detecting counterfactual statements and Subtask2–Detecting antecedent and consequence. We only participated in Subtask1, aiming to determine whether a given sentence is counterfactual. In order to solve this task, we proposed a system based on Ordered Neurons LSTM (ON-LSTM) with Hierarchical Attention Network (HAN) and used Pooling operation for dimensionality reduction. Finally, we used the K-fold approach as the ensemble method. Our model achieved an F1 score of 0.7040 in Subtask1 (Ranked 16/27).
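A K-fold ensemble of the kind mentioned above can be sketched as follows. The abstract does not state how the fold models are combined, so averaging their predicted probabilities is an assumption here, and both helper names are hypothetical.

```python
def kfold_indices(n_items, k):
    """Split range(n_items) into k contiguous validation folds of near-equal size."""
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def average_fold_predictions(fold_probs):
    """Average per-example probabilities produced by the k fold models.

    fold_probs holds one list of probabilities per fold model, all aligned
    to the same test examples.
    """
    k = len(fold_probs)
    return [sum(column) / k for column in zip(*fold_probs)]
```

Each fold model is trained on the data outside its validation fold, then every fold model scores the same test set and the scores are averaged.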
Definition extraction is an important task in Natural Language Processing; it is used to identify terms and the definitions related to them. The task consists of a sentence classification task (i.e., classifying whether a sentence contains a definition) and a sequence labeling task (i.e., finding the boundaries of terms and definitions). The paper describes our system BERTatDE1 for the sentence classification task (subtask 1) and the sequence labeling task (subtask 2) in definition extraction (SemEval-2020 Task 6). We use BERT to address the multi-domain problems, including the uncertainty of term boundaries, that is, different areas have different ways of defining domain-related terms. We use BERT, BiLSTM and attention in subtask 1; our best result achieved 79.71% F1 and eighteenth place in subtask 1. For subtask 2, we use BERT, BiLSTM and CRF for sequence labeling, and achieve 40.73% macro-averaged F1.
This paper describes our participation in DeftEval 2020 (part of the SemEval shared task competition) and focuses on sentence classification. Our approach to the task was to create an ensemble of several RNNs combined with fastText and ELMo embeddings. Results show that combining various types of models in an ensemble gives a performance boost in comparison to standard models. Our model achieved an F1-score of 78% for the positive class on the DeftEval dataset.
Definition Extraction systems are a valuable knowledge source for both humans and algorithms. In this paper we describe our submissions to the DeftEval shared task (SemEval-2020 Task 6), which is evaluated on an English textbook corpus. We provide a detailed explanation of our system for the joint extraction of definition concepts and the relations among them. Furthermore we provide an ablation study of our model variations and describe the results of an error analysis.
We explore the performance of Bidirectional Encoder Representations from Transformers (BERT) at definition extraction. We further propose a joint model of BERT and a Text Level Graph Convolutional Network so as to incorporate dependencies into the model. Our proposed model produces better results than BERT and achieves results comparable to BERT with a fine-tuned language model in DeftEval (Task 6 of SemEval 2020), a shared task of classifying whether a sentence contains a definition or not (Subtask 1).
This paper presents the RGCL team submission to SemEval 2020 Task 6: DeftEval, subtasks 1 and 2. The system classifies definitions at the sentence and token levels. It utilises state-of-the-art neural network architectures, which have some task-specific adaptations, including an automatically extended training set. Overall, the approach achieves acceptable evaluation scores, while maintaining flexibility in architecture selection.
We describe our system (TüKaPo) submitted for Task 6: DeftEval, at SemEval 2020. We developed a hybrid approach that combined existing CNN and RNN methods and investigated the impact of purely syntactic and semantic features on the task of definition extraction. Our final model achieved an F1-score of 0.6851 in subtask 1, i.e., sentence classification.
Definition Extraction is the task of automatically extracting terms and their definitions from text. In recent years, it has attracted wide interest from NLP researchers. This paper describes the unixlong team’s system for SemEval 2020 Task 6: DeftEval: Extracting term-definition pairs in free text. The goal of this task is to extract definitions, word-level BIO tags and relations. This task is challenging due to the free style of the text; in particular, the definitions of terms can span several sentences and lack explicit verb phrases. We propose a joint model that trains the tasks of definition extraction and word-level BIO tagging simultaneously. We design a novel input format for BERT to capture the location information between an entity and its definition. We then adjust the output of BERT with some rules. Finally, we apply TAG_ID, ROOT_ID and the BIO tag to predict the relation and achieve a macro-averaged F1 score of 1.0, which ranks first on the official test set in the relation extraction subtask.
This work presents our contribution in the context of the 6th task of SemEval-2020: Extracting Definitions from Free Text in Textbooks (DeftEval). This competition consists of three subtasks with different levels of granularity: (1) classification of sentences as definitional or non-definitional, (2) labeling of definitional sentences, and (3) relation classification. We use various pretrained language models (i.e., BERT, XLNet, RoBERTa, SciBERT, and ALBERT) to solve each of the three subtasks of the competition. Specifically, for each language model variant, we experiment by both freezing its weights and fine-tuning them. We also explore a multi-task architecture that was trained to jointly predict the outputs for the second and the third subtasks. Our best performing model evaluated on the DeftEval dataset obtains the 32nd place for the first subtask and the 37th place for the second subtask. The code is available for further research at: https://github.com/avramandrei/DeftEval
This paper describes the SemEval-2020 shared task “Assessing Humor in Edited News Headlines.” The task’s dataset contains news headlines in which short edits were applied to make them funny, and the funniness of these edited headlines was rated using crowdsourcing. This task includes two subtasks, the first of which is to estimate the funniness of headlines on a humor scale in the interval 0-3. The second subtask is to predict, for a pair of edited versions of the same original headline, which is the funnier version. To date, this task is the most popular shared computational humor task, attracting 48 teams for the first subtask and 31 teams for the second.
Information on social media comprises various modalities such as textual, visual and audio. The NLP and Computer Vision communities often leverage only one prominent modality in isolation to study social media. However, computational processing of Internet memes needs a hybrid approach. The growing ubiquity of Internet memes on social media platforms such as Facebook, Instagram, and Twitter further suggests that we cannot ignore such multimodal content anymore. To the best of our knowledge, meme emotion analysis has received little attention. The objective of this proposal is to bring the attention of the research community towards the automatic processing of Internet memes. The Memotion analysis task released approximately 10K annotated memes with human-annotated labels, namely sentiment (positive, negative, neutral), type of emotion (sarcastic, funny, offensive, motivation) and their corresponding intensity. The challenge consisted of three subtasks: sentiment (positive, negative, and neutral) analysis of memes, overall emotion (humor, sarcasm, offensive, and motivational) classification of memes, and classifying the intensity of meme emotion. The best performances achieved were F1 (macro average) scores of 0.35, 0.51 and 0.32, respectively, for the three subtasks.
In this paper, we present the results of the SemEval-2020 Task 9 on Sentiment Analysis of Code-Mixed Tweets (SentiMix 2020). We also release and describe our Hinglish (Hindi-English) and Spanglish (Spanish-English) corpora annotated with word-level language identification and sentence-level sentiment labels. These corpora comprise 20K and 19K examples, respectively. The sentiment labels are Positive, Negative, and Neutral. SentiMix attracted 89 submissions in total, including 61 teams that participated in the Hinglish contest and 28 that submitted systems to the Spanglish competition. The best performance achieved was 75.0% F1 score for Hinglish and 80.6% F1 for Spanglish. We observe that BERT-like models and ensemble methods are the most common and successful approaches among the participants.
This paper describes the winning system for SemEval-2020 task 7: Assessing Humor in Edited News Headlines. Our strategy is Stacking at Scale (SaS) with heterogeneous pre-trained language models (PLMs) such as BERT and GPT-2. SaS first fine-tunes a number of PLMs with various hyperparameters and then applies a powerful stacking ensemble on top of the fine-tuned PLMs. Our experimental results show that SaS outperforms a naive average ensemble, leveraging weaker PLMs as well as high-performing PLMs. Interestingly, the results show that SaS captured non-funny semantics. Consequently, the system was ranked 1st in all subtasks by significant margins compared with other systems.
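The macro-averaged F1 score reported above is the unweighted mean of per-class F1 scores. The following is a minimal sketch of that computation, not the task's official scorer, which may handle labels or edge cases differently; the function name `macro_f1` is hypothetical.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class contributes equally regardless of its frequency, macro F1 penalizes systems that ignore rare classes, which matters for imbalanced label sets.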
This paper presents our submission to task 8 (memotion analysis) of the SemEval 2020 competition. We explain the algorithms that were used to learn our models along with the process of tuning the algorithms and selecting the best model. Since meme analysis is a challenging task with two distinct modalities, we studied the impact of different multimodal representation strategies. The results of several approaches to dealing with multimodal data are therefore discussed in the paper. We found that alignment-based strategies did not perform well on memes. Our quantitative results also showed that images and text were uncorrelated. Fusion-based strategies did not show significant improvements and using one modality only (text or image) tends to lead to better results when applied with the predictive models that we used in our research.
Code switching is a linguistic phenomenon which may occur within a multilingual setting where speakers share more than one language. With increasing communication between groups with different languages, this phenomenon is becoming more and more common. However, there is little research and data in this area, especially in code-mixing sentiment classification. In this work, domain transfer learning from the state-of-the-art uni-language model ERNIE is tested on the code-mixing dataset, and surprisingly, a strong baseline is achieved. Furthermore, adversarial training with a multi-lingual model is used to achieve 1st place in the SemEval-2020 Task 9 Hindi-English sentiment classification competition.
This paper describes a system that aims at assessing humour intensity in edited news headlines as part of the 7th task of SemEval-2020 on “Humor, Emphasis and Sentiment”. Various factors need to be accounted for in order to assess the funniness of an edited headline. We propose an architecture that uses hand-crafted features, knowledge bases and a language model to understand humour, and combines them in a regression model. Our system outperforms two baselines. In general, automatic humour assessment remains a difficult task.
This paper describes our system submission Hasyarasa for SemEval-2020 Task 7: Assessing Humor in Edited News Headlines. This task has two subtasks. The goal of Subtask 1 is to predict the mean funniness of the edited headline given the original and the edited headline. In Subtask 2, given two edits of the original headline, the goal is to predict the funnier of the two. We observed that the departure from the expected state/actions of situations/individuals is the cause of humor in the edited headlines. We propose two novel features, Contextual Semantic Distance and Contextual Neighborhood Distance, to estimate this departure and thus capture the contextual absurdity, and hence the humor, in the edited headlines. We used these features together with a Bi-LSTM attention-based model and achieved 0.53310 RMSE for Subtask 1 and 60.19% accuracy for Subtask 2.
This paper describes our system designed for humor evaluation within SemEval-2020 Task 7. The system is based on a convolutional neural network architecture. We evaluate the system on the official dataset, and we provide more insight into the model itself to see what the learned inner features look like.
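The two evaluation measures quoted above, RMSE for the regression subtask and accuracy for the pairwise subtask, can be sketched as follows. This illustrates the standard definitions and is not the task's official scoring script; the function names are hypothetical.

```python
import math

def rmse(gold, pred):
    """Root mean squared error between gold and predicted funniness scores."""
    squared = [(g - p) ** 2 for g, p in zip(gold, pred)]
    return math.sqrt(sum(squared) / len(squared))

def accuracy(gold, pred):
    """Fraction of headline pairs for which the predicted funnier edit is correct."""
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)
```

Lower RMSE is better, whereas higher accuracy is better, so the two subtask rankings sort in opposite directions.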
This paper describes our contribution to SemEval-2020 Task 7: Assessing Humor in Edited News Headlines. Here we present a method based on a deep neural network. In recent years, quite some attention has been devoted to humor production and perception. Our team KDEhumor employs recurrent neural network models including Bi-Directional LSTMs (BiLSTMs). Moreover, we utilize the state-of-the-art pre-trained sentence embedding techniques. We analyze the performance of our method and demonstrate the contribution of each component of our architecture.
In this paper, we assess the ability of BERT and its derivative models (RoBERTa, DistilBERT, and ALBERT) for short-edits based humor grading. We test these models for humor grading and classification tasks on the Humicroedit and the FunLines dataset. We perform extensive experiments with these models to test their language modeling and generalization abilities via zero-shot inference and cross-dataset inference based approaches. Further, we also inspect the role of self-attention layers in humor-grading by performing a qualitative analysis over the self-attention weights from the final layer of the trained BERT model. Our experiments show that all the pre-trained BERT derivative models show significant generalization capabilities for humor-grading related tasks.
The task of assessing the funniness of edited news headlines deals with estimating the humor in headlines edited with micro-edits. The task has two sub-tasks: one requires predicting the mean humor score, and the other requires predicting the funnier of two given sentences. We calculated the humor level using microtc and predicted the funnier sentence using microtc, a universal sentence encoder classifier, several traditional classifiers that use vectors formed with universal sentence encoder embeddings and sentence embeddings, and a majority algorithm over these approaches. Among these approaches, microtc with 6 folds and 24 processes and microtc with 3 folds and 36 processes achieve the lowest Root Mean Square Error on the development and test sets, respectively, for subtask 1. For subtask 2, the universal sentence encoder classifier achieves the highest accuracy on the development set, and a Multi-Layer Perceptron applied to vectors produced with universal sentence encoder embeddings achieves the highest accuracy on the test set.
This paper describes an ensemble model designed for SemEval-2020 Task 7. The task is based on the Humicroedit dataset, which comprises news titles and one-word substitutions designed to make them humorous. We use BERT, FastText, ELMo, and Word2Vec to encode these titles and then pass them to a bidirectional gated recurrent unit (BiGRU) with attention. Finally, we use XGBoost on the concatenation of the results of the different models to make predictions.
Memes have become a ubiquitous social media entity, and the processing and analysis of such multimodal data is currently an active area of research. This paper presents our work on the Memotion Analysis shared task of SemEval 2020, which involves the sentiment and humor analysis of memes. We propose a system which uses different bimodal fusion techniques to leverage the inter-modal dependency for sentiment and humor classification tasks. Out of all our experiments, the best system improved on the baseline with macro F1 scores of 0.357 on Sentiment Classification (Task A), 0.510 on Humor Classification (Task B) and 0.312 on Scales of Semantic Classes (Task C).
In this paper, we present a multimodal architecture to determine the emotion expressed in a meme. This architecture utilizes both textual and visual information present in a meme. To extract image features we experimented with pre-trained VGG-16 and Inception-V3 classifiers and to extract text features we used LSTM and BERT classifiers. Both FastText and GloVe embeddings were experimented with for the LSTM classifier. The best F1 scores our classifier obtained on the official analysis results are 0.3309, 0.4752, and 0.2897 for Task A, B, and C respectively in the Memotion Analysis task (Task 8) organized as part of International Workshop on Semantic Evaluation 2020 (SemEval 2020). In our study, we found that combining both textual and visual information expressed in a meme improves the performance of the classifier as opposed to using standalone classifiers that use only text or visual data.
We propose hybrid models (HybridE and HybridW) for meme analysis (SemEval 2020 Task 8), which involves sentiment classification (Subtask A), humor classification (Subtask B), and scale of semantic classes (Subtask C). The hybrid model consists of BLSTM and CNN for text and image processing respectively. HybridE provides equal weight to BLSTM and CNN performance, while HybridW provides weightage based on the performance of BLSTM and CNN on a validation set. The performances (macro F1) of our hybrid model on Subtask A are 0.329 (HybridE), 0.328 (HybridW), on Subtask B are 0.507 (HybridE), 0.512 (HybridW), and on Subtask C are 0.309 (HybridE), 0.311 (HybridW).
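The difference between the two weighting schemes described above can be illustrated with a small late-fusion sketch. The abstract does not give the exact fusion formula, so combining class probabilities with weights proportional to validation scores is an assumption, and all function names are hypothetical.

```python
def validation_weights(text_score, image_score):
    """Weights proportional to each model's validation score (HybridW-style, assumed)."""
    total = text_score + image_score
    return text_score / total, image_score / total

def fuse(text_probs, image_probs, w_text=0.5, w_image=0.5):
    """Weighted sum of class probabilities; the defaults give equal weights (HybridE-style)."""
    return [w_text * t + w_image * i for t, i in zip(text_probs, image_probs)]

def argmax(values):
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(values)), key=values.__getitem__)
```

With equal weights, a confident image model can outvote the text model. With validation-based weights, the model that scored better on held-out data has more influence on the final class.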
This paper describes our contribution to SemEval 2020 Task 8: Memotion Analysis. Our system learns multi-modal embeddings from text and images in order to classify Internet memes by sentiment. Our model learns text embeddings using BERT and extracts features from images with DenseNet, subsequently combining both features through concatenation. We also compare our results with those produced by DenseNet, ResNet, BERT, and BERT-ResNet. Our results show that image classification models have the potential to help classify memes, with DenseNet outperforming ResNet. However, adding text features is not always helpful for Memotion Analysis.
This paper describes the system submitted by the PRHLT-UPV team for the task 8 of SemEval-2020: Memotion Analysis. We propose a multimodal model that combines pretrained models of the BERT and VGG architectures. The BERT model is used to process the textual information and VGG the images. The multimodal model is used to classify memes according to the presence of offensive, sarcastic, humorous and motivating content. Also, a sentiment analysis of memes is carried out with the proposed model. In the experiments, the model is compared with other approaches to analyze the relevance of the multimodal model. The results show encouraging performances on the final leaderboard of the competition, reaching good positions in the ranking of systems.
This paper proposes a parallel-channel model to process the textual and visual information in memes and then analyze the sentiment polarity of memes. In the shared task of identifying and categorizing memes, we preprocess the dataset according to the language behaviors on social media. Then, we adapt and fine-tune the Bidirectional Encoder Representations from Transformers (BERT), and use two types of convolutional neural network models (CNNs) to extract the features from the pictures. We applied an ensemble model that combined the BiLSTM, BiGRU, and Attention models to perform cross domain suggestion mining. The officially released results show that our system performs better than the baseline algorithm.
The growing popularity and applications of sentiment analysis of social media posts have naturally led to sentiment analysis of posts written in multiple languages, a practice known as code-switching. While recent research into code-switched posts has focused on the use of multilingual word embeddings, these embeddings were not trained on code-switched data. In this work, we present word embeddings trained on code-switched tweets, specifically those that make use of Spanish and English, known as Spanglish. We explore the embedding space to discover how the embeddings capture the meanings of words in both languages. We test the effectiveness of these embeddings by participating in SemEval 2020 Task 9: Sentiment Analysis on Code-Mixed Social Media Text. We used them to train a sentiment classifier that achieves an F1 score of 0.722. This is higher than the competition baseline of 0.656, with our team (codalab username francesita) ranking 14th out of 29 participating teams.
The “Sentiment Analysis for Code-Mixed Social Media Text” task at the SemEval 2020 competition focuses on sentiment analysis in code-mixed social media text, specifically on the combination of English with Spanish (Spanglish) and Hindi (Hinglish). In this paper, we present a system able to classify tweets written in Spanish and English as positive, negative or neutral. Firstly, we built a classifier able to provide the corresponding sentiment labels. Besides the sentiment labels, we provide language labels at the word level. Secondly, we generate a word-level representation using a Convolutional Neural Network (CNN) architecture. Our solution shows promising results for the Sentimix Spanglish-English task (0.744), where our team, Lavinia_Ap, took 9th place. However, for the Sentimix Hindi-English task (0.324), the results need to be improved.
Sentiment analysis for code-mixed social media text continues to be an under-explored area. This work adds two common approaches: fine-tuning large transformer models and sample efficient methods like ULMFiT. Prior work demonstrates the efficacy of classical ML methods for polarity detection. Fine-tuned general-purpose language representation models, such as those of the BERT family are benchmarked along with classical machine learning and ensemble methods. We show that NB-SVM beats RoBERTa by 6.2% (relative) F1. The best performing model is a majority-vote ensemble which achieves an F1 of 0.707. The leaderboard submission was made under the codalab username nirantk, with F1 of 0.689.
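A majority-vote ensemble such as the one mentioned above can be sketched in a few lines. This is an illustration rather than the submitted system; the function name `majority_vote` is hypothetical, and breaking ties in favor of the earliest-listed model is an assumption.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Return, for each example, the label predicted by most models.

    predictions_per_model holds one list of labels per model, aligned by example.
    Ties go to the label that appears first among the models' votes.
    """
    voted = []
    for votes in zip(*predictions_per_model):
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted
```

Using an odd number of models reduces the chance of ties in binary settings, although ties remain possible with three sentiment classes.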
It is fairly common to use code-mixing on a social media platform to express opinions and emotions in multilingual societies. The purpose of this task is to detect the sentiment of code-mixed social media text. Code-mixed text poses a great challenge for the traditional NLP system, which currently uses monolingual resources to deal with the problem of multilingual mixing. This task has been solved in the past using lexicon lookup in respective sentiment dictionaries and using a long short-term memory (LSTM) neural network for monolingual resources. In this paper, we present a system that uses a bilingual vector gating mechanism for bilingual resources to complete the task. The model consists of two main parts: the vector gating mechanism, which combines the character and word levels, and the attention mechanism, which extracts the important emotional parts of the text. The results show that the proposed system outperforms the baseline algorithm. We achieved fifth place in Spanglish and 19th place in Hinglish.
In this paper, we present the results that the team IIITG-ADBU (codalab username ‘abaruah’) obtained in the SentiMix task (Task 9) of the International Workshop on Semantic Evaluation 2020 (SemEval 2020). This task required the detection of sentiment in code-mixed Hindi-English tweets. Broadly, we performed two sets of experiments for this task. The first experiment was performed using the multilingual BERT classifier and the second set of experiments was performed using SVM classifiers. The character-based SVM classifier obtained the best F1 score of 0.678 in the test set with a rank of 21 among 62 participants. The performance of the multilingual BERT classifier was quite comparable with the SVM classifier on the development set. However, on the test set it obtained an F1 score of 0.342.
In this paper, we present our system for the SemEval 2020 task on code-mixed sentiment analysis. Our system makes use of large transformer-based multilingual embeddings like mBERT. Recent work has shown that these models possess the ability to solve code-mixed tasks in addition to their originally demonstrated cross-lingual abilities. We evaluate the stock versions of these models for the sentiment analysis task and also show that their performance can be improved by using unlabelled code-mixed data. Our submission (username Genius1237) achieved the second rank on the English-Hindi subtask with an F1 score of 0.726.
Code-switching is a phenomenon in which two or more languages are used in the same message. Nowadays, it is quite common to find messages with mixed languages in social media. This phenomenon presents a challenge for sentiment analysis. In this paper, we use a standard convolutional neural network model to predict the sentiment of tweets written in a blend of Spanish and English. Our simple approach achieved an F1-score of 0.71 on the test set of the competition. We analyze the capabilities of our best model and perform an error analysis to expose important difficulties in classifying sentiment in a code-switching setting.
We present a transfer learning system to perform a mixed Spanish-English sentiment classification task. Our proposal uses the state-of-the-art language model BERT and embeds it within a ULMFiT transfer learning pipeline. This combination allows us to predict the polarity of code-mixed (English-Spanish) tweets. Among 29 submitted systems, our approach (referred to as dplominop) is ranked 4th on the Sentimix Spanglish test set of SemEval 2020 Task 9. Our system yields a weighted-F1 score of 0.755, which can be easily reproduced: the source code and implementation details are made available.
Code mixing is a common phenomenon in multilingual societies where people switch from one language to another for various reasons. Recent advances in public communication over different social media sites have led to an increase in the frequency of code-mixed usage in written language. In this paper, we present the Generative Morphemes with Attention (GenMA) Model sentiment analysis system contributed to SemEval 2020 Task 9 SentiMix. The system aims to predict the sentiments of the given English-Hindi code-mixed tweets without using word-level language tags, instead inferring this automatically using a morphological model. The system is based on a novel deep neural network (DNN) architecture, which has outperformed the baseline F1-score on the test dataset as well as the validation dataset. Our results can be found under the user name “koustava” on the “Sentimix Hindi English” page.
In this paper, we present an approach for sentiment analysis of code-mixed language on Twitter, as defined in SemEval-2020 Task 9. Our team (referred to as LiangZhao) employs different multilingual models with a weighted loss focused on the complexity of code-mixing in a sentence; the best model achieved an F1-score of 0.806 and ranked 1st in the Sentimix Spanglish subtask. We analyze the performance of the method and demonstrate the contribution of each component of our architecture.
This paper describes Amobee’s participation in SemEval-2020 task 7: “Assessing Humor in Edited News Headlines”, sub-tasks 1 and 2. The goal of this task was to estimate the funniness of human-modified news headlines. In this paper we present methods to fine-tune and ensemble various language model (LM) based classifiers for this task. This technique was used for both sub-tasks and reached second place (out of 49) in sub-task 1 with an RMSE score of 0.5, and second place (out of 32) in sub-task 2 with an accuracy of 66%, without using any additional data except the official training set.
We use pretrained transformer-based language models in SemEval-2020 Task 7: Assessing the Funniness of Edited News Headlines. Inspired by the incongruity theory of humor, we use a contrastive approach to capture the surprise in the edited headlines. In the official evaluation, our system gets 0.531 RMSE in Subtask 1, 11th among 49 submissions. In Subtask 2, our system gets 0.632 accuracy, 9th among 32 submissions.
In this paper we describe our system submitted to SemEval 2020 Task 7: “Assessing Humor in Edited News Headlines”. We participated in all subtasks, in which the main goal is to predict the mean funniness of the edited headline given the original and the edited headline. Our system involves two similar sub-networks, which generate vector representations for the original and edited headlines respectively. And then we do a subtract operation of the outputs from two sub-networks to predict the funniness of the edited headline.
Our approach is constructed to improve on a couple of aspects; preprocessing with an emphasis on humor sense detection, using embeddings from state-of-the-art language model(Elmo), and ensembling the results came up with using machine learning model Na ̈ıve Bayes(NB) with a deep learning pre-trained models. Elmo-NB participation has scored (0.5642) on the competition leader board, where results were measured by Root Mean Squared Error (RMSE).
Natural language processing (NLP) has been applied to various fields including text classification and sentiment analysis. In the shared task of assessing the funniness of edited news headlines, which is a part of the SemEval 2020 competition, we preprocess datasets by replacing abbreviation, stemming words, then merge three models including Light Gradient Boosting Machine (LightGBM), Long Short-Term Memory (LSTM), and Bidirectional Encoder Representation from Transformer (BERT) by taking the average to perform the best. Our team Ferryman wins the 9th place in Sub-task 1 of Task 7 - Regression.
This paper presents a neural network system with which we participate in the first sub-task of SemEval-2020 shared task 7 “Assessing the Funniness of Edited News Headlines”. Our target is to create a neural network model that can predict the funniness of edited headlines. We build our model using a combination of LSTM and TF-IDF, followed by a feed-forward neural network. The system manages to slightly improve RMSE scores over our mean-score baseline.
In this paper we describe our contribution to the SemEval-2020 Humor Assessment task. We essentially use three different types of features that are passed into a ridge regression to determine a funniness score for an edited news headline: statistical, count-based features, semantic features and contextual information. For deciding which one of two given edited headlines is funnier, we additionally use scoring information and logistic regression. Our work was mostly concentrated on investigating features, rather than improving prediction based on pre-trained language models. The resulting system is task-specific, lightweight and performs above the majority baseline. Our experiments indicate that features related to socio-cultural context, in our case mentions of Donald Trump, generally perform better than context-independent features like headline length.
This paper contains a description of my solution to the problem statement of SemEval 2020: Assessing the Funniness of Edited News Headlines. I propose a Siamese Transformer based approach, coupled with a Global Attention mechanism that makes use of contextual embeddings and focus words, to generate important features that are fed to a 2 layer perceptron to rate the funniness of the edited headline. I detail various experiments to show the performance of the system. The proposed approach outperforms a baseline Bi-LSTM architecture and finished 5th (out of 49 teams) in sub-task 1 and 4th (out of 32 teams) in sub-task 2 of the competition and was the best non-ensemble model in both tasks.
This paper presents two different systems for the SemEval shared task 7 on Assessing Humor in Edited News Headlines, sub-task 1, where the aim was to estimate the intensity of humor generated in edited headlines. Our first system is a feature-based machine learning system that combines different types of information (e.g. word embeddings, string similarity, part-of-speech tags, perplexity scores, named entity recognition) in a Nu Support Vector Regressor (NuSVR). The second system is a deep learning-based approach that uses the pre-trained language model RoBERTa to learn latent features in the news headlines that are useful to predict the funniness of each headline. The latter system was also our final submission to the competition and is ranked seventh among the 49 participating teams, with a root-mean-square error (RMSE) of 0.5253.
Task 7, Assessing the Funniness of Edited News Headlines, in the International Workshop SemEval2020 introduces two sub-tasks to predict the funniness values of edited news headlines from the Reddit website. This paper proposes the BFHumor model of the MLEngineer team, which participated in both sub-tasks of this competition. The BFHumor model is a BERT-Flair based humor detection model that combines different pre-trained models with various Natural Language Processing (NLP) techniques. The Bidirectional Encoder Representations from Transformers (BERT) regressor is the primary pre-trained model in our approach, whereas Flair is the main NLP library. The BFHumor model was ranked 4th in sub-task 1 with a root mean square error (RMSE) value of 0.51966, which is 0.02 away from the first-ranked model. The team was also ranked 12th in sub-task 2 with an accuracy of 0.62291, which is 0.05 away from the top-ranked model. Our results indicate that the BFHumor model is one of the top models for detecting humor in text.
The use of pre-trained language models such as BERT and ULMFiT has become increasingly popular in shared tasks, due to their powerful language modelling capabilities. Our entry to SemEval uses ERNIE 2.0, a language model which is pre-trained on a large number of tasks to enrich the semantic and syntactic information learned. ERNIE’s knowledge masking pre-training task is a unique method for learning about named entities, and we hypothesise that it may be of use in a dataset which is built on news headlines and which contains many named entities. We optimize the hyperparameters in a regression and classification model and find that the hyperparameters we selected helped to make bigger gains in the classification model than the regression model.
This paper describes my efforts in evaluating how editing news headlines can make them funnier within the framework of SemEval 2020 Task 7. I participated in both of the sub-tasks: Sub-task 1 “Regression” and Sub-task 2 “Predict the funnier of the two edited versions of an original headline”. I experimented with a number of different models, but ended up using DeepPavlov logistic regression (LR) with BERT English cased embeddings for the first sub-task and a support vector regression (SVR) model for the second. The RMSE score obtained for the first sub-task was 0.65099 and the accuracy for the second was 0.32915.
This paper describes the work done by the team UniTuebingenCL for the SemEval 2020 Task 7: “Assessing the Funniness of Edited News Headlines”. We participated in both sub-tasks: sub-task A, given the original and the edited headline, predicting the mean funniness of the edited headline; and sub-task B, given the original headline and two edited versions, predicting which edited version is the funnier of the two. A Ridge Regression model using ELMo and GloVe embeddings as well as Truncated Singular Value Decomposition was used as the final model. A long short-term memory (LSTM) recurrent network served as another approach for assessing the funniness of a headline. For the first sub-task, we experimented with the extraction of multiple features to achieve a lower Root Mean Squared Error. The lowest Root Mean Squared Error achieved was 0.575 for sub-task A, and the highest accuracy was 0.618 for sub-task B.
We describe the UTFPR system for SemEval-2020’s Task 7: Assessing Humor in Edited News Headlines. Ours is a minimalist unsupervised system that uses word co-occurrence frequencies from large corpora to capture unexpectedness as a means to capture funniness. Our system placed 22nd on the shared task’s Task 2. We found that our approach requires more text than we used to perform reliably, and that unexpectedness alone is not sufficient to gauge funniness for humorous content that targets a diverse audience.
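One way to turn co-occurrence frequencies into an unexpectedness score is sketched below. This is an illustration under stated assumptions, not the UTFPR system itself: the use of smoothed pointwise mutual information (PMI), sentence-level co-occurrence, the function names, and the toy corpus are all hypothetical choices made for the example.

```python
import math
from collections import Counter
from itertools import combinations

def build_counts(sentences):
    """Count in how many sentences each word and each unordered word pair occurs."""
    word_counts, pair_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        word_counts.update(tokens)
        pair_counts.update(frozenset(pair) for pair in combinations(sorted(tokens), 2))
    return word_counts, pair_counts, len(sentences)

def unexpectedness(word, context, word_counts, pair_counts, n_sentences, smoothing=0.5):
    """Negative mean smoothed PMI between `word` and the context words.

    Higher values mean the word co-occurs with its context less often than
    chance would predict. PMI is an assumption of this sketch.
    """
    denom = n_sentences + smoothing
    scores = []
    for other in context:
        if other == word:
            continue
        p_word = (word_counts[word] + smoothing) / denom
        p_other = (word_counts[other] + smoothing) / denom
        p_pair = (pair_counts[frozenset((word, other))] + smoothing) / denom
        scores.append(-math.log(p_pair / (p_word * p_other)))
    return sum(scores) / len(scores) if scores else 0.0

# Toy corpus standing in for a large one (illustrative only).
corpus = [
    "president signs the bill",
    "president meets the senate",
    "clown juggles at the circus",
    "senate passes the bill",
]
counts = build_counts(corpus)
context = ["president", "signs"]
expected_word = unexpectedness("bill", context, *counts)
surprising_word = unexpectedness("clown", context, *counts)
print(expected_word, surprising_word)
```

In this toy setting the replacement word "clown" scores as more unexpected than "bill" in the context of "president signs", which is the kind of signal such a system would rank headlines by.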
This paper describes our participation in SemEval 2020 Task 7 on assessment of humor in edited news headlines, which includes two subtasks: estimating the humor of micro-edited news headlines (subtask A) and predicting the more humorous of the two edited headlines (subtask B). To address these tasks, we propose two systems. The first system adopts a regression-based fine-tuned single-sequence bidirectional encoder representations from transformers (BERT) model with easy data augmentation (EDA), called “BERT+EDA”. The second system adopts a hybrid of a regression-based fine-tuned sequence-pair BERT model and a combined Naive Bayes and support vector machine (SVM) model estimated on term frequency–inverse document frequency (TF-IDF) features, called “BERT+NB-SVM”. In this case, no additional training datasets were used, and the BERT+NB-SVM model outperformed BERT+EDA. The official root-mean-square error (RMSE) score for subtask A is 0.57369 and ranks 31st out of 48, whereas the best RMSE of BERT+NB-SVM is 0.52429, ranking 7th. For subtask B, we simply use a sequence-pair BERT model, the official accuracy of which is 0.53196 and ranks 25th out of 32.
This paper describes the xsysigma team’s system for SemEval 2020 Task 7: Assessing the Funniness of Edited News Headlines. The target of this task is to assess the change in funniness of news headlines after minor editing, and it is divided into two subtasks: Subtask 1 is a regression task to detect the humor intensity of the sentence after editing; and Subtask 2 is a classification task to predict the funnier of the two edited versions of an original headline. In this paper, we only report our implementation of Subtask 2. We first construct sentence pairs with different features as input to Enhancement Inference BERT (EI-BERT). We then apply a data augmentation strategy and a pseudo-labeling method. After that, we apply feature enhancement interaction on the encoding of each sentence for classification with EI-BERT. Finally, we apply a weighted fusion algorithm to the logits obtained from different pre-trained models. We achieve 64.5% accuracy in Subtask 2 and rank first and fifth on the dev and test datasets, respectively.
Memotion analysis is an important subject in today’s world, which is dominated by social media. This paper presents the results and analysis of SemEval-2020 Task 8: Memotion Analysis by team Kraken, which qualified as winners for the task. The task involved performing multimodal sentiment analysis on memes commonly posted on social media. It comprised three subtasks: Task A was to find the overall sentiment of a meme and classify it as positive, negative or neutral; Task B was to classify it into types, namely humour, sarcasm, offensive or motivation, where a meme could have more than one category; and Task C was to further quantify the classifications achieved in Task B. An imbalanced dataset of 6992 rows was used, containing images (memes), text (extracted by OCR) and their annotations in 17 classes provided by the task organisers. In this paper, the authors propose a hybrid neural Naïve-Bayes Support Vector Machine and logistic regression to solve a multilevel 17-class classification problem. It achieved its best result in Task B, i.e., a 0.70 F1 score. The authors were ranked third in Task B.
Sentiment analysis is one of the most sought-after research problems among Natural Language Processing (NLP) researchers. The range of problems being addressed by sentiment analysis is increasing. Until now, most research has focused on predicting sentiment, or sentiment categories like sarcasm, humor, offense and motivation, on text data. However, there is very limited research on predicting or analyzing the sentiment of internet memes. We try to address this problem as part of “Task 8 of SemEval 2020: Memotion Analysis”. We participated in all three tasks under Memotion Analysis. Our system, built using the state-of-the-art pre-trained Transformer-based Bidirectional Encoder Representations from Transformers (BERT), performed better than the baseline models for tasks A and C and performed close to the baseline model for task B. In this paper, we present the data used, the steps we took for data cleaning and preparation, the fine-tuning process for the BERT-based model, and finally the prediction of the sentiment or sentiment categories. We found that sequence models like Long Short Term Memory (LSTM) and its variants performed below par in predicting the sentiments. We also performed a comparative analysis with other Transformer-based models like DistilBERT and XLNet.
Emotion recognition for Internet memes has attracted the attention of many researchers. In this paper, we adopt BERT and ResNet to detect the emotions of Internet memes. We focus on solving the problems of data imbalance and noisy data. We use RandAugment to augment the image data, and use Training Signal Annealing (TSA) to reduce the impact of label imbalance. At the same time, a new loss function is designed to ensure that the model is not affected by input noise, which improves the robustness of the model. We participated in sub-task A and our model based on BERT obtains a 34.58% macro F1 score, ranking 10/32.
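Training Signal Annealing can be sketched as follows. The abstract does not say which schedule the authors used or how they combined TSA with their loss, so the schedule options, function names, and example probabilities below are assumptions that follow the general formulation of TSA: at each step, training examples that the model already predicts with a gold-class probability above a growing threshold are left out of the loss.

```python
import math

def tsa_threshold(step, total_steps, num_classes, schedule="linear"):
    """Confidence threshold that grows from 1/num_classes to 1 during training."""
    progress = step / total_steps
    if schedule == "linear":
        alpha = progress
    elif schedule == "log":
        alpha = 1.0 - math.exp(-progress * 5.0)
    elif schedule == "exp":
        alpha = math.exp((progress - 1.0) * 5.0)
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return alpha * (1.0 - 1.0 / num_classes) + 1.0 / num_classes

def tsa_masked_loss(gold_class_probs, threshold):
    """Mean cross-entropy over examples whose gold-class probability is below the threshold."""
    kept = [-math.log(p) for p in gold_class_probs if p < threshold]
    return sum(kept) / len(kept) if kept else 0.0

# Hypothetical gold-class probabilities for a batch of four memes, three classes.
batch_probs = [0.90, 0.20, 0.45, 0.70]
threshold = tsa_threshold(step=250, total_steps=1000, num_classes=3)
print(threshold, tsa_masked_loss(batch_probs, threshold))
```

Early in training the threshold is low, so confidently predicted examples (often from the majority class) contribute nothing to the loss, which limits how quickly the model can overfit to them.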
A meme is a pictorial representation of an idea or theme. In an age of a growing number of social media platforms, memes are spreading rapidly from person to person and becoming a trending way of expressing opinions. However, due to the multimodal characteristics of meme contents, detecting and analyzing the underlying emotion of a meme is a formidable task. In this paper, we present our approach for detecting the emotion of a meme as defined in SemEval-2020 Task 8. Our team CSECU_KDE_MA employs an attention-based neural network model to tackle the problem. Upon extracting the text contents from a meme using an optical character reader (OCR), we represent it using the distributed representation of words. Next, we perform convolution with multiple kernel sizes to obtain higher-level feature sequences. The feature sequences are then fed into an attentive time-distributed bidirectional LSTM model to learn the long-term dependencies effectively. Experimental results show that our proposed neural model obtained competitive performance among the participants’ systems.
Recent technological advancements in the Internet and social media usage have resulted in the evolution of faster and more efficient platforms of communication. These platforms include visual, textual and speech mediums and have brought about a unique social phenomenon called Internet memes. Internet memes take the form of images with witty, catchy, or sarcastic text descriptions. In this paper, we present a multi-modal sentiment analysis system using deep neural networks combining Computer Vision and Natural Language Processing. Our aim differs from the usual sentiment analysis goal of predicting whether a text expresses positive or negative sentiment; instead, we aim to classify an Internet meme as positive, negative, or neutral, identify the type of humor expressed, and quantify the extent to which a particular effect is being expressed. Our system has been developed using CNN and LSTM and outperformed the baseline score.
In this paper, we describe the ensemble-based system designed by team guoym for SemEval-2020 Task 8, Memotion Analysis. In our system, we utilize five types of data representation as input to base classifiers to extract information from different aspects. We train five base classifiers for each type of representation using five-fold cross-validation. The outputs of these base classifiers are then combined through a data-based ensemble method and a feature-based ensemble method to make full use of all data and representations from different aspects. Our method achieves performance within the top 2 ranks in the final leaderboard of Memotion Analysis among 36 teams.
Users of social networking services often share their emotions via multi-modal content, usually images paired with text embedded in them. SemEval-2020 task 8, Memotion Analysis, aims at automatically recognizing these emotions of so-called internet memes. In this paper, we propose a simple but effective Modality Ensemble that incorporates visual and textual deep-learning models, which are independently trained, rather than providing a single multi-modal joint network. To this end, we first fine-tune four pre-trained visual models (i.e., Inception-ResNet, PolyNet, SENet, and PNASNet) and four textual models (i.e., BERT, GPT-2, Transformer-XL, and XLNet). Then, we fuse their predictions with ensemble methods to effectively capture cross-modal correlations. The experiments performed on the dev set show that visual and textual features aided each other, especially in subtask C, and consequently, our system ranked 2nd on subtask C.
Social media is abundant in visual and textual information presented together or in isolation. Memes are the most popular form, belonging to the former class. In this paper, we present our approaches for the Memotion Analysis problem as posed in SemEval-2020 Task 8. The goal of this task is to classify memes based on their emotional content and sentiment. We leverage techniques from Natural Language Processing (NLP) and Computer Vision (CV) towards the sentiment classification of internet memes (Subtask A). We consider Bimodal (text and image) as well as Unimodal (text-only) techniques in our study, ranging from the Naïve Bayes classifier to Transformer-based approaches. Our results show that a text-only approach, a simple Feed Forward Neural Network (FFNN) with Word2vec embeddings as input, outperforms all the others. We stand first in the Sentiment analysis task with a relative improvement of 63% over the baseline macro-F1 score. Our work is relevant to any task concerned with the combination of different modalities.
The information shared on social media, both images and text, is increasingly important, and memes are perhaps the most popular combination of these two kinds of data. This manuscript describes our participation in the Memotion task at SemEval 2020. The task is to classify memes into several categories related to their emotional content. To build the proposed system, we used different strategies, and the best ones were based on deep neural networks and a text categorization algorithm. We obtained results analyzing the text and images separately, and also in combination. Our best performance was achieved in Task A, related to polarity classification.
This paper presents two approaches for the internet meme classification challenge of SemEval-2020 Task 8 by Team KAFK (cosec). The first approach uses both text and image features, while the second approach uses only the images. Error analysis of the two approaches shows that using only the images is more robust to the noise in the text on the memes. We utilize pre-trained DistilBERT and EfficientNet to extract features from the text and image of the memes respectively. Our classification systems obtained a macro F1 score of 0.3286 for Task A and 0.5005 for Task B.
Internet memes have become a very popular mode of expression on social media networks today. Their multi-modal nature, caused by a mixture of text and image, makes them a very challenging research object for automatic analysis. In this paper, we describe our contribution to the SemEval-2020 Memotion Analysis Task. We propose a Multi-Modal Multi-Task learning system, which incorporates “memebeddings”, viz. joint text and vision features, to learn and optimize for all three Memotion subtasks simultaneously. The experimental results show that the proposed system consistently outperforms the competition’s baseline, and the system setup with continual learning (where tasks are trained sequentially) obtains the best classification F1-scores.
In this paper, we describe our deep learning system used for SemEval 2020 Task 8: Memotion analysis. We participated in all the subtasks, i.e., Subtask A: Sentiment classification, Subtask B: Humor classification, and Subtask C: Scales of semantic classes. A similar multimodal architecture was used for each subtask. The proposed architecture makes use of transfer learning for image and text feature extraction. The extracted features are then fused together using a stacked bidirectional Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU) model with an attention mechanism for final predictions. We also propose a single model for predicting semantic classes (Subtask B) as well as their scales (Subtask C) by branching the final output of the post-LSTM dense layers. Our model was ranked 5th in Subtask B and 8th in Subtask C, and performed well in Subtask A on the leaderboard. Our system makes use of transfer learning for feature extraction and fusion of image and text features for predictions.
Internet memes are one of the most viral types of content in social media and are equally used in promoting hate speech. Towards a broader understanding of memes, this paper describes the MemoSys system submitted to Task 8 of SemEval 2020, which aims to classify the sentiment of Internet memes and provide a minimum description of the type of humor they depict (sarcastic, humorous, offensive, motivational) and its semantic scale. The solution presented covers four deep model architectures which are based on a joint fusion between the VGG16 pre-trained model for extracting visual information and the canonical BERT model or TF-IDF for text understanding. The system placed 5th of 36 participating systems in Task A, offering promising prospects for the use of transfer learning in Internet meme understanding.
The paper describes the systems submitted to SemEval-2020 Task 8: Memotion by the ‘NIT-Agartala-NLP-Team’. A dataset of 8879 memes was made available by the task organizers to train and test our models. Our systems include a Logistic Regression baseline, a BiLSTM + Attention-based learner and a transfer learning approach with BERT. For the three sub-tasks A, B and C, we attained ranks 24/33, 11/29 and 15/26, respectively. We highlight our difficulties in harnessing image information as well as some techniques and handcrafted features we employ to overcome these issues. We also discuss various modelling issues and theorize possible solutions and reasons as to why these problems persist.
Memes are steadily taking over the feeds of the public on social media. There is always the threat of malicious users on the internet posting offensive content, even through memes. Hence, the automatic detection of offensive images/memes is imperative along with detection of offensive text. However, this is a much more complex task as it involves both visual cues as well as language understanding and cultural/context knowledge. This paper describes our approach to the task of SemEval-2020 Task 8: Memotion Analysis. We chose to participate only in Task A which dealt with Sentiment Classification, which we formulated as a text classification problem. Through our experiments, we explored multiple training models to evaluate the performance of simple text classification algorithms on the raw text obtained after running OCR on meme images. Our submitted model achieved an accuracy of 72.69% and exceeded the existing baseline’s Macro F1 score by 8% on the official test dataset. Apart from describing our official submission, we shall elucidate how different classification models respond to this task.
This paper describes our system, UI, for Task A: Sentiment Classification in SemEval-2020 Task 8 Memotion Analysis. We use a common traditional machine learning method, SVM, with a combination of text and image features. The data consist of text extracted from memes and the images of the memes. We employ an n-gram language model for text features and a pre-trained model, VGG-16, for image features. After obtaining both sets of features from text and images in the form of 2-dimensional arrays, we concatenate them and classify the final features using SVM. The experimental results show that SVM achieved a macro F1 of 35%, which is 0.132 points (13.2 percentage points) above the baseline model.
Memes are widely used on social media. They usually contain multi-modal information such as images and texts, serving as valuable data sources to analyse opinions and sentiment orientations of online communities. The provided memes data often face an imbalanced data problem, that is, some classes or labelled sentiment categories significantly outnumber other classes. This often results in difficulty in applying machine learning techniques where balanced labelled input data are required. In this paper, a Gaussian Mixture Model sampling method is proposed to tackle the problem of class imbalance for the memes sentiment classification task. To utilise both text and image data, a multi-modal CNN-LSTM model is proposed to jointly learn latent features for positive, negative and neutral category predictions. The experiments show that the re-sampling model can slightly improve the accuracy on the trial data of sub-task A of Task 8. The multi-modal CNN-LSTM model can achieve macro F1 score 0.329 on the test set.
Users from the online environment can create different ways of expressing their thoughts, opinions, or conception of amusement. Internet memes were created specifically for these situations. Their main purpose is to transmit ideas by using combinations of images and texts such that they will create a certain state for the receptor, depending on the message the meme has to send. These posts can be related to various situations or events, thus adding a funny side to any circumstance our world is situated in. In this paper, we describe the system developed by our team for SemEval-2020 Task 8: Memotion Analysis. More specifically, we introduce a novel system to analyze these posts, a multimodal multi-task learning architecture that combines ALBERT for text encoding with VGG-16 for image representation. In this manner, we show that the information behind them can be properly revealed. Our approach achieves good performance on each of the three subtasks of the current competition, ranking 11th for Subtask A (0.3453 macro F1-score), 1st for Subtask B (0.5183 macro F1-score), and 3rd for Subtask C (0.3171 macro F1-score) while exceeding the official baseline results by high margins.
In this paper, we describe our entry to the Memotion Analysis task. The task of sentiment analysis of memes is motivated by the pervasive and ongoing problem of offensive content spreading on social media. Memes are an important medium for expressing opinions and emotions, and they can therefore often be hateful. In order to identify emotions expressed by memes, we construct a tool based on neural networks and deep learning methods. It takes advantage of the multi-modal nature of the task and performs fusion of image and text features extracted by models dedicated to this task. Moreover, we show that visual information might be more significant in the sentiment analysis of memes than textual information. Our solution achieved a 0.346 macro F1-score in Task A (Sentiment Classification), which brought us to 7th place in the official ranking of the competition.
Sentiment Analysis of code-mixed text has diverse applications in opinion mining, ranging from tagging user reviews to identifying social or political sentiments of a sub-population. In this paper, we present an ensemble architecture of a convolutional neural net (CNN) and a self-attention based LSTM for sentiment analysis of code-mixed tweets. While the CNN component helps in the classification of positive and negative tweets, the self-attention based LSTM helps in the classification of neutral tweets because of its ability to identify the correct sentiment among multiple sentiment-bearing units. We achieved F1 scores of 0.707 (ranked 5th) and 0.725 (ranked 13th) on Hindi-English (Hinglish) and Spanish-English (Spanglish) datasets, respectively. The submissions for Hinglish and Spanglish tasks were made under the usernames ayushk and harsh_6 respectively.
In today’s interconnected and multilingual world, code-mixing of languages on social media is a common occurrence. While many Natural Language Processing (NLP) tasks like sentiment analysis are mature and well designed for monolingual text, techniques to apply these tasks to code-mixed text still warrant exploration. This paper describes our feature engineering approach to sentiment analysis in code-mixed social media text for SemEval-2020 Task 9: SentiMix. We tackle this problem by leveraging a set of hand-engineered lexical, sentiment, and metadata features to design a classifier that can disambiguate between “positive”, “negative” and “neutral” sentiment. With this model we are able to obtain a weighted F1 score of 0.65 for the “Hinglish” task and 0.63 for the “Spanglish” task.
In this paper, we describe a methodology to predict sentiment in code-mixed tweets (Hindi-English). Our team, called verissimo.manoel in CodaLab, developed an approach based on an ensemble of four models (MultiFiT, BERT, ALBERT, and XLNet). The final classification algorithm was an ensemble of some predictions of all softmax values from these four models. This architecture was used and evaluated in the context of the SemEval 2020 challenge (task 9), and our system achieved an F1 score of 72.7%.
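A softmax-level ensemble of this kind can be sketched as follows. The abstract does not specify how the softmax values were combined, so the unweighted averaging used here is an assumption, and the function names and logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(per_model_logits, labels):
    """Average the softmax distributions of several models and pick the top label."""
    distributions = [softmax(logits) for logits in per_model_logits]
    mean_probs = [sum(column) / len(distributions) for column in zip(*distributions)]
    best = max(range(len(labels)), key=mean_probs.__getitem__)
    return labels[best], mean_probs

# Hypothetical logits from four models for a single tweet (illustrative only).
labels = ["positive", "negative", "neutral"]
tweet_logits = [
    [2.0, 0.5, 0.1],
    [1.5, 1.0, 0.2],
    [0.2, 1.8, 0.3],
    [2.2, 0.1, 0.4],
]
predicted_label, mean_probs = ensemble_predict(tweet_logits, labels)
print(predicted_label, mean_probs)
```

Averaging probabilities rather than taking a majority vote lets a model that is very confident outweigh several models that are only marginally in favour of another class.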
In this paper, we present our approach for sentiment classification on Spanish-English code-mixed social media data in SemEval-2020 Task 9. We investigate the performance of various pre-trained Transformer models using different fine-tuning strategies. We explore both monolingual and multilingual models with the standard fine-tuning method. Additionally, we propose a custom model that we fine-tune in two steps: once with a language modeling objective, and once with a task-specific objective. Although two-step fine-tuning improves sentiment classification performance over the base model, the large multilingual XLM-RoBERTa model achieves the best weighted F1-score, with 0.537 on development data and 0.739 on test data. With this score, our team jupitter placed tenth overall in the competition.
The phenomenon of mixing the vocabulary and syntax of multiple languages within the same utterance is called Code-Mixing. This is more evident in multilingual societies. In this paper, we have developed a system for SemEval 2020: Task 9 on Sentiment Analysis of Hindi-English code-mixed social media text. Our system first generates two types of embeddings for the social media text. The first is character-level embeddings, which encode character-level information and handle out-of-vocabulary entries; the second is FastText word embeddings, which capture morphology and semantics. These two embeddings were passed to an LSTM network, and the system outperformed the baseline model.
Problems involving code-mixed language are often plagued by a lack of resources and an absence of materials to perform sophisticated transfer learning with. In this paper we describe our submission to the Sentimix Hindi-English task involving sentiment classification of code-mixed texts, and with an F1 score of 67.1%, we demonstrate that simple convolution and attention may well produce reasonable results.
Code-mixing is the phenomenon of using multiple languages in the same utterance. It is a frequently used pattern of communication on social media sites such as Facebook, Twitter, etc. Sentiment analysis of monolingual text is a well-studied task. Code-mixing adds to the challenge of analyzing the sentiment of text on various platforms such as social media, online gaming, forums, product reviews, etc. We present a candidate sentence generation and selection based approach on top of a Bi-LSTM based neural classifier to classify Hinglish code-mixed text into one of three sentiment classes: positive, negative, or neutral. The proposed candidate sentence generation and selection based approach shows an improvement in system performance compared to the Bi-LSTM based neural classifier alone. We can extend the proposed method to solve other problems with code-mixing in textual data, such as humor detection, intent classification, etc.
The paper describes the systems that our team IRLab_DAIICT employed for the shared task Sentiment Analysis for Code-Mixed Social Media Text in SemEval 2020. We conducted our experiments on a Hindi-English code-mixed tweet dataset which was annotated with sentiment labels. F1-score was the official evaluation metric and our best approach, an ensemble of Logistic Regression, Random Forest and BERT, achieved an F1-score of 0.693.
Sentiment Analysis is a well-studied field of Natural Language Processing. However, the rapid growth of social media and the noisy content within it pose significant challenges in addressing this problem with well-established methods and tools. One of these challenges is code-mixing, which means using different languages to convey thoughts in social media texts. Our group, named IUST (username: TAHA), participated in the SemEval-2020 shared task 9 on Sentiment Analysis for Code-Mixed Social Media Text, and we have attempted to develop a system to predict the sentiment of a given code-mixed tweet. We used different preprocessing techniques and proposed different methods that vary from NBSVM to more complicated deep neural network models. Our best performing method obtains an F1 score of 0.751 for the Spanish-English sub-task and 0.706 for the Hindi-English sub-task.
Code-mixing is a phenomenon which arises mainly in multilingual societies. Multilingual people, who are well versed in their native languages and are also English speakers, tend to code-mix using English-based phonetic typing and the insertion of anglicisms in their main language. This linguistic phenomenon poses a great challenge to conventional NLP domains such as Sentiment Analysis, Machine Translation, and Text Summarization, to name a few. In this work, we focus on working out a plausible solution to the domain of Code-Mixed Sentiment Analysis. This work was done as participation in the SemEval-2020 Sentimix Task, where we focused on the sentiment analysis of English-Hindi code-mixed sentences. Our username for the submission was “sainik.mahata” and the team name was “JUNLP”. We used feature extraction algorithms in conjunction with traditional machine learning algorithms such as SVR and Grid Search in an attempt to solve the task. Our approach garnered an F1-score of 66.2% when tested using metrics prepared by the organizers of the task.
This paper describes the participation of the LIMSI_UPV team in SemEval-2020 Task 9: Sentiment Analysis for Code-Mixed Social Media Text. The proposed approach competed in the SentiMix Hindi-English subtask, which addresses the problem of predicting the sentiment of a given Hindi-English code-mixed tweet. We propose a Recurrent Convolutional Neural Network that combines both the recurrent neural network and the convolutional network to better capture the semantics of the text, for code-mixed sentiment analysis. The proposed system obtained 0.69 (best run) in terms of F1 score on the given test data and achieved the 9th place (Codalab username: somban) in the SentiMix Hindi-English subtask.
This paper describes our contribution to the SemEval-2020 Task 9 on Sentiment Analysis for Code-mixed Social Media Text. We investigated two approaches to solve the task of Hinglish sentiment analysis. The first approach uses cross-lingual embeddings resulting from projecting Hinglish and pre-trained English FastText word embeddings in the same space. The second approach incorporates pre-trained English embeddings that are incrementally retrained with a set of Hinglish tweets. The results show that the second approach performs best, with an F1-score of 70.52% on the held-out test data.
Natural language processing (NLP) has been applied to various fields including text classification and sentiment analysis. In the shared task of sentiment analysis of code-mixed tweets, which is a part of the SemEval-2020 competition, we preprocess the datasets by replacing emoji, deleting uncommon characters, and so on, and then fine-tune Bidirectional Encoder Representations from Transformers (BERT) to obtain the best performance. After exhausting the top-3 submissions, our team MeisterMorxrc achieves an averaged F1 score of 0.730 in this task, and our CodaLab username is MeisterMorxrc.
Sentiment Analysis refers to the process of interpreting what a sentence emotes and classifying it as positive, negative, or neutral. The widespread popularity of social media has led to the generation of a lot of text data and specifically, in the Indian social media scenario, code-mixed Hinglish text, i.e., words of the Hindi language written in the Roman script along with other English words, is a common sight. The ability to effectively understand the sentiments in these texts is much needed. This paper proposes a system titled NITS-Hinglish to effectively carry out the sentiment analysis of such code-mixed Hinglish text. The system has fared well with a final F-Score of 0.617 on the test data.
We explore the task of sentiment analysis on Hinglish (code-mixed Hindi-English) tweets as participants of Task 9 of the SemEval-2020 competition, known as the SentiMix task. We had two main approaches: 1) applying transfer learning by fine-tuning pre-trained BERT models and 2) training feedforward neural networks on bag-of-words representations. During the evaluation phase of the competition, we obtained an F-score of 71.3% with our best model, which placed 4th out of 62 entries in the official system rankings.
Code-mixing is an interesting phenomenon where the speaker switches between two or more languages in the same text. In this paper, we describe an unconventional approach to tackling the SentiMix Hindi-English challenge (UID: aditya_malte). Instead of directly fine-tuning large contemporary Transformer models, we train our own domain-specific embeddings and make use of them for downstream tasks. We also discuss how this technique provides comparable performance while making for a much more deployable and lightweight model. It should be noted that we have achieved the stated results without using any ensembling techniques, thus respecting a paradigm of efficient and production-ready NLP. All relevant source code shall be made publicly available to encourage the usage and reproduction of the results.
Commonly occurring in settings such as social media platforms, code-mixed content makes the task of identifying sentiment notably more challenging and complex due to the lack of structure and noise present in the data. SemEval-2020 Task 9, SentiMix, was organized with the purpose of detecting the sentiment of a given code-mixed tweet comprising Hindi and English. We tackled this task by comparing the performance of a system, TueMix - a logistic regression algorithm trained with three feature components: TF-IDF n-grams, monolingual sentiment lexicons, and surface features - with a neural network approach. Our results showed that TueMix outperformed the neural network approach and yielded a weighted F1-score of 0.685.
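The weighted F1-score reported above averages per-class F1 values, each weighted by the number of gold examples of that class. The following is a minimal sketch of that computation; the function name and the toy labels are illustrative assumptions, and the official task scorer may differ in details.

```python
from collections import Counter

def weighted_f1(gold, pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(gold)
    total = len(gold)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n / total  # weight each class by its share of gold labels
    return score

# Toy labels (made up for illustration, not taken from the task data).
gold = ["positive", "negative", "neutral", "neutral"]
pred = ["positive", "neutral", "neutral", "neutral"]
print(weighted_f1(gold, pred))
```

Because the weights follow the gold label distribution, a class the system never predicts still pulls the score down in proportion to its frequency.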
Sentiment analysis is a process widely used in opinion mining campaigns conducted today. This phenomenon presents applications in a variety of fields, especially in collecting information related to the attitude or satisfaction of users concerning a particular subject. However, the task of managing such a process becomes noticeably more difficult when it is applied in cultures that tend to combine two languages in order to express ideas and thoughts. By interleaving words from two languages, the user can express themselves with ease, but at the cost of making the text far less intelligible both for those who are not familiar with this technique and for standard opinion mining algorithms. In this paper, we describe the systems developed by our team for SemEval-2020 Task 9 that aims to cover two well-known code-mixed languages: Hindi-English and Spanish-English. We intend to solve this issue by introducing a solution that takes advantage of several neural network approaches, as well as pre-trained word embeddings. Our approach (multilingual BERT) achieves promising performance on the Hindi-English task, with an average F1-score of 0.6850, registered on the competition leaderboard, ranking our team 16th out of 62 participants. For the Spanish-English task, we obtained an average F1-score of 0.7064, ranking our team 17th out of 29 participants, by using another multilingual Transformer-based model, XLM-RoBERTa.
In social-media platforms such as Twitter, Facebook, and Reddit, people prefer to use code-mixed language such as Spanish-English or Hindi-English to express their opinions. In this paper, we describe the different models we used, the use of an external dataset to train embeddings, and ensembling methods for the Sentimix and OffensEval tasks. The use of pre-trained embeddings usually helps in multiple tasks such as sentence classification and machine translation. In this experiment, we have applied our trained code-mixed embeddings and Twitter pre-trained embeddings to SemEval tasks. We evaluate our models on macro F1-score, precision, accuracy, and recall on the datasets. We intend to show that hyper-parameter tuning and data pre-processing steps help a lot in improving the scores. In our experiments, we are able to achieve 0.886 F1-Macro on the OffensEval Greek language subtask post-evaluation, whereas the highest is 0.852 during the Evaluation Period. We stood third in the Spanglish competition with our best F1-score of 0.756. Our Codalab username is asking28.
In this paper, we describe our system submitted for SemEval 2020 Task 9, Sentiment Analysis for Code-Mixed Social Media Text alongside other experiments. Our best performing system is a Transfer Learning-based model that fine-tunes XLM-RoBERTa, a transformer-based multilingual masked language model, on monolingual English and Spanish data and Spanish-English code-mixed data. Our system outperforms the official task baseline by achieving a 70.1% average F1-Score on the official leaderboard using the test set. For later submissions, our system manages to achieve a 75.9% average F1-Score on the test set using CodaLab username “ahmed0sultan”.
Mixed languages are widely used in social media, especially in multilingual societies like India. Detecting the emotions expressed in these languages is of great significance to the development of society and political trends. In this paper, we propose an ensemble of a pseudo-label based BERT model and a TF-IDF based SGDClassifier model to identify the sentiments of Hindi-English (Hi-En) code-mixed data. The ensemble model combines the strengths of rich semantic information from the BERT model and word frequency information from the probabilistic n-gram model to predict the sentiment of a given code-mixed tweet. Finally, our team got an average F1 score of 0.731 on the final leaderboard, and our CodaLab username is will_go.
This paper reports the zyy1510 team’s work in the International Workshop on Semantic Evaluation (SemEval-2020) shared task on Sentiment analysis for Code-Mixed (Hindi-English, English-Spanish) Social Media Text. The purpose of this task is to determine the polarity of the text, dividing it into one of the three labels positive, negative and neutral. To achieve this goal, we propose an ensemble model of word n-grams-based Multinomial Naive Bayes (MNB) and sub-word level representations in LSTM (Sub-word LSTM) to identify the sentiments of code-mixed data of Hindi-English and English-Spanish. This ensemble model combines the advantage of rich sequential patterns and the intermediate features after convolution from the LSTM model, and the polarity of keywords from the MNB model to obtain the final sentiment score. We have tested our system on Hindi-English and English-Spanish code-mixed social media data sets released for the task. Our model achieves the F1 score of 0.647 in the Hindi-English task and 0.682 in the English-Spanish task, respectively.
In this paper, we present the main findings and compare the results of SemEval-2020 Task 10, Emphasis Selection for Written Text in Visual Media. The goal of this shared task is to design automatic methods for emphasis selection, i.e. choosing candidates for emphasis in textual content to enable automated design assistance in authoring. The main focus is on short text instances for social media, with a variety of examples, from social media posts to inspirational quotes. Participants were asked to model emphasis using plain text with no additional context from the user or other design considerations. The SemEval-2020 Emphasis Selection shared task attracted 197 participants in the early phase and a total of 31 teams made submissions to this task. The highest-ranked submission achieved a Match_m score of 0.823. The analysis of systems submitted to the task indicates that BERT and RoBERTa were the most common choice of pre-trained models used, and part of speech tag (POS) was the most useful feature. Full results can be found on the task’s website.
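The Match_m ranking score mentioned above can be illustrated with a short sketch. As we understand the metric, for each instance the m words with the highest gold emphasis scores are compared with the m words with the highest predicted scores, and the overlap is normalised by min(m, sentence length) and averaged over instances. The function names, the tie-breaking rule, and the toy scores below are assumptions made for illustration.

```python
def top_m(scores, m):
    """Indices of the m highest-scoring words (ties broken by position; an assumption)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:m])

def match_m(gold_scores, pred_scores, m):
    """Average overlap between gold and predicted top-m word sets."""
    total = 0.0
    for gold, pred in zip(gold_scores, pred_scores):
        overlap = len(top_m(gold, m) & top_m(pred, m))
        total += overlap / min(m, len(gold))
    return total / len(gold_scores)

# One toy three-word instance with made-up emphasis scores.
gold_scores = [[0.9, 0.1, 0.5]]
pred_scores = [[0.8, 0.7, 0.2]]
print(match_m(gold_scores, pred_scores, 1), match_m(gold_scores, pred_scores, 2))
```

The score only depends on the ranking of words within each instance, not on the absolute predicted values.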
We propose a novel method that enables us to determine words that deserve to be emphasized from written text in visual media, relying only on the information from the self-attention distributions of pre-trained language models (PLMs). With extensive experiments and analyses, we show that 1) our zero-shot approach is superior to a reasonable baseline that adopts TF-IDF and that 2) there exist several attention heads in PLMs specialized for emphasis selection, confirming that PLMs are capable of recognizing important words in sentences.
We present the results and the main findings of SemEval-2020 Task 11 on Detection of Propaganda Techniques in News Articles. The task featured two subtasks. Subtask SI is about Span Identification: given a plain-text document, spot the specific text fragments containing propaganda. Subtask TC is about Technique Classification: given a specific text fragment, in the context of a full document, determine the propaganda technique it uses, choosing from an inventory of 14 possible propaganda techniques. The task attracted a large number of participants: 250 teams signed up to participate and 44 made a submission on the test set. In this paper, we present the task, analyze the results, and discuss the system submissions and the methods they used. For both subtasks, the best systems used pre-trained Transformers and ensembles.
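Span Identification is often cast as token labelling. The sketch below shows one way to turn character-offset spans into per-token binary labels. The whitespace tokenisation, the half-open `(start, end)` span convention, and the example sentence are assumptions made for illustration and are not taken from the task data.

```python
import re

def token_labels(text, spans):
    """Label each whitespace token 1 if it overlaps any (start, end) span, else 0."""
    labels = []
    for match in re.finditer(r"\S+", text):
        start, end = match.span()
        # Two half-open intervals overlap when each starts before the other ends.
        inside = any(start < s_end and s_start < end for s_start, s_end in spans)
        labels.append((match.group(), int(inside)))
    return labels

text = "They are the worst leaders ever seen"
spans = [(9, 26)]  # hypothetical annotated fragment: "the worst leaders"
print(token_labels(text, spans))
```

A sequence tagger can then be trained on these labels, and its token predictions mapped back to character offsets for scoring.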
This paper presents the winning system for the propaganda Technique Classification (TC) task and the second-placed system for the propaganda Span Identification (SI) task. The purpose of the TC task was to identify the applied propaganda technique given a propaganda text fragment. The goal of the SI task was to find specific text fragments which contain at least one propaganda technique. Both of the developed solutions used the semi-supervised learning technique of self-training. Interestingly, although CRF is barely used with transformer-based language models, the SI task was approached with a RoBERTa-CRF architecture. An ensemble of RoBERTa-based models was proposed for the TC task, with one of them making use of Span CLS layers we introduce in the present paper. In addition to describing the submitted systems, the impact of architectural decisions and training schemes is investigated, along with remarks regarding training models of the same or better quality with a lower computational budget. Finally, the results of error analysis are presented.
We present the results and the main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval-2020). The task included three subtasks corresponding to the hierarchical taxonomy of the OLID schema from OffensEval-2019, and it was offered in five languages: Arabic, Danish, English, Greek, and Turkish. OffensEval-2020 was one of the most popular tasks at SemEval-2020, attracting a large number of participants across all subtasks and languages: a total of 528 teams signed up to participate in the task, 145 teams submitted official runs on the test data, and 70 teams submitted system description papers.
This paper describes Galileo’s performance in SemEval-2020 Task 12 on detecting and categorizing offensive language in social media. For Offensive Language Identification, we proposed a multi-lingual method using Pre-trained Language Models, ERNIE and XLM-R. For offensive language categorization, we proposed a knowledge distillation method trained on soft labels generated by several supervised models. Our team participated in all three sub-tasks. In Sub-task A - Offensive Language Identification, we ranked first in terms of average F1 scores in all languages. We are also the only team which ranked among the top three across all languages. We also took the first place in Sub-task B - Automatic Categorization of Offense Types and Sub-task C - Offence Target Identification.
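Training on soft labels, as in the knowledge distillation approach described above, can be sketched as a cross-entropy between averaged teacher probabilities and the student's softmax output. This is a generic illustration under our own assumptions (simple averaging of teachers, no temperature scaling) and is not claimed to be the team's exact formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def average_soft_labels(teacher_outputs):
    """Average class probabilities from several teacher models for one example."""
    n = len(teacher_outputs)
    return [sum(col) / n for col in zip(*teacher_outputs)]

def soft_label_loss(student_logits, teacher_probs):
    """Cross-entropy between teacher soft labels and the student's prediction."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Two hypothetical teachers scoring one example over two classes.
soft = average_soft_labels([[0.8, 0.2], [0.6, 0.4]])
print(soft, soft_label_loss([0.0, 0.0], soft))
```

The loss is smallest when the student reproduces the teachers' distribution rather than a one-hot label, which is what lets the student inherit the teachers' uncertainty.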
This paper describes the system designed by the ERNIE Team, which achieved first place in SemEval-2020 Task 10: Emphasis Selection For Written Text in Visual Media. Given a sentence, we are asked to find the most important words as suggestions for automated design. We leverage unsupervised pre-trained models and finetune these models on our task. After our investigation, we found that the following models achieved excellent performance in this task: ERNIE 2.0, XLM-ROBERTA, ROBERTA and ALBERT. To finetune our models, we combine a pointwise regression loss and a pairwise ranking loss, the latter being closer to the final Match_m metric. We also find that additional feature engineering and data augmentation can help improve the performance. Our best model achieves the highest score of 0.823 and ranks first for all kinds of metrics.
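Combining a pointwise regression loss with a pairwise ranking loss can be sketched as follows. The abstract does not give the exact formulas, so the mean squared error, the hinge-style pairwise term, the `margin`, and the mixing weight `alpha` are all assumptions chosen for illustration.

```python
def pointwise_loss(pred, gold):
    """Mean squared error between predicted and gold emphasis scores."""
    return sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold)

def pairwise_loss(pred, gold, margin=0.1):
    """Hinge loss over word pairs (i, j) where gold ranks word i above word j."""
    losses = []
    for i in range(len(gold)):
        for j in range(len(gold)):
            if gold[i] > gold[j]:
                losses.append(max(0.0, margin - (pred[i] - pred[j])))
    return sum(losses) / len(losses) if losses else 0.0

def combined_loss(pred, gold, alpha=0.5, margin=0.1):
    """Weighted sum of the pointwise and pairwise terms (alpha is hypothetical)."""
    return alpha * pointwise_loss(pred, gold) + (1 - alpha) * pairwise_loss(pred, gold, margin)

gold = [0.9, 0.1]
print(combined_loss([0.9, 0.1], gold), combined_loss([0.1, 0.9], gold))
```

The pairwise term penalises only ordering mistakes, which is why it tracks a top-m ranking metric more closely than squared error alone.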
We describe our system for SemEval-2020 Task 11 on Detection of Propaganda Techniques in News Articles. We developed ensemble models using RoBERTa-based neural architectures, additional CRF layers, transfer learning between the two subtasks, and advanced post-processing to handle the multi-label nature of the task, the consistency between nested spans, repetitions, and labels from similar spans in training. We achieved sizable improvements over baseline fine-tuned RoBERTa models, and the official evaluation ranked our system 3rd (almost tied with the 2nd) out of 36 teams on the span identification subtask with an F1 score of 0.491, and 2nd (almost tied with the 1st) out of 31 teams on the technique classification subtask with an F1 score of 0.62.
This paper describes our participation in the SemEval-2020 task Detection of Propaganda Techniques in News Articles. We participate in both subtasks: Span Identification (SI) and Technique Classification (TC). We use a bi-LSTM architecture in the SI subtask and train a complex ensemble model for the TC subtask. Our architectures are built using embeddings from BERT in combination with additional lexical features and extensive label post-processing. Our systems achieve a rank of 8 out of 35 teams in the SI subtask (F1-score: 43.86%) and 8 out of 31 teams in the TC subtask (F1-score: 57.37%).
The paper presents the solution of team “Inno” to SemEval 2020 Task 11, “Detection of propaganda techniques in news articles”. The goal of the second subtask is to classify textual segments that correspond to one of the 18 given propaganda techniques in the news articles dataset. We tested a pure Transformer-based model with an optimized learning scheme on its ability to distinguish propaganda techniques from each other. Our model showed an overall F1 score of 0.6 on the validation set and 0.58 on the test set, and a non-zero F1 score on each class on both sets.
This paper describes our contribution to SemEval-2020 Task 11: Detection Of Propaganda Techniques In News Articles. We start with simple LSTM baselines and move to an autoregressive transformer decoder to predict long continuous propaganda spans for the first subtask. We also adopt an approach from relation extraction by enveloping the spans mentioned above with special tokens for the second subtask of propaganda technique classification. Our models report an F-score of 44.6% and a micro-averaged F-score of 58.2% for these subtasks, respectively.
This paper describes the NTUAAILS submission for SemEval 2020 Task 11 Detection of Propaganda Techniques in News Articles. This task comprises two different sub-tasks, namely A: Span Identification (SI), B: Technique Classification (TC). The goal for the SI sub-task is to identify specific fragments, in a given plain text, containing at least one propaganda technique. The TC sub-task aims to identify the applied propaganda technique in a given text fragment. A different model was trained for each sub-task. Our best performing system for the SI task consists of pre-trained ELMo word embeddings followed by a residual bidirectional LSTM network. For the TC sub-task, pre-trained word embeddings from GloVe are fed to a bidirectional LSTM neural network. The models achieved rank 28 among 36 teams with an F1 score of 0.335 and rank 25 among 31 teams with an F1 score of 0.463 for the SI and TC sub-tasks respectively. Our results indicate that the proposed deep learning models, although relatively simple in architecture and fast to train, achieve satisfactory results in the tasks at hand.
This paper presents our systems for SemEval 2020 Shared Task 11: Detection of Propaganda Techniques in News Articles. We participate in both the span identification and technique classification subtasks and report on experiments using different BERT-based models along with handcrafted features. Our models perform well above the baselines for both tasks, and we contribute ablation studies and discussion of our results to dissect the effectiveness of different features and techniques with the goal of aiding future studies in propaganda detection.
This paper summarizes our studies on propaganda detection techniques for news articles in the SemEval-2020 task 11. This task is divided into the SI and TC subtasks. We implemented the GloVe word representation, the BERT pretraining model, and the LSTM model architecture to accomplish this task. Our approach achieved good results for both the SI and TC subtasks. The macro-F1-score for the SI subtask is 0.406, and the micro-F1-score for the TC subtask is 0.505. Our method significantly outperforms the officially released baseline method, and the SI and TC subtasks rank 17th and 22nd, respectively, for the test set. This paper also compares the performances of different deep learning model architectures, such as the Bi-LSTM, LSTM, BERT, and XGBoost models, on the detection of news promotion techniques.
This paper describes the systems our team (AdelaideCyC) has developed for SemEval Task 12 (OffensEval 2020) to detect offensive language in social media. The challenge focuses on three subtasks – offensive language identification (subtask A), offense type identification (subtask B), and offense target identification (subtask C). Our team has participated in all the three subtasks. We have developed machine learning and deep learning-based ensembles of models. We have achieved F1-scores of 0.906, 0.552, and 0.623 in subtask A, B, and C respectively. While our performance scores are promising for subtask A, the results demonstrate that subtask B and C still remain challenging to classify.
This paper describes our participation in SemEval-2020 Task 12: Multilingual Offensive Language Detection. We jointly trained a single model by fine-tuning Multilingual BERT to tackle the task across all the proposed languages: English, Danish, Turkish, Greek and Arabic. Our single model had competitive results, with a performance close to top-performing systems in spite of sharing the same parameters across all languages. Zero-shot and few-shot experiments were also conducted to analyze the transference performance among these languages. We make our code public for further research.
Social media platforms such as Twitter offer people an opportunity to publish short posts in which they can share their opinions and perspectives. While these applications can be valuable, they can also be exploited to promote negative opinions, insults, and hatred against a person, race, or group. These opinions can be spread to millions of people at the click of a mouse. As such, there is a need to develop mechanisms by which offensive language can be automatically detected in social media channels and managed in a timely manner. To help achieve this goal, SemEval 2020 offered a shared task (OffensEval 2020) that involved the detection of offensive text in Arabic. We propose an ensemble approach that combines different levels of word embedding models and transfers learning from other sources of emotion-related tasks. The proposed system ranked 9th out of the 52 entries within the Arabic Offensive language identification subtask.
In this paper we present our submission to sub-task A at SemEval 2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval2). For Danish, Turkish, Arabic and Greek, we develop an architecture based on transfer learning and relying on a two-channel BERT model, in which the English BERT and the multilingual one are combined after creating a machine-translated parallel corpus for each language in the task. For English, instead, we adopt a more standard, single-channel approach. We find that, in a multilingual scenario, with some languages having small training data, using parallel BERT models with machine translated data can give systems more stability, especially when dealing with noisy data. The fact that machine translation on social media data may not be perfect does not hurt the overall classification performance.
We introduce an approach to multilingual Offensive Language Detection based on the mBERT transformer model. We download extra training data from Twitter in English, Danish, and Turkish, and use it to re-train the model. We then fine-tune the model on the provided training data and, in some configurations, implement a transfer learning approach exploiting the typological relatedness between English and Danish. Our systems obtained good results across the three languages (.9036 for EN, .7619 for DA, and .7789 for TR).
Offensive language detection is an important and challenging task in natural language processing. We present our submissions to the OffensEval 2020 shared task, which includes three English sub-tasks: identifying the presence of offensive language (Sub-task A), identifying the presence of target in offensive language (Sub-task B), and identifying the categories of the target (Sub-task C). Our experiments explore using a domain-tuned contextualized language model (namely, BERT) for this task. We also experiment with different components and configurations (e.g., a multi-view SVM) stacked upon BERT models for specific sub-tasks. Our submissions achieve F1 scores of 91.7% in Sub-task A, 66.5% in Sub-task B, and 63.2% in Sub-task C. We perform an ablation study which reveals that domain tuning considerably improves the classification performance. Furthermore, error analysis shows common misclassification errors made by our model and outlines directions for future research.
Task 12 of SemEval 2020 consisted of 3 subtasks, namely offensive language identification (Subtask A), categorization of offense type (Subtask B), and offense target identification (Subtask C). This paper presents the results our classifiers obtained for the English language in the 3 subtasks. The classifiers used by us were BERT and BiLSTM. On the test set, our BERT classifier obtained a macro F1 score of 0.90707 for subtask A, and 0.65279 for subtask B. The BiLSTM classifier obtained a macro F1 score of 0.57565 for subtask C. The paper also performs an analysis of the errors made by our classifiers. We conjecture that the presence of a few misleading instances in the dataset is affecting the performance of the classifiers. Our analysis also discusses the need for temporal context and world knowledge to determine the offensiveness of a few comments.
This paper presents the different models submitted by the LT@Helsinki team for the SemEval 2020 Shared Task 12. Our team participated in sub-tasks A and C, titled offensive language identification and offense target identification, respectively. In both cases we used the so-called Bidirectional Encoder Representation from Transformer (BERT), a model pre-trained by Google and fine-tuned by us on the OLID and SOLID datasets. The results show that offensive tweet classification is one of several language-based tasks where BERT can achieve state-of-the-art results.
This paper describes our approach to the task of identifying offensive languages in a multilingual setting. We investigate two data augmentation strategies: using additional semi-supervised labels with different thresholds and cross-lingual transfer with data selection. Leveraging the semi-supervised dataset resulted in performance improvements compared to the baseline trained solely with the manually-annotated dataset. We propose a new metric, Translation Embedding Distance, to measure the transferability of instances for cross-lingual data selection. We also introduce various preprocessing steps tailored for social media text along with methods to fine-tune the pre-trained multilingual BERT (mBERT) for offensive language identification. Our multilingual systems achieved competitive results in Greek, Danish, and Turkish at OffensEval 2020.
This paper presents our contribution to the Offensive Language Classification Task (English SubTask A) of SemEval 2020. We propose different BERT models trained on several offensive language classification and profanity datasets, and combine their output predictions in an ensemble model. We experimented with different ensemble approaches, such as SVMs, Gradient boosting, AdaBoosting and Logistic Regression. We further propose an under-sampling approach of the current SOLID dataset, which removed the most uncertain partitions of the dataset, increasing the recall of the dataset. Our best model, an average ensemble of four different BERT models, achieved 11th place out of 82 participants with a macro F1 score of 0.91344 in the English SubTask A.
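An average ensemble of the kind described above can be sketched by averaging each model's class probabilities and taking the most probable class. The probability values, the two-model setup, and the `NOT`/`OFF` label order below are illustrative assumptions rather than details from the paper.

```python
def average_ensemble(prob_lists):
    """Average per-example class probabilities across models."""
    n_models = len(prob_lists)
    averaged = []
    for example_probs in zip(*prob_lists):  # one tuple of model outputs per example
        n_classes = len(example_probs[0])
        averaged.append(
            [sum(p[c] for p in example_probs) / n_models for c in range(n_classes)]
        )
    return averaged

def predict(prob_lists, labels=("NOT", "OFF")):
    """Pick the label with the highest averaged probability for each example."""
    return [
        labels[max(range(len(p)), key=p.__getitem__)]
        for p in average_ensemble(prob_lists)
    ]

# Hypothetical outputs of two models on two tweets: [P(NOT), P(OFF)].
model_a = [[0.9, 0.1], [0.2, 0.8]]
model_b = [[0.6, 0.4], [0.45, 0.55]]
print(predict([model_a, model_b]))
```

Averaging probabilities rather than voting on hard labels lets a confident model outweigh an uncertain one.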
This work addresses the classification problem defined by sub-task A (English only) of the OffensEval 2020 challenge. We used a semi-supervised approach to classify given tweets into an offensive (OFF) or not-offensive (NOT) class. As the OffensEval 2020 dataset is loosely labelled with confidence scores given by unsupervised models, we used last year’s offensive language identification dataset (OLID) to label the OffensEval 2020 dataset. Our approach uses a pseudo-labelling method to annotate the current dataset. We trained four text classifiers on the OLID dataset and the classifier with the highest macro-averaged F1-score has been used to pseudo label the OffensEval 2020 dataset. The same model which performed best amongst four text classifiers on OLID dataset has been trained on the combined dataset of OLID and pseudo labelled OffensEval 2020. We evaluated the classifiers with precision, recall and macro-averaged F1-score as the primary evaluation metric on the OLID and OffensEval 2020 datasets.
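The pseudo-labelling step described above can be sketched as follows: a model trained on the labelled set scores each unlabelled tweet, the score is turned into an OFF/NOT label, and the two sets are merged for retraining. The `toy_model` keyword scorer, the 0.5 threshold, and the example texts are hypothetical stand-ins, not the classifiers or data used in the paper.

```python
def pseudo_label(unlabelled, predict_proba, threshold=0.5):
    """Attach OFF/NOT labels to unlabelled texts using a trained model's score."""
    return [
        (text, "OFF" if predict_proba(text) >= threshold else "NOT")
        for text in unlabelled
    ]

def toy_model(text):
    """Hypothetical stand-in for the best classifier trained on the labelled set."""
    return 0.9 if "idiot" in text.lower() else 0.1

# Made-up examples standing in for the labelled and unlabelled datasets.
labelled = [("you are an idiot", "OFF"), ("have a nice day", "NOT")]
new_tweets = ["What an IDIOT move", "lovely weather today"]

combined = labelled + pseudo_label(new_tweets, toy_model)
print(combined)
```

In practice the classifier would then be retrained on `combined`, as the abstract describes for the merged OLID and pseudo-labelled data.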
The present paper describes the system submitted by the PRHLT-UPV team for Task 12 of SemEval-2020: OffensEval 2020. The official title of the task is Multilingual Offensive Language Identification in Social Media, and it aims to identify offensive language in texts. The languages included in the task are English, Arabic, Danish, Greek and Turkish. We propose a model based on the BERT architecture for the analysis of texts in English. The approach leverages knowledge within a pre-trained model and performs fine-tuning for the particular task. In the analysis of the other languages the Multilingual BERT is used, which has been pre-trained for a large number of languages. In the experiments, the proposed method for English texts is compared with other approaches to analyze the relevance of the architecture used. Furthermore, simple models for the other languages are evaluated to compare them with the proposed one. The experimental results show that the model based on BERT outperforms other approaches. The main contribution of this work lies in this study, despite not obtaining the first positions in most cases of the competition ranking.
In this paper, we describe the PUM team’s entry to the SemEval-2020 Task 12. Creating our solution involved leveraging two well-known pretrained models used in natural language processing: BERT and XLNet, which achieve state-of-the-art results in multiple NLP tasks. The models were fine-tuned for each subtask separately and features taken from their hidden layers were combined and fed into a fully connected neural network. The model using aggregated Transformer features can serve as a powerful tool for the offensive language identification problem. Our team was ranked 7th out of 40 in Sub-task C - Offense target identification with 64.727% macro F1-score and 64th out of 85 in Sub-task A - Offensive language identification (89.726% F1-score).
This paper describes the participation of the SINAI team in Task 12: OffensEval 2: Multilingual Offensive Language Identification in Social Media. In particular, we participated in Sub-task A in English, which consists of identifying tweets as offensive or not offensive. We preprocess the dataset according to the language characteristics used on social media. Then, we select a small set from the training set provided by the organizers and fine-tune different Transformer-based models in order to test their effectiveness. Our team ranks 20th out of 85 participants in Subtask-A using the XLNet model.
With the proliferation of social media platforms, anonymous discussions together with easy online access, reports on offensive content have caused serious concern to both authorities and research communities. Although there is extensive research in identifying textual offensive language from online content, the dynamic discourse of social media content, as well as the emergence of new forms of offensive language, especially in a multilingual setting, calls for further research on the issue. In this work, we tackled Tasks A, B, and C of the Offensive Language Challenge at SemEval-2020. We handled offensive language in five languages: English, Greek, Danish, Arabic, and Turkish. Specifically, we pre-processed all provided datasets and developed an appropriate strategy to handle Tasks A, B, and C for identifying the presence/absence, type and target of offensive language in social media. For this purpose, we used the OLID2019 and OLID2020 datasets, and generated new datasets, which we made publicly available. We used the provided unsupervised machine learning implementation for automated annotated datasets and the online Google translation tools to create new datasets as well. We discussed the limitations and the success of our machine learning-based approach for all five languages. Our results for identifying offensive posts (Task A) yielded satisfactory accuracy of 0.92 for English, 0.81 for Danish, 0.84 for Turkish, 0.85 for Greek, and 0.89 for Arabic. For type detection (Task B), the results are significantly higher (0.87 accuracy) compared to target detection (Task C), which yields 0.81 accuracy. Moreover, after using automated Google translation, the overall efficiency improved by 2% for Greek, Turkish, and Danish.
Fine-tuning of pre-trained transformer networks such as BERT yields state-of-the-art results for text classification tasks. Typically, fine-tuning is performed on task-specific training datasets in a supervised manner. One can also fine-tune in an unsupervised manner beforehand by further pre-training on the masked language modeling (MLM) task. Here, in-domain data for unsupervised MLM that resembles the actual classification target dataset allows for domain adaptation of the model. In this paper, we compare current pre-trained transformer networks with and without MLM fine-tuning on their performance for offensive language detection. Our MLM fine-tuned RoBERTa-based classifier officially ranks 1st in the SemEval 2020 Shared Task 12 for the English language. Further experiments with the ALBERT model even surpass this result.
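The input corruption step behind the masked language modeling objective can be illustrated with a minimal sketch. It assumes the commonly used BERT-style recipe (select roughly 15% of tokens; replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged). The abstract does not state which settings were used, and the function name `mask_tokens`, the toy vocabulary, and the `[MASK]` string are illustrative.

```python
import random

MASK = "[MASK]"  # placeholder mask token (assumed, BERT-style)


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for one MLM training example.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would be computed only on selected positions.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep original
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels


if __name__ == "__main__":
    sentence = "in-domain tweets adapt the model before classification".split()
    vocab = sorted(set(sentence))
    print(mask_tokens(sentence, vocab, mask_prob=0.5, seed=1))
```

In a domain-adaptation setting such as the one described, examples like these would be drawn from unlabeled in-domain text before the supervised classification fine-tuning step.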
In visual media, text emphasis is the strengthening of words in a text to convey the intent of the author. Text emphasis in visual media is generally done by using different colors, backgrounds, or fonts for the text; it helps in conveying the actual meaning of the message to the readers. Emphasis selection is the task of choosing candidate words for emphasis; it helps in automatically designing posters and other media content with written text. If we consider only the text and do not know the intent, then there can be multiple valid emphasis selections. We propose the use of ensembles for emphasis selection to improve over single emphasis selection models. We show that the use of multi-embedding helps in enhancing the results for base models. To show the efficacy of the proposed approach, we also compare our results with state-of-the-art models.
This paper describes the model we apply in the SemEval-2020 Task 10. We formalize the task of emphasis selection as a simplified query-based machine reading comprehension (MRC) task, i.e. answering a fixed question of “Find candidates for emphasis”. We propose our subword puzzle encoding mechanism and subword fusion layer to align and fuse subwords. By introducing the semantic prior knowledge of the informative query and some other techniques, we attain 7th place during the evaluation phase and first place during the training phase.
This paper shows our system for SemEval-2020 Task 10, Emphasis Selection for Written Text in Visual Media. Our strategy is two-fold. First, we propose fine-tuning many pre-trained language models, predicting an emphasis probability distribution over tokens. Then, we propose stacking a trainable distribution fusion system, DistFuse, to fuse the predictions of the fine-tuned models. Experimental results show that DistFuse is comparable to or better than a naive average ensemble. As a result, we were ranked 2nd amongst 31 teams.
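For reference, the naive average ensemble used as the comparison point can be sketched as follows. Each fine-tuned model is assumed to output one emphasis probability per token; the function name `average_ensemble` and the example numbers are illustrative, and the trainable DistFuse model itself is not reproduced here.

```python
def average_ensemble(predictions):
    """Average per-token emphasis probabilities across models.

    predictions: list of lists, one inner list per model, all of equal
    length (one probability per token).
    """
    if not predictions:
        raise ValueError("need at least one model's predictions")
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all models must score the same tokens")
    n_models = len(predictions)
    return [sum(p[i] for p in predictions) / n_models for i in range(n_tokens)]


if __name__ == "__main__":
    model_a = [0.9, 0.1, 0.4]  # hypothetical scores for three tokens
    model_b = [0.7, 0.3, 0.6]
    print(average_ensemble([model_a, model_b]))
```

A learned fusion layer replaces the fixed weight of 1/n per model with trainable parameters, which is presumably where a system like DistFuse can gain over this baseline.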
We propose an end-to-end model that takes the text as input and outputs, for each word, the probability that it should be emphasized. Our results show that transformer-based models are particularly effective in this task. We achieved an evaluation score of 0.810 and were ranked third on the leaderboard.
To select tokens to be emphasised in short texts, a system mainly based on precomputed embedding models, such as BERT and ELMo, and LightGBM is proposed. Its performance is low. Additional analyses suggest that it is poor at predicting the highest emphasis scores, although these are the most important for the challenge, and that it is very sensitive to the specific instances provided during learning.
This paper presents our submission to the SemEval 2020 - Task 10 on emphasis selection in written text. We approach this emphasis selection problem as a sequence labeling task where we represent the underlying text with various contextual embedding models. We also employ label distribution learning to account for annotator disagreements. We experiment with the choice of model architectures, trainability of layers, and different contextual embeddings. Our best performing architecture is an ensemble of different models, which achieved an overall matching score of 0.783, placing us 15th out of 31 participating teams. Lastly, we analyze the results in terms of parts of speech tags, sentence lengths, and word ordering.
This paper describes our approach to emphasis selection for written text in visual media as a solution for SemEval 2020 Task 10. We used an ensemble of several different Transformer-based models and cast the task as a sequence labeling problem with two tags: ‘I’ as ‘emphasized’ and ‘O’ as ‘non-emphasized’ for each token in the text.
This paper describes the emphasis selection system of the team TextLearner for SemEval 2020 Task 10: Emphasis Selection For Written Text in Visual Media. The system aims to learn the emphasis selection distribution using contextual representations extracted from pre-trained language models and a two-staged ranking model. The experimental results demonstrate the strong contextual representation power of the recent advanced transformer-based language model RoBERTa, which can be exploited using a simple but effective architecture on top.
In visual communication, the ability of a short piece of text to catch someone’s eye in a single glance or from a distance is of paramount importance. In our approach to the SemEval-2020 task “Emphasis Selection For Written Text in Visual Media”, we use contextualized word representations from a pretrained model of the state-of-the-art BERT architecture together with a stacked bidirectional GRU network to predict token-level emphasis probabilities. For tackling low inter-annotator agreement in the dataset, we attempt to model multiple annotators jointly by introducing initialization with agreement dependent noise to a crowd layer architecture. We found our approach to both perform substantially better than initialization with identities for this purpose and to outperform a baseline trained with token level majority voting. Our submission system reaches substantially higher Match_m on the development set than the task baseline (0.779), but only slightly outperforms the test set baseline (0.754) using a three model ensemble.
In this work we describe and analyze a supervised learning system for word emphasis selection in phrases drawn from visual media as a part of the SemEval 2020 Shared Task 10. More specifically, we begin by briefly introducing the shared task problem and provide an analysis of interesting and relevant features present in the training dataset. We then introduce our LSTM-based model and describe its structure, input features, and limitations. Our model ultimately failed to beat the benchmark score, achieving an average match() score of 0.704 on the validation data (0.659 on the test data), but predicted 84.8% of words correctly considering a 0.5 threshold. We conclude with a thorough analysis and discussion of erroneous predictions with many examples and visualizations.
In this study, we propose a multi-granularity ordinal classification method to address the problem of emphasis selection. In detail, word embeddings are learned from Embeddings from Language Models (ELMo) to extract feature vector representations. Then, ordinal classifications are implemented at four different granularities to approximate the continuous emphasis values. Comparative experiments were conducted to compare the model with a baseline in which the problem is transformed into a label distribution problem.
In this paper, we present the result of our experiment with a variant of 1 Dimensional Convolutional Neural Network (Conv1D) hyper-parameters value. We describe the system entered by the team of Information Retrieval Lab. Universitas Indonesia (3218IR) in the SemEval 2020 Task 11 Sub Task 1 about propaganda span identification in news articles. The best model obtained an F1 score of 0.24 in the development set and 0.23 in the test set. We show that there is a potential for performance improvement through the use of models with appropriate hyper-parameters. Our system uses a combination of Conv1D and GloVe as Word Embedding to detect propaganda in the fragment text level.
Propaganda spreads the ideology and beliefs of like-minded people, brainwashing their audiences, and sometimes leading to violence. SemEval 2020 Task-11 aims to design automated systems for news propaganda detection. Task-11 consists of two sub-tasks, namely, Span Identification - given any news article, the system tags those specific fragments which contain at least one propaganda technique; and Technique Classification - correctly classify a given propagandist statement amongst 14 propaganda techniques. For sub-task 1, we use contextual embeddings extracted from pre-trained transformer models to represent the text data at various granularities and propose a multi-granularity knowledge sharing approach. For sub-task 2, we use an ensemble of BERT and logistic regression classifiers with linguistic features. Our results reveal that the linguistic features are the strong indicators for covering minority classes in a highly imbalanced dataset.
This report describes the methods employed by the Democritus University of Thrace (DUTH) team for participating in SemEval-2020 Task 11: Detection of Propaganda Techniques in News Articles. Our team dealt with Subtask 2: Technique Classification. We used shallow Natural Language Processing (NLP) preprocessing techniques to reduce the noise in the dataset, feature selection methods, and common supervised machine learning algorithms. Our final model is based on using the BERT system with entity mapping. To improve our model’s accuracy, we mapped certain words into five distinct categories by employing word-classes and entity recognition.
In this paper, we show our system for SemEval-2020 task 11, where we tackle propaganda span identification (SI) and technique classification (TC). We investigate heterogeneous pre-trained language models (PLMs) such as BERT, GPT-2, XLNet, XLM, RoBERTa, and XLM-RoBERTa for SI and TC fine-tuning, respectively. In large-scale experiments, we found that each of the language models has a characteristic property, and using an ensemble model with them is promising. Finally, the ensemble model was ranked 1st amongst 35 teams for SI and 3rd amongst 31 teams for TC.
This paper presents our submission to SemEval-2020 Task 11, Detection of Propaganda Techniques in News Articles. Of the two subtasks in this competition, we participated in the Technique Classification (TC) subtask, which aims to identify the propaganda techniques used in a specific propaganda span. We implemented and used various models to detect propaganda. Our proposed model is based on the BERT uncased pre-trained language model, as it has achieved state-of-the-art performance on multiple NLP benchmarks. Our proposed model scored 0.55307 F1-Score, which outperforms the baseline model provided by the organizers (0.2519 F1-Score), and is 0.07 away from the best performing team. Compared to other participating systems, our submission is ranked 15th out of 31 participants.
In this paper we describe our submission for the task of Propaganda Span Identification in news articles. We introduce a BERT-BiLSTM based span-level propaganda classification model that identifies which token spans within the sentence are indicative of propaganda. The “multi-granular” model incorporates linguistic knowledge at various levels of text granularity, including word, sentence and document level syntactic, semantic and pragmatic affect features, which significantly improve model performance, compared to its language-agnostic variant. To facilitate better representation learning, we also collect a corpus of 10k news articles, and use it for fine-tuning the model. The final model is a majority-voting ensemble which learns different propaganda class boundaries by leveraging different subsets of incorporated knowledge.
This paper describes our submissions to SemEval 2020 Task 11: Detection of Propaganda Techniques in News Articles for each of the two subtasks of Span Identification and Technique Classification. We make use of pre-trained BERT language model enhanced with tagging techniques developed for the task of Named Entity Recognition (NER), to develop a system for identifying propaganda spans in the text. For the second subtask, we incorporate contextual features in a pre-trained RoBERTa model for the classification of propaganda techniques. We were ranked 5th in the propaganda technique classification subtask.
Since propaganda has become a more common technique in news, it is very important to look for possibilities of its automatic detection. In this paper, we present the neural model architecture submitted to the SemEval-2020 Task 11 competition: “Detection of Propaganda Techniques in News Articles”. We participated in both subtasks, propaganda span identification and propaganda technique classification. Our model utilizes recurrent Bi-LSTM layers with pre-trained word representations and also takes advantage of a self-attention mechanism. Our model achieved an F1 score of 0.405 for subtask 1 and 0.553 for subtask 2 on the test set, resulting in 17th and 16th place in subtask 1 and subtask 2, respectively.
This paper explains our team’s submission to the Shared Task of Fine-Grained Propaganda Detection, in which we propose a sequential BERT-CRF based Span Identification model where the fine-grained detection is carried out only on the articles that are flagged as containing propaganda by an ensemble SLC model. We propose this setup bearing in mind the practicality of this approach in identifying propaganda spans in the exponentially increasing content base, where the fine-tuned analysis of the entire data repository may not be the optimal choice due to its massive computational resource requirements. We present our analysis of different voting ensembles for the SLC model. Our system ranks 14th on the test set and 22nd on the development set, with F1 scores of 0.41 and 0.39, respectively.
This paper presents a solution for the Span Identification (SI) task in the “Detection of Propaganda Techniques in News Articles” competition at SemEval-2020. The goal of the SI task is to identify specific fragments of each article which contain the use of at least one propaganda technique. This is a binary sequence tagging task. We tested several approaches finally selecting a fine-tuned BERT model as our baseline model. Our main contribution is an investigation of several unsupervised data augmentation techniques based on distributional semantics expanding the original small training dataset as applied to this BERT-based sequence tagger. We explore various expansion strategies and show that they can substantially shift the balance between precision and recall, while maintaining comparable levels of the F1 score.
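To make the binary sequence tagging formulation concrete, the sketch below converts character-level propaganda spans into one 0/1 label per token. The whitespace tokenizer, the half-open `(start, end)` span convention, and the name `spans_to_token_labels` are assumptions for illustration; a BERT-based tagger would operate on subword tokens instead.

```python
import re


def spans_to_token_labels(text, spans):
    """Label each whitespace token 1 if it overlaps any span, else 0.

    spans: list of (start, end) character offsets, end exclusive.
    Returns a list of (token, label) pairs.
    """
    labeled = []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        # Two half-open intervals overlap when each starts before the other ends.
        inside = any(tok_start < end and start < tok_end for start, end in spans)
        labeled.append((match.group(), int(inside)))
    return labeled


if __name__ == "__main__":
    text = "They will destroy our great nation"
    # Hypothetical annotated span covering "destroy our great nation".
    print(spans_to_token_labels(text, [(10, 34)]))
```

At prediction time the mapping is run in reverse: consecutive tokens labeled 1 are merged back into character spans for scoring.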
This paper describes a system developed for detecting propaganda techniques from news articles. We focus on examining how emotional salience features extracted from a news segment can help to characterize and predict the presence of propaganda techniques. Correlation analyses surfaced interesting patterns that, for instance, the “loaded language” and “slogan” techniques are negatively associated with valence and joy intensity but are positively associated with anger, fear and sadness intensity. In contrast, “flag waving” and “appeal to fear-prejudice” have the exact opposite pattern. Through predictive experiments, results further indicate that whereas BERT-only features obtained F1-score of 0.548, emotion intensity features and BERT hybrid features were able to obtain F1-score of 0.570, when a simple feedforward network was used as the classifier in both settings. On gold test data, our system obtained micro-averaged F1-score of 0.558 on overall detection efficacy over fourteen propaganda techniques. It performed relatively well in detecting “loaded language” (F1 = 0.772), “name calling and labeling” (F1 = 0.673), “doubt” (F1 = 0.604) and “flag waving” (F1 = 0.543).
This paper describes our system (Solomon) details and results of participation in the SemEval 2020 Task 11 “Detection of Propaganda Techniques in News Articles”. We participated in Task “Technique Classification” (TC) which is a multi-class classification task. To address the TC task, we used RoBERTa based transformer architecture for fine-tuning on the propaganda dataset. The predictions of RoBERTa were further fine-tuned by class-dependent-minority-class classifiers. A special classifier, which employs dynamically adapted Least Common Sub-sequence algorithm, is used to adapt to the intricacies of repetition class. Compared to the other participating systems, our submission is ranked 4th on the leaderboard.
This paper describes the BERT-based models proposed for two subtasks in SemEval-2020 Task 11: Detection of Propaganda Techniques in News Articles. We first build the model for Span Identification (SI) based on SpanBERT, and facilitate the detection by a deeper model and a sentence-level representation. We then develop a hybrid model for the Technique Classification (TC). The hybrid model is composed of three submodels including two BERT models with different training methods, and a feature-based Logistic Regression model. We endeavor to deal with imbalanced dataset by adjusting cost function. We are in the seventh place in SI subtask (0.4711 of F1-measure), and in the third place in TC subtask (0.6783 of F1-measure) on the development set.
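One common way to adjust the cost function for an imbalanced dataset is to weight each example's cross-entropy loss by the inverse frequency of its class. The abstract does not specify the adjustment the authors used, so the following is only a sketch of that general idea; the weighting scheme, the class names, and the function names are assumptions.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean negative log-likelihood.

    probs: list of dicts mapping class -> predicted probability.
    labels: gold class for each example.
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


if __name__ == "__main__":
    # Hypothetical toy data: three majority-class spans, one minority-class span.
    gold = ["loaded_language"] * 3 + ["repetition"]
    weights = inverse_frequency_weights(gold)
    preds = [{"loaded_language": 0.9, "repetition": 0.1}] * 4
    print(weights)
    print(weighted_cross_entropy(preds, gold, weights))
```

With these weights, a mistake on the rare class contributes as much to the loss as several mistakes on the frequent class, which discourages a classifier from always predicting the majority label.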
The identification of communication techniques in news articles such as propaganda is important, as such techniques can influence the opinions of large numbers of people. Most work so far focused on the identification at the news article level. Recently, a new dataset and shared task has been proposed for the identification of propaganda techniques at the finer-grained span level. This paper describes our system submission to the subtask of technique classification (TC) for the SemEval 2020 shared task on detection of propaganda techniques in news articles. We propose a method of combining neural BERT representations with hand-crafted features via stacked generalization. Our model has the added advantage that it combines the power of contextual representations from BERT with simple span-based and article-based global features. We present an ablation study which shows that even though BERT representations are very powerful also for this task, BERT still benefits from being combined with carefully designed task-specific features.
In this paper, we present our approach for the ’Detection of Propaganda Techniques in News Articles’ task as a part of the 2020 edition of International Workshop on Semantic Evaluation. The specific objective of this task is to identify and extract the text segments in which propaganda techniques are used. We propose a multi-system deep learning framework that can be used to identify the presence of propaganda fragments in a news article and also deep dive into the diverse enhancements of BERT architecture which are part of the final solution. Our proposed final model gave an F1-score of 0.48 on the test dataset.
In this paper, we describe our approaches and systems for the SemEval-2020 Task 11 on propaganda technique detection. We fine-tuned BERT and RoBERTa pre-trained models then merged them with an average ensemble. We conducted several experiments for input representations dealing with long texts and preserving context as well as for the imbalanced class problem. Our system ranked 20th out of 36 teams with 0.398 F1 in the SI task and 14th out of 31 teams with 0.556 F1 in the TC task.
The “Detection of Propaganda Techniques in News Articles” task at the SemEval 2020 competition focuses on detecting and classifying propaganda, which is pervasive in news articles. In this paper, we present a system that evaluates, at the sentence level, three traditional text representation techniques for these goals: tf*idf, word n-grams and character n-grams. First, we built a binary classifier that assigns each sentence a label, propaganda or non-propaganda. Second, we built a multilabel multiclass model to identify the propaganda techniques applied.
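As a rough illustration of these representations, the sketch below extracts word and character n-grams from sentences and weights them with tf*idf. The specific formula (raw term frequency times log(N/df)), the n-gram sizes, and all function names are assumptions; the abstract does not give the exact variant used.

```python
import math
from collections import Counter


def word_ngrams(sentence, n=1):
    """Lowercased word n-grams of a sentence."""
    words = sentence.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(sentence, n=3):
    """Lowercased character n-grams, spaces included."""
    s = sentence.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def tfidf(docs_features):
    """docs_features: list of feature lists, one per sentence.

    Returns one dict per sentence mapping feature -> tf * log(N / df).
    """
    n_docs = len(docs_features)
    df = Counter()
    for feats in docs_features:
        df.update(set(feats))  # count each feature once per sentence
    vectors = []
    for feats in docs_features:
        tf = Counter(feats)
        vectors.append({f: c * math.log(n_docs / df[f]) for f, c in tf.items()})
    return vectors


if __name__ == "__main__":
    sentences = ["We must act now", "We must not wait"]  # hypothetical inputs
    features = [word_ngrams(s, 1) + char_ngrams(s, 3) for s in sentences]
    for vec in tfidf(features):
        print(sorted(vec.items(), key=lambda kv: -kv[1])[:3])
```

The resulting sparse vectors could then be passed to any standard sentence-level classifier.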
We describe our participation at the SemEval 2020 “Detection of Propaganda Techniques in News Articles” - Techniques Classification (TC) task, designed to categorize textual fragments into one of the 14 given propaganda techniques. Our solution leverages pre-trained BERT models. We present our model implementations, evaluation results and analysis of these results. We also investigate the potential of combining language models with resampling and ensemble learning methods to deal with data imbalance and improve performance.
Our system for the PropEval task explores the ability of semantic features to detect and label propagandistic rhetorical techniques in English news articles. For Subtask 2, labeling identified propagandistic fragments with one of fourteen technique labels, our system attains a micro-averaged F1 of 0.40; in this paper, we take a detailed look at the fourteen labels and how well our semantically-focused model detects each of them. We also propose strategies to fill the gaps.
Manipulative and misleading news has become a commodity for some online news outlets, and such news has gained a significant impact on the global mindset of people. Propaganda is a frequently employed manipulation method whose goal is to influence readers by spreading ideas meant to distort or manipulate their opinions. This paper describes our participation in the SemEval-2020 Task 11: Detection of Propaganda Techniques in News Articles competition. Our approach considers specializing a pre-trained BERT model on propagandistic and hyperpartisan news articles, enabling it to create more adequate representations for the two subtasks, namely propaganda Span Identification (SI) and propaganda Technique Classification (TC). Our proposed system achieved an F1-score of 46.060% in subtask SI, ranking 5th out of 36 teams on the leaderboard, and a micro-averaged F1 score of 54.302% for subtask TC, ranking 19th out of 32 teams.
The article describes a fast solution to propaganda detection at SemEval-2020 Task 11, based on feature adjustment. We use per-token vectorization of features and a simple Logistic Regression classifier to quickly test different hypotheses about our data. We come up with what seems to us the best solution, however, we are unable to align it with the result of the metric suggested by the organizers of the task. We test how our system handles class and feature imbalance by varying the number of samples of two classes (Propaganda and None) in the training set, the size of a context window in which a token is vectorized and combination of vectorization means. The result of our system at SemEval2020 Task 11 is F-score=0.37.
In this work, we combine the state-of-the-art BERT architecture with the semi-supervised learning technique UDA in order to exploit unlabeled raw data to assess humor and detect propaganda in the tasks 7 and 11 of the SemEval-2020 competition. The use of UDA shows promising results with a systematic improvement of the performances over the four different subtasks, and even outperforms supervised learning with the additional labels of the Funlines dataset.
We only participated in the first subtask, and a neural sequence model was used to perform the sequence tagging task. We investigated the effects of different markup strategies on model performance. BERT, which performs very well in NLP, was used as a feature extractor.
Social media platforms, online news commenting spaces, and many other public forums have become widely known for issues of abusive behavior such as cyber-bullying and personal attacks. In this paper, we use the annotated tweets of the Offensive Language Identification Dataset (OLID) to train three levels of deep learning classifiers to solve the three sub-tasks associated with the dataset. Sub-task A is to determine if the tweet is toxic or not. Then, for offensive tweets, sub-task B requires determining whether the toxicity is targeted. Finally, for sub-task C, we predict the target of the offense; i.e. a group, individual, or other entity. In our solution, we tackle the problem of class imbalance in the dataset by using back translation for data augmentation and utilizing the fine-tuned BERT model in an ensemble of deep learning classifiers. We used this solution to participate in the three English sub-tasks of SemEval-2020 task 12. The proposed solution achieved 0.91393, 0.6300, and 0.57607 macro F1-average in sub-tasks A, B, and C respectively. We achieved the 9th, 14th, and 22nd places for sub-tasks A, B and C respectively.
This paper describes the systems submitted by the Arabic Language Technology group (ALT) at SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media. We focus on sub-task A (Offensive Language Identification) for two languages: Arabic and English. Our efforts for both languages achieved more than 90% macro-averaged F1-score on the official test set. For Arabic, the best results were obtained by a system combination of Support Vector Machine, Deep Neural Network, and fine-tuned Bidirectional Encoder Representations from Transformers (BERT). For English, the best results were obtained by fine-tuning BERT.
This paper describes a method and system to solve the problem of detecting offensive language in social media using anti-adversarial features. Our submission to the SemEval-2020 Task 12 challenge was generated by a stacked ensemble of neural networks fine-tuned on the OLID dataset and additional external sources. For Task-A (English), text normalisation filters were applied at both graphical and lexical level. The normalisation step effectively mitigates not only the natural presence of lexical variants but also intentional attempts to bypass moderation by introducing out of vocabulary words. Our approach provides strong F1 scores for both 2020 (0.9134) and 2019 (0.8258) challenges.
In this paper, we describe the team BRUMS entry to OffensEval 2: Multilingual Offensive Language Identification in Social Media in SemEval-2020. The OffensEval organizers provided participants with annotated datasets containing posts from social media in Arabic, Danish, English, Greek and Turkish. We present a multilingual deep learning model to identify offensive language in social media. Overall, the approach achieves acceptable evaluation scores, while maintaining flexibility between languages.
We present our submission and results for SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020) where we participated in offensive tweet classification tasks in English, Arabic, Greek, Turkish and Danish. Our approach included classical machine learning architectures such as support vector machines and logistic regression combined in an ensemble with a multilingual transformer-based model (XLM-R). The transformer model is trained on all languages combined in order to create a fully multilingual model which can leverage knowledge between languages. The machine learning model hyperparameters are fine-tuned and the statistically best performing ones included in the final ensemble.
The SemEval-2020 Task 12 (OffensEval) challenge focuses on detection of signs of offensiveness using posts or comments over social media. This task has been organized for several languages, e.g., Arabic, Danish, English, Greek and Turkish. It has featured three related sub-tasks for English language: sub-task A was to discriminate between offensive and non-offensive posts, the focus of sub-task B was on the type of offensive content in the post and finally, in sub-task C, proposed systems had to identify the target of the offensive posts. The corpus for each of the languages is developed using the posts and comments over Twitter, a popular social media platform. We have participated in this challenge and submitted results for different languages. The current work presents different machine learning and deep learning techniques and analyzes their performance for offensiveness prediction which involves various classifiers and feature engineering schemes. The experimental analysis on the training set shows that SVM using language specific pre-trained word embedding (Fasttext) outperforms the other methods. Our system achieves a macro-averaged F1 score of 0.45 for Arabic language, 0.43 for Greek language and 0.54 for Turkish language.
This paper describes our team work and submission for the SemEval 2020 (Sub-Task A) “Offensive Eval: Identifying and Categorizing Offensive Arabic Language in Arabic Social Media”. Our two baseline models were based on different levels of representation: character vs. word level. In word level based representation we implemented a convolutional neural network model and a bi-directional GRU model. In character level based representation we implemented a hyper CNN and LSTM model. All of these models have been further augmented with attention layers for a better performance on our task. We also experimented with three types of static word embeddings: word2vec, FastText, and Glove, in addition to emoji embeddings, and compared the performance of the different deep learning models on the dataset provided by this task. The bi-directional GRU model with attention has achieved the highest score (0.85% F1 score) among all other models.
This paper describes the Duluth systems that participated in SemEval–2020 Task 12, Multilingual Offensive Language Identification in Social Media (OffensEval–2020). We participated in the three English language tasks. Our systems provide a simple machine learning baseline using logistic regression. We trained our models on the distantly supervised training data made available by the task organizers and used no other resources. As might be expected we did not rank highly in the comparative evaluation: 79th of 85 in task A, 34th of 43 in task B, and 24th of 39 in task C. We carried out a qualitative analysis of our results and found that the class labels in the gold standard data are somewhat noisy. We hypothesize that the extremely high accuracy (> 90%) of the top ranked systems may reflect methods that learn the training data very well but may not generalize to the task of identifying offensive language in English. This analysis includes examples of tweets that despite being mildly redacted are still offensive.
Indiscriminately posting offensive remarks on social media may promote the occurrence of negative events such as violence, crime, and hatred. This paper examines different approaches and models for solving offensive tweet classification, which is a part of the OffensEval 2020 competition. The dataset is the Offensive Language Identification Dataset (OLID), which contains 14,200 annotated English tweets. The main challenges of data preprocessing are the unbalanced class distribution, abbreviations, and emoji. To overcome these issues, methods such as hashtag segmentation, abbreviation replacement, and emoji replacement were adopted for data preprocessing. The main task is divided into three sub-tasks, which are solved by Term Frequency–Inverse Document Frequency (TF-IDF), Bidirectional Encoder Representation from Transformer (BERT), and Multi-dropout, respectively. Meanwhile, we applied different learning rates for different languages and tasks based on BERT and non-BERT models in order to obtain better results. Our team Ferryman ranked 18th, 8th, and 21st with F1-score of 0.91152 on the English Sub-task A, Sub-task B, and Sub-task C, respectively. Furthermore, our team also ranked in the top 20 on Sub-task A of other languages.
SemEval-2020 Task 12 was OffensEval: Multilingual Offensive Language Identification in Social Media (Zampieri et al., 2020). The task was subdivided into multiple languages and datasets were provided for each one. The task was further divided into three sub-tasks: offensive language identification, automatic categorization of offense types, and offense target identification. I participated in sub-task C, that is, offense target identification. For preparing the proposed system, I made use of Deep Learning networks like LSTMs and frameworks like Keras which combine the bag of words model with automatically generated sequence based features and manually extracted features from the given dataset. When trained on 25% of the whole dataset, my system achieves a macro-averaged F1 score of 47.763%.
In this paper, we present our participation in SemEval-2020 Task-12 Subtask-A (English Language) which focuses on offensive language identification from noisy labels. To this end, we developed a hybrid system with the BERT classifier trained with tweets selected using Statistical Sampling Algorithm (SA) and Post-Processed (PP) using an offensive wordlist. Our developed system achieved 34th position with Macro-averaged F1-score (Macro-F1) of 0.90913 over both offensive and non-offensive classes. We further show comprehensive results and error analysis to assist future research in offensive language identification with noisy labels.
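Most of the scores reported in these abstracts are macro-averaged F1 over the classes of a sub-task. As a rough illustration of how that metric is computed for the two sub-task A classes, the following Python sketch may help. The function names and the toy gold/predicted labels are hypothetical and are not taken from any of the systems described here.

```python
from typing import Sequence


def f1_for_class(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 of a single class, treating `label` as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             labels: Sequence[str] = ("OFF", "NOT")) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(gold, pred, label) for label in labels) / len(labels)


# Toy data (invented for illustration only).
gold = ["OFF", "OFF", "NOT", "NOT", "NOT", "NOT"]
pred = ["OFF", "NOT", "NOT", "NOT", "NOT", "OFF"]
score = macro_f1(gold, pred)
print(round(score, 3))
```

Because each class contributes equally regardless of its size, a system that ignores the minority class is penalized more heavily than it would be under plain accuracy.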
This paper describes the systems developed for the I2C Group to participate in Subtasks A and B in English, and Subtask A in Turkish and Arabic, in OffensEval (Task 12 of SemEval 2020). In our experiments we compare three architectures we have developed, two based on Transformer and the other based on classical machine learning algorithms. In this paper, the proposed architectures are described, and the results obtained by our systems are presented.
We describe our system submitted to SemEval 2020. We tackled Task 12, entitled “Multilingual Offensive Language Identification in Social Media”, specifically subtask 4A-Arabic. We propose three Arabic offensive language identification models: Tw-StAR, BERT and BERT+BiLSTM. Two Arabic abusive/hate datasets were added to the training dataset: L-HSAB and T-HSAB. The final submission was chosen based on the best performance, which was achieved by the BERT+BiLSTM model.
In this paper, we describe the participation of the IITP-AINLPML team in the SemEval-2020 Shared Task 12 on Offensive Language Identification and Target Categorization in English Twitter data. Our proposed model learns to extract textual features using a BiGRU-based deep neural network supported by a Hierarchical Attention architecture to focus on the most relevant areas in the text. We leverage the effectiveness of multitask learning while building our models for sub-task A and B. We do necessary undersampling of the over-represented classes in the sub-tasks A and C. During training, we consider a threshold of 0.5 as the separation margin between the instances belonging to classes OFF and NOT in sub-task A and UNT and TIN in sub-task B. For sub-task C, the class corresponding to the maximum score among the given confidence scores of the classes (IND, GRP and OTH) is considered as the final label for an instance. Our proposed model obtains the macro F1-scores of 90.95%, 55.69% and 63.88% in sub-task A, B and C, respectively.
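The label-assignment rule described in this abstract (a 0.5 threshold for the binary sub-tasks, the maximum confidence score for sub-task C) can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and it assumes the confidence value is the score of the first-named class (OFF or TIN) and that a value of exactly 0.5 falls on that side.

```python
from typing import Dict


def binary_label(confidence: float, positive: str, negative: str,
                 threshold: float = 0.5) -> str:
    """Map a confidence score to one of two labels.

    Assumption: `confidence` is the score of the `positive` class, and
    ties at the threshold go to `positive`.
    """
    return positive if confidence >= threshold else negative


def target_label(scores: Dict[str, float]) -> str:
    """Pick the sub-task C class with the highest confidence score."""
    return max(scores, key=scores.get)


# Invented confidence values for illustration.
label_a = binary_label(0.73, positive="OFF", negative="NOT")
label_b = binary_label(0.41, positive="TIN", negative="UNT")
label_c = target_label({"IND": 0.2, "GRP": 0.7, "OTH": 0.1})
print(label_a, label_b, label_c)
```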
This paper describes our participation in OffensEval challenges for English, Arabic, Danish, Turkish, and Greek languages. We used several approaches, such as μTC, TextCategorization, and EvoMSA. Best results were achieved with EvoMSA, which is a multilingual and domain-independent architecture that combines the prediction from different knowledge sources to solve text classification problems.
In this paper, we present our approach and the results of our participation in OffensEval 2020. There are three sub-tasks in OffensEval 2020, namely offensive language identification (sub-task A), automatic categorization of offense types (sub-task B), and offense target identification (sub-task C). We participated in sub-task A of English OffensEval 2020. Our approach emphasizes how emoji affect offensive language identification. Our model used LSTM combined with GloVe pre-trained word vectors to identify offensive language on social media. The best model obtained a macro F1-score of 0.88428.
The paper describes the systems that our team IRLab_DAIICT employed for the shared task OffensEval2020: Multilingual Offensive Language Identification in Social Media. We conducted experiments on the English language dataset, which contained weakly labelled data. There were three sub-tasks, but we only participated in sub-tasks A and B. We employed machine learning techniques like Logistic Regression, Support Vector Machine, and Random Forest, and deep learning techniques like Convolutional Neural Network and BERT. Our best approach achieved a Macro F1 score of 0.91 for sub-task A and 0.64 for sub-task B.
This paper describes the IRlab@IIT-BHU system for the OffensEval 2020. We take the SVM with TF-IDF features to identify and categorize hate speech and offensive language in social media for two languages. In subtask A, we used a linear SVM classifier to detect abusive content in tweets, achieving a macro F1 score of 0.779 and 0.718 for Arabic and Greek, respectively.
In this paper, we describe our submissions to the SemEval-2020 contest. We tackled Task 12, “Multilingual Offensive Language Identification in Social Media”. We developed different models for four languages: Arabic, Danish, Greek, and Turkish. We applied three supervised machine learning methods using various combinations of character and word n-gram features. In addition, we applied various combinations of basic preprocessing methods. Our best submission was a model we built for offensive language identification in Danish using Random Forest. This model was ranked 6th out of 39 submissions. Our result is lower by only 0.0025 than the result of the team that took 4th place using entirely non-neural methods. Our experiments indicate that character n-gram features are more helpful than word n-gram features. This phenomenon probably occurs because tweets are more characterized by characters than by words: tweets are short and contain various special sequences of characters, e.g., hashtags, shortcuts, slang words, and typos.
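One way to see why character n-grams can be more robust than word n-grams on tweets is to compare how much two spellings of the same message overlap under each representation. The Python sketch below is an illustration only: the helper names and the two example strings are hypothetical, and it does not reproduce the feature pipeline used by the authors.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams (spaces included)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text: str, n: int) -> Counter:
    """Count overlapping word n-grams after whitespace tokenization."""
    tokens = text.split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# The same message in standard spelling and with shortcuts and a stretched word.
standard = "you are stupid"
noisy = "u r stuuupid"

shared_chars = set(char_ngrams(standard, 3)) & set(char_ngrams(noisy, 3))
shared_words = set(word_ngrams(standard, 1)) & set(word_ngrams(noisy, 1))

print(sorted(shared_chars))  # character trigrams common to both spellings
print(sorted(shared_words))  # word unigrams common to both spellings
```

In this toy case the two spellings share no word unigrams at all, while several character trigrams survive the respelling, which is consistent with the explanation given in the abstract.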
This paper presents the approach of Team KAFK for the English edition of SemEval-2020 Task 12. We use checkpoint ensembling to create ensembles of BERT-based transformers and show that it can improve the performance of classification systems. We explore attention mask dropout to mitigate for the poor constructs of social media texts. Our classifiers scored macro-f1 of 0.909, 0.551 and 0.616 for subtasks A, B and C respectively. The code is publicly released online.
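The abstract does not say how the checkpoints are combined. One common form of checkpoint ensembling is to average the class probabilities predicted by several checkpoints saved during a single training run. The Python sketch below illustrates that variant under this assumption; the function name, the label order, and the probability values are hypothetical.

```python
from typing import List, Sequence


def ensemble_checkpoints(checkpoint_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class probabilities across checkpoints for one example.

    `checkpoint_probs[k][c]` is the probability that checkpoint k assigns
    to class c. Averaging probabilities is an assumption made for this sketch.
    """
    n_checkpoints = len(checkpoint_probs)
    n_classes = len(checkpoint_probs[0])
    return [
        sum(probs[c] for probs in checkpoint_probs) / n_checkpoints
        for c in range(n_classes)
    ]


LABELS = ["NOT", "OFF"]  # hypothetical class order

# Invented predictions from three checkpoints for a single tweet.
probs = [[0.40, 0.60], [0.55, 0.45], [0.30, 0.70]]
averaged = ensemble_checkpoints(probs)
label = LABELS[averaged.index(max(averaged))]
print(averaged, label)
```

Here the second checkpoint alone would predict NOT, but the averaged distribution favors OFF, which shows how the ensemble can smooth out the variance of individual checkpoints.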
In recent years, with the development of social network services and video distribution services, there has been a sharp increase in offensive posts. In this paper, we present our approach for detecting hate speech in tweets as defined in SemEval-2020 Task 12. Our system performs precise classification by using features extracted from two different layers of a pre-trained model, BERT-large, and ensembling them.
This paper presents the participation of our team KEIS@JUST in SemEval-2020 Task 12, a shared task on multilingual offensive language identification. We participated in all the provided languages for all subtasks except sub-task A for the English language. Two main approaches were developed. The first, applied to Arabic and English, is a weighted ensemble consisting of Bi-GRU and CNN, followed by Gaussian noise and a global pooling layer, multiplied by weights to improve the overall performance. The second, applied to the other languages, uses transfer learning from BERT alongside recurrent neural networks such as Bi-LSTM and Bi-GRU, followed by a global average pooling layer. Word embeddings and contextual embeddings were used as features; moreover, data augmentation was used only for the Arabic language.
This paper describes the KS@LTH system for SemEval-2020 Task 12 OffensEval2: Multilingual Offensive Language Identification in Social Media. We compare mono- and multilingual models based on fine-tuning pre-trained transformer models for offensive language identification in Arabic, Greek, English and Turkish. For Danish, we explore the possibility of fine-tuning a model pre-trained on a similar language, Swedish, and additionally also cross-lingual training together with English.
In this paper, we describe our approach to utilizing pre-trained BERT models with Convolutional Neural Networks for sub-task A of the Multilingual Offensive Language Identification shared task (OffensEval 2020), which is a part of SemEval 2020. We show that combining CNN with BERT is better than using BERT on its own, and we emphasize the importance of utilizing pre-trained language models for downstream tasks. Our system ranked 4th with a macro-averaged F1-score of 0.897 in Arabic, 4th with a score of 0.843 in Greek, and 3rd with a score of 0.814 in Turkish. Additionally, we present ArabicBERT, a set of pre-trained transformer language models for Arabic that we share with the community.
Nowadays, offensive content in social media has become a serious problem, and automatically detecting offensive language is an essential task. In this paper, we build an offensive language detection system, which combines multi-task learning with BERT-based models. Using a pre-trained language model such as BERT, we can effectively learn the representations for noisy text in social media. Besides, to boost the performance of offensive language detection, we leverage the supervision signals from other related tasks. In the OffensEval-2020 competition, our model achieves 91.51% F1 score in English Sub-task A, which is comparable to the first place (92.23% F1). An empirical analysis is provided to explain the effectiveness of our approaches.
This article describes the system submitted to SemEval 2020 Task 12: OffensEval 2020. This task aims to identify and classify offensive language in different languages on social media. We only participate in the English part of subtask A, which aims to identify offensive language in English. To solve this task, we propose a BERT model system based on the transform mechanism, and use the maximum self-ensemble to improve model performance. Our model achieved a macro F1 score of 0.913 (ranked 13/82) in subtask A.
This paper presents our system entitled ‘LIIR’ for SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval 2). We have participated in sub-task A for English, Danish, Greek, Arabic, and Turkish languages. We adapt and fine-tune the BERT and Multilingual Bert models made available by Google AI for English and non-English languages respectively. For the English language, we use a combination of two fine-tuned BERT models. For other languages we propose a cross-lingual augmentation approach in order to enrich training data and we use Multilingual BERT to obtain sentence representations.
AraBERT is an Arabic version of the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) model. The latter has achieved good performance in a variety of Natural Language Processing (NLP) tasks. In this paper, we propose an effective AraBERT embeddings-based method for dealing with offensive Arabic language on Twitter. First, we pre-process tweets by handling emojis and including their Arabic meanings. To overcome the pretrain-finetune discrepancy, we substitute each detected emoji with the special token [MASK] in both the fine-tuning and inference phases. Then, we represent tweet tokens by applying the AraBERT model. Finally, we feed the tweet representation into a sigmoid function to decide whether a tweet is offensive or not. The proposed method achieved the best results on the OffensEval 2020 Arabic task and reached a macro F1 score of 90.17%.
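The emoji-substitution step mentioned in this abstract can be sketched in Python as below. This is only an illustration of replacing emojis with the [MASK] token: the function name is hypothetical, the regular expression covers a rough, assumed set of Unicode emoji ranges rather than the full emoji inventory, and the step that adds Arabic emoji meanings is not shown.

```python
import re

# Assumption: a coarse approximation of emoji code points (pictographs,
# emoticons, and common symbols). A real system would likely use a
# dedicated emoji lexicon instead.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def mask_emojis(tweet: str, mask_token: str = "[MASK]") -> str:
    """Replace every detected emoji with `mask_token`, separated by spaces."""
    masked = EMOJI_PATTERN.sub(f" {mask_token} ", tweet)
    # Collapse the extra whitespace introduced by the substitution.
    return " ".join(masked.split())


# Example with two emojis written as escapes (grinning face, pouting face).
example = mask_emojis("good morning \U0001F600\U0001F621")
print(example)
```

Applying the same function at fine-tuning and inference time keeps the input distribution consistent across both phases, which is the motivation stated in the abstract.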
In this paper, we present the system submitted to “SemEval-2020 Task 12”. The proposed system aims at automatically identifying offensive language in Arabic tweets. A machine learning based approach has been used to design our system. We implemented a linear classifier with Stochastic Gradient Descent (SGD) as the optimization algorithm. Our model reported F1-scores of 84.20% and 81.82% on the development set and test set, respectively. The best-performing system and the last-ranked system reported F1-scores of 90.17% and 44.51% on the test set, respectively.
This paper describes a neural network (NN) model that was used for participating in OffensEval, Task 12 of the SemEval 2020 workshop. The aim of this task is to identify offensive speech in social media, particularly in tweets. The model we used, C-BiGRU, is composed of a Convolutional Neural Network (CNN) along with a bidirectional Recurrent Neural Network (RNN). A multidimensional numerical representation (embedding) for each of the words in the tweets used by the model was determined using fastText. This allowed for using a dataset of labeled tweets to train the model on detecting combinations of words that may convey an offensive meaning. This model was used in sub-task A of the English, Turkish and Danish competitions of the workshop, achieving F1 scores of 90.88%, 76.76% and 76.70%, respectively.
In this paper, we introduce our submission for the SemEval Task 12, sub-tasks A and B for offensive language identification and categorization in English tweets. This year the data set for Task A is significantly larger than in the previous year. Therefore, we have adapted the BlazingText algorithm to extract embedding representation and classify texts after filtering and sanitizing the dataset according to the conventional text patterns on social media. We have gained both advantages of a speedy training process and obtained a good F1 score of 90.88% on the test set. For sub-task B, we opted to fine-tune a Bidirectional Encoder Representation from a Transformer (BERT) to accommodate the limited data for categorizing offensive tweets. We have achieved an F1 score of only 56.86%, but after experimenting with various label assignment thresholds in the pre-processing steps, the F1 score improved to 64%.
This paper presents our hierarchical multi-task learning (HMTL) and multi-task learning (MTL) approaches for improving the text encoder in Sub-tasks A, B, and C of Multilingual Offensive Language Identification in Social Media (SemEval-2020 Task 12). We show that using the MTL approach can greatly improve the performance of complex problems, i.e. Sub-tasks B and C. Coupled with a hierarchical approach, the performances are further improved. Overall, our best model, HMTL outperforms the baseline model by 3% and 2% of Macro F-score in Sub-tasks B and C of OffensEval 2020, respectively.
The paper presents a system developed for the SemEval-2020 competition Task 12 (OffensEval-2): Multilingual Offensive Language Identification in Social Media. We achieve the second place (2nd) in sub-task B: Automatic categorization of offense types and are ranked 55th with a macro F1-score of 90.59 in sub-task A: Offensive language identification. Our solution is using a stack of BERT and LSTM layers, training with the Noisy Student method. Since the tweets data contains a large number of noisy words and slang, we update the vocabulary of the BERT large model pre-trained by the Google AI Language team. We fine-tune the model with tweet sentences provided in the challenge.
This paper describes a system (pin_cod_) built for SemEval 2020 Task 12: OffensEval: Multilingual Offensive Language Identification in Social Media (Zampieri et al., 2020). I present the system based on the architecture of bidirectional long short-term memory networks (BiLSTM) concatenated with lexicon-based features and a social-network specific feature and then followed by two fully connected dense layers for detecting Turkish offensive tweets. The pin_cod_ system achieved a macro F1-score of 0.7496 for Sub-task A - Offensive language identification in Turkish.
In this paper, we present various systems submitted by our team problemConquero for SemEval-2020 Shared Task 12 “Multilingual Offensive Language Identification in Social Media”. We participated in all the three sub-tasks of OffensEval-2020, and our final submissions during the evaluation phase included transformer-based approaches and a soft label-based approach. BERT based fine-tuned models were submitted for each language of sub-task A (offensive tweet identification). RoBERTa based fine-tuned model for sub-task B (automatic categorization of offense types) was submitted. We submitted two models for sub-task C (offense target identification), one using soft labels and the other using BERT based fine-tuned model. Our ranks for sub-task A were Greek-19 out of 37, Turkish-22 out of 46, Danish-26 out of 39, Arabic-39 out of 53, and English-20 out of 85. We achieved a rank of 28 out of 43 for sub-task B. Our best rank for sub-task C was 20 out of 39 using BERT based fine-tuned model.
This paper describes SalamNET, an Arabic offensive language detection system that has been submitted to SemEval 2020 shared task 12: Multilingual Offensive Language Identification in Social Media. Our approach focuses on applying multiple deep learning models and conducting in-depth error analysis of results to provide system implications for future development considerations. To pursue our goal, a Recurrent Neural Network (RNN), a Gated Recurrent Unit (GRU), and Long-Short Term Memory (LSTM) models with different design architectures have been developed and evaluated. SalamNET, a Bi-directional Gated Recurrent Unit (Bi-GRU) based model, reports a macro-F1 score of 0.83.
This paper discusses how ML based classifiers can be enhanced disproportionately by adding small amounts of qualitative linguistic knowledge. As an example we present the Danish classifier Smatgrisene, our contribution to the recent OffensEval Challenge 2020. The classifier was trained on 3000 social media posts annotated for offensiveness, supplemented by rules extracted from the reference work on Danish offensive language (Rathje 2014b). Smatgrisene did surprisingly well in the competition in spite of its extremely simple design, showing an interesting trade-off between technological muscle and linguistic intelligence. Finally, we comment on the perspectives in combining qualitative and quantitative methods for NLP.
In this paper, we present our approaches and results for SemEval-2020 Task 12, Multilingual Offensive Language Identification in Social Media (OffensEval 2020). The OffensEval 2020 had three subtasks: A) Identifying the tweets to be offensive (OFF) or non-offensive (NOT) for Arabic, Danish, English, Greek, and Turkish languages, B) Detecting if the offensive tweet is targeted (TIN) or untargeted (UNT) for the English language, and C) Categorizing the offensive targeted tweets into three classes, namely: individual (IND), Group (GRP), or Other (OTH) for the English language. We participate in all the subtasks A, B, and C. In our solution, first we use the pre-trained BERT model for all subtasks, A, B, and C and then we apply the BiLSTM model with attention mechanism (Attn-BiLSTM) for the same. Our result demonstrates that the pre-trained model is not giving good results for all types of languages and is compute and memory intensive whereas the Attn-BiLSTM model is fast and gives good accuracy with fewer resources. The Attn-BiLSTM model is giving better accuracy for Arabic and Greek where the pre-trained model is not able to capture the complete context of these languages due to lower vocab-size.
Offensive language identification (OLI) in user generated text is the automatic detection of any profanity, insult, obscenity, racism or vulgarity that is addressed towards an individual or a group. Due to the immense growth and usage of social media, it has an extensive reach and impact on society. OLI is helpful for hate speech detection, flame detection and cyber bullying; hence it is used to avoid abuse and hurt. In this paper, we present state of the art machine learning approaches for OLI. We follow several approaches which include classifiers like Naive Bayes and Support Vector Machine (SVM), and deep learning approaches like Recurrent Neural Network (RNN) and Masked LM (MLM). The approaches are evaluated on the OffensEval@SemEval2020 dataset and our team ssn_nlp submitted runs for the third task of the OffensEval shared task. The best run of ssn_nlp, which uses BERT (Bidirectional Encoder Representations from Transformers) for training the OLI model, obtained an F1 score of 0.61. The model performs with an accuracy of 0.80 and an evaluation loss of 1.0828. The model has a precision of 0.72 and a recall of 0.58.
Offensive language identification aims to detect hurtful tweets, derogatory comments, and swear words on social media. With the growth of social media communication, offensive language detection has received more attention in recent years; we perform the task on English, Danish and Greek. We investigate which is more effective: the pre-trained model BERT (Bidirectional Encoder Representation from Transformer) or machine learning approaches. Our investigation shows the difference in performance between the three languages, and the best performance is identified by evaluating the classification algorithms. In the shared task SemEval-2020, our team SSN_NLP_MLRG submitted runs for three languages: Subtasks A, B, and C in English, Subtask A in Danish, and Subtask A in Greek. Our team SSN_NLP_MLRG obtained F1 scores of 0.90, 0.61, and 0.52 for Subtasks A, B, and C in English, 0.56 for Subtask A in Danish, and 0.67 for Subtask A in Greek.
This paper summarizes our group’s efforts in the offensive language identification shared task, which is organized as part of the International Workshop on Semantic Evaluation (Sem-Eval2020). Our final submission system is an ensemble of three different models, (1) CNN-LSTM, (2) BiLSTM-Attention and (3) BERT. Word embeddings, which were pre-trained on tweets, are used while training the first two models. BERTurk, which is the first BERT model for Turkish, is also explored. Our final submitted approach ranked as the second best model in the Turkish sub-task.
Usage of offensive language on social media is getting more common these days, and there is a need of a mechanism to detect it and control it. This paper deals with offensive language detection in five different languages; English, Arabic, Danish, Greek and Turkish. We presented an almost similar ensemble pipeline comprised of machine learning and deep learning models for all five languages. Three machine learning and four deep learning models were used in the ensemble. In the OffensEval-2020 competition our model achieved F1-score of 0.85, 0.74, 0.68, 0.81, and 0.9 for Arabic, Turkish, Danish, Greek and English language tasks respectively.
With the growing use of social media and its availability, many instances of the use of offensive language have been observed across multiple languages and domains. This phenomenon has given rise to the growing need to detect the offensive language used in social media cross-lingually. In OffensEval 2020, the organizers have released the multilingual Offensive Language Identification Dataset (mOLID), which contains tweets in five different languages, to detect offensive language. In this work, we introduce a cross-lingual inductive approach to identify the offensive language in tweets using the contextual word embedding XLM-RoBERTa (XLM-R). We show that our model performs competitively on all five languages, obtaining the fourth position in the English task with an F1-score of 0.919 and eighth position in the Turkish task with an F1-score of 0.781. Further experimentation proves that our model works competitively in a zero-shot learning environment, and is extensible to other languages.
This paper describes the work of identifying the presence of offensive language in social media posts and categorizing a post as targeted to a particular person or not. The work developed by team TECHSSN for solving the Multilingual Offensive Language Identification in Social Media (Task 12) in SemEval-2020 involves the use of deep learning models with BERT embeddings. The dataset is preprocessed and given to a Bidirectional Encoder Representations from Transformers (BERT) model with pretrained weight vectors. The model is retrained and the weights are learned for the offensive language dataset. We have developed a system with the English language dataset. The results are better when compared to the model we developed in SemEval-2019 Task6.
Hate speech detection on social media platforms is crucial as it helps to avoid severe situations and severe harm to marginalized people and groups. The application of Natural Language Processing (NLP) and Deep Learning has garnered encouraging results in the task of hate speech detection. The expression of hate, however, is varied and ever evolving. Thus, it is important for better detection systems to adapt to this variance. Because of this, researchers keep collecting data and regularly organize hate speech detection competitions. In this paper, we discuss our entry to one such competition, namely the English version of sub-task A of the OffensEval competition. Our contribution can be perceived through our results, which were first an F1-score of 0.9089 and, with further refinements described here, climbed to 0.9166. This gives more support to our hypothesis that one of the variants of BERT (Devlin et al., 2018), namely RoBERTa, can successfully differentiate between offensive and not-offensive tweets, given some preprocessing steps (also outlined here).
In this paper, we built several pre-trained models to participate in SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media. In the common task of Offensive Language Identification in Social Media, pre-trained models such as Bidirectional Encoder Representation from Transformer (BERT) have achieved good results. We preprocess the dataset according to the language habits of users on social networks. Considering the data imbalance in OffensEval, we screened the newly provided machine-annotated samples to construct a new dataset. We use the dataset to fine-tune the Robustly Optimized BERT Pretraining Approach (RoBERTa). For the English subtask B, we adopted the method of adding Auxiliary Sentences (AS) to transform the single-sentence classification task into a relationship recognition task between sentences. Our team UJNLP ranked 16th of 85 in English subtask A (Offensive language identification).
This paper outlines our approach to Tasks A & B for the English Language track of SemEval-2020 Task 12: OffensEval 2: Multilingual Offensive Language Identification in Social Media. We use a Linear SVM with document vectors computed from pre-trained word embeddings, and we explore the effectiveness of lexical, part of speech, dependency, and named entity (NE) features. We manually annotate a subset of the training data, which we use for error analysis and to tune a threshold for mapping training confidence values to labels. While document vectors are consistently the most informative features for both tasks, testing on the development set suggests that dependency features are an effective addition for Task A, and NE features for Task B.
Pre-trained language model word representations, such as BERT, have been extremely successful in several Natural Language Processing tasks, significantly improving on the state-of-the-art. This can largely be attributed to their ability to better capture semantic information contained within a sentence. Several tasks, however, can benefit from information available at a corpus level, such as Term Frequency-Inverse Document Frequency (TF-IDF). In this work we test the effectiveness of integrating this information with BERT on the task of identifying abuse on social media and show that integrating this information with BERT does indeed significantly improve performance. We participate in Sub-Task A (abuse detection) wherein we achieve a score within two points of the top performing team and in Sub-Task B (target detection) wherein we are ranked 4 of the 44 participating teams.
Offensive language detection is one of the most challenging problems in the natural language processing field, driven by the rising presence of this phenomenon in online social media. This paper describes our Transformer-based solutions for identifying offensive language on Twitter in five languages (i.e., English, Arabic, Danish, Greek, and Turkish), which were employed in Subtask A of the OffensEval 2020 shared task. Several neural architectures (i.e., BERT, mBERT, RoBERTa, XLM-RoBERTa, and ALBERT), pre-trained using both single-language and multilingual corpora, were fine-tuned and compared using multiple combinations of datasets. Finally, the highest-scoring models were used for our submissions in the competition, which ranked our team 21st of 85, 28th of 53, 19th of 39, 16th of 37, and 10th of 46 for English, Arabic, Danish, Greek, and Turkish, respectively.
Offensive language is a common issue on social media platforms nowadays. In an effort to address this issue, the SemEval 2020 event held the OffensEval 2020 shared task where the participants were challenged to develop systems that identify and classify offensive language in tweets. In this paper, we present a system that uses an Ensemble model stacking a BOW model and a CNN model that led us to place 29th in the ranking for English sub-task A.
Communicating through social platforms has become one of the principal means of personal communication and interaction. Unfortunately, healthy communication is often disrupted by offensive language that can have damaging effects on the users. A key to fighting offensive language on social media is the existence of an automatic offensive language detection system. This paper presents the results and the main findings of SemEval-2020 Task 12 OffensEval Sub-task A (Zampieri et al., 2020), on Identifying and Categorising Offensive Language in Social Media. The task was based on the Arabic OffensEval dataset (Mubarak et al., 2020). In this paper, we describe the system submitted by WideBot AI Lab for the shared task, which ranked 10th out of 52 participants with a Macro-F1 of 86.9% on the golden dataset under the CodaLab username “yasserotiefy”. We experimented with various models and the best model is a linear SVM in which we use a combination of both character and word n-grams. We also introduced a neural network approach that enhanced the predictive ability of our system and that includes CNN, highway network, Bi-LSTM, and attention layers.
This paper presents six document classification models using the latest transformer encoders and a high-performing ensemble model for the task of offensive language identification in social media. For the individual models, deep transformer layers are applied to perform multi-head attention. For the ensemble model, the utterance representations taken from those individual models are concatenated and fed into a linear decoder to make the final decisions. Our ensemble model outperforms the individual models and shows up to 8.6% improvement over the individual models on the development set. On the test set, it achieves a macro-F1 of 90.9% and is among the high-performing systems out of 85 participants in sub-task A of this shared task. Our analysis shows that although the ensemble model significantly improves the accuracy on the development set, the improvement is not as evident on the test set.
This article describes the system submitted to SemEval-2020 Task 12 OffensEval 2: Multilingual Offensive Language Recognition in Social Media. The task is to classify offensive language in social media. The shared task covers five languages (English, Greek, Arabic, Danish, and Turkish) and three subtasks. We participated only in English subtask A, which is to identify offensive language. To solve this task, we proposed a system based on a Bidirectional Gated Recurrent Unit (Bi-GRU) with a Capsule model. Finally, we used a K-fold approach for ensembling. Our model achieved a macro-average F1 score of 0.90969 (ranked 27/85) in subtask A.
Domain knowledge is important for understanding both the lexical and relational associations of words in natural language text, especially for domain-specific tasks like Natural Language Inference (NLI) in the medical domain, where, due to the lack of a large annotated dataset, such knowledge cannot be implicitly learned during training. However, because of the linguistic idiosyncrasies of clinical texts (e.g., shorthand jargon), relying solely on domain knowledge from an external knowledge base (e.g., UMLS) can lead to wrong inference predictions, as it disregards contextual information and, hence, does not return the most relevant mapping. To remedy this, we devise a knowledge-adaptive approach for medical NLI that encodes the premise/hypothesis texts by leveraging supplementary external knowledge, alongside the UMLS, based on the word contexts. By incorporating refined domain knowledge at both the lexical and relational levels through a multi-source attention mechanism, it is able to align the token-level interactions between the premise and hypothesis more effectively. Comprehensive experiments and a case study on the recently released MedNLI dataset are conducted to validate the effectiveness of the proposed approach.
In the recent past, Natural Language Inference (NLI) has gained significant attention, particularly given its promise for downstream NLP tasks. However, its true impact is limited and has not been well studied. Therefore, in this paper, we explore the utility of NLI for one of the most prominent downstream tasks, viz. Question Answering (QA). We transform one of the largest available MRC datasets (RACE) to an NLI form, and compare the performance of a state-of-the-art model (RoBERTa) on both these forms. We propose new characterizations of questions, and evaluate the performance of QA and NLI models on these categories. We highlight clear categories for which the model is able to perform better when the data is presented in a coherent entailment form, and a structured question-answer concatenation form, respectively.
Tackling Natural Language Inference with a logic-based method is becoming less and less common. While this might have been counterintuitive several decades ago, nowadays it seems pretty obvious. The main reasons for such a conception are that (a) logic-based methods are usually brittle when it comes to processing wide-coverage texts, and (b) instead of automatically learning from data, they require much manual effort for development. We take a step towards overcoming such shortcomings by modeling learning from data as abduction: reversing a theorem-proving procedure to abduce semantic relations that serve as the best explanation for the gold label of an inference problem. In other words, instead of proving sentence-level inference relations with the help of lexical relations, the lexical relations are proved taking into account the sentence-level inference relations. We implement the learning method in a tableau theorem prover for natural language and show that it improves the performance of the theorem prover on the SICK dataset by 1.4% while still maintaining high precision (>94%). The obtained results are competitive with the state of the art among logic-based systems.
Collecting modality exclusivity norms for lexical items has recently become a common practice in psycholinguistics and cognitive research. However, these norms are available only for a relatively small number of languages and often involve a costly and time-consuming collection of ratings. In this work, we aim at learning a mapping between word embeddings and modality norms. Our experiments focused on crosslingual word embeddings, in order to predict modality association scores by training on a high-resource language and testing on a low-resource one. We ran two experiments, one in a monolingual and the other one in a crosslingual setting. Results show that modality prediction using off-the-shelf crosslingual embeddings indeed has moderate-to-high correlations with human ratings even when regression algorithms are trained on an English resource and tested on a completely unseen language.
In this paper, we propose a novel method for learning cross-lingual word embeddings that incorporates sub-word information during training and is able to learn high-quality embeddings from modest amounts of monolingual data and a bilingual lexicon. This method could be particularly well-suited to learning cross-lingual embeddings for lower-resource, morphologically rich languages, enabling knowledge to be transferred from rich- to lower-resource languages. We evaluate our proposed approach by simulating lower-resource languages for bilingual lexicon induction, monolingual word similarity, and document classification. Our results indicate that incorporating sub-word information indeed leads to improvements and, in the case of document classification, to performance better than, or on par with, strong benchmark approaches.
Building on recent advances in semantic parsing and text simplification, we investigate the use of semantic splitting of the source sentence as preprocessing for machine translation. We experiment with a Transformer model and evaluate using large-scale crowd-sourcing experiments. Results show a significant increase in fluency on long sentences in an English-to-French setting with a training corpus of 5M sentence pairs, while retaining comparable adequacy. We also perform a manual analysis which explores the tradeoff between adequacy and fluency in the case where all sentence lengths are considered.
Emotion stimulus detection is the task of finding the cause of an emotion in a textual description, similar to target or aspect detection for sentiment analysis. Previous work approached this in three ways, namely (1) as text classification into an inventory of predefined possible stimuli (“Is the stimulus category A or B?”), (2) as sequence labeling of tokens (“Which tokens describe the stimulus?”), and (3) as clause classification (“Does this clause contain the emotion stimulus?”). So far, setting (3) has been evaluated broadly on Mandarin and (2) on English, but no comparison has been performed. Therefore, we analyze whether clause classification or token sequence labeling is better suited for emotion stimulus detection in English. We propose an integrated framework which enables us to evaluate the two different approaches comparably, implement models inspired by state-of-the-art approaches in Mandarin, and test them on four English data sets from different domains. Our results show that token sequence labeling is superior on three out of four datasets, in both clause-based and token sequence-based evaluation. The only case in which clause classification performs better is one data set with a high density of clause annotations. Our error analysis further confirms quantitatively and qualitatively that clauses are not the appropriate stimulus unit in English.
Operationalizing morality is crucial for understanding multiple aspects of society that have moral values at their core, such as riots, mobilizing movements, and public debates. Moral Foundations Theory (MFT) has become one of the most adopted theories of morality, partly due to its accompanying lexicon, the Moral Foundation Dictionary (MFD), which offers a base for computationally dealing with morality. In this work, we exploit the MFD in a novel direction by investigating how well moral values are captured by knowledge graphs (KGs). We explore three widely used KGs, and provide concept-level analogues for the MFD. Furthermore, we propose several Personalized PageRank variations in order to score all the concepts and entities in the KGs with respect to their relevance to the different moral values. Our promising results help to progress the operationalization of morality in both the NLP and KG communities.
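For readers unfamiliar with Personalized PageRank, the following minimal sketch shows the standard power-iteration form, in which the random walk restarts only at a set of seed nodes, so scores reflect proximity to those seeds. It is a textbook version and not one of the variations proposed in the paper; the toy graph, the seed word, and the parameter values are hypothetical.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=100):
    """Standard Personalized PageRank by power iteration.

    graph: dict mapping each node to a list of its out-neighbours
           (every neighbour must also be a key of the dict).
    seeds: non-empty set of nodes that receive all of the restart probability.
    """
    nodes = list(graph)
    # Restart distribution: uniform over the seeds, zero elsewhere.
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for node in nodes:
            neighbours = graph[node]
            if neighbours:
                # Spread this node's score evenly over its out-links.
                share = damping * rank[node] / len(neighbours)
                for neighbour in neighbours:
                    new_rank[neighbour] += share
            else:
                # Dangling node: send its score back to the seeds.
                for n in nodes:
                    new_rank[n] += damping * rank[node] * teleport[n]
        rank = new_rank
    return rank


# Hypothetical toy graph; "care" stands in for a seed concept from a moral lexicon.
toy_graph = {
    "care": ["nurse", "protect"],
    "nurse": ["care", "hospital"],
    "protect": ["care"],
    "hospital": ["nurse"],
    "tax": ["invoice"],
    "invoice": ["tax"],
}
scores = personalized_pagerank(toy_graph, seeds={"care"})
```

In this toy graph, nodes connected to the seed receive positive scores, while the disconnected pair "tax" and "invoice" stays at zero, which is the behaviour that makes the scores usable as a relevance ranking with respect to the seed.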
There is growing evidence that the prevalence of disagreement in the raw annotations used to construct natural language inference datasets makes the common practice of aggregating those annotations to a single label problematic. We propose a generic method that allows one to skip the aggregation step and train on the raw annotations directly without subjecting the model to unwanted noise that can arise from annotator response biases. We demonstrate that this method, which generalizes the notion of a mixed effects model by incorporating annotator random effects into any existing neural model, improves performance over models that do not incorporate such effects.
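One way to picture annotator random effects in a neural classifier, offered only as a sketch: assume a network $f_\theta$ that maps an input $x_i$ to a vector of logits, and give each annotator $a$ an offset vector $u_a$ drawn from a zero-mean Gaussian. The symbols $f_\theta$, $u_a$, and $\Sigma$ are notation assumed here for illustration and may differ from the formulation used in the paper.

```latex
\begin{align}
  u_a &\sim \mathcal{N}(0, \Sigma), \\
  p(y_{ia} = k \mid x_i, a) &=
    \frac{\exp\big(f_\theta(x_i)_k + u_{a,k}\big)}
         {\sum_{k'} \exp\big(f_\theta(x_i)_{k'} + u_{a,k'}\big)}.
\end{align}
```

Under this reading, each raw annotation $y_{ia}$ contributes its own likelihood term, so no aggregation to a single label is needed, and setting $u_a = 0$ at test time would give the prediction for a typical annotator.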
Contextualized word representations have become a driving force in NLP, motivating widespread interest in understanding their capabilities and the mechanisms by which they operate. Particularly intriguing is their ability to identify and encode conceptual abstractions. Past work has probed BERT representations for this competence, finding that BERT can correctly retrieve noun hypernyms in cloze tasks. In this work, we ask the question: do probing studies shed light on systematic knowledge in BERT representations? As a case study, we examine hypernymy knowledge encoded in BERT representations. In particular, we demonstrate through a simple consistency probe that the ability to correctly retrieve hypernyms in cloze tasks, as used in prior work, does not correspond to systematic knowledge in BERT. Our main conclusion is cautionary: even if BERT demonstrates high probing accuracy for a particular competence, it does not necessarily follow that BERT ‘understands’ a concept, and it cannot be expected to systematically generalize across applicable contexts.
The manifold hypothesis suggests that word vectors live on a submanifold within their ambient vector space. We argue that we should, more accurately, expect them to live on a pinched manifold: a singular quotient of a manifold obtained by identifying some of its points. The identified, singular points correspond to polysemous words, i.e. words with multiple meanings. Our point of view suggests that monosemous and polysemous words can be distinguished based on the topology of their neighbourhoods. We present two kinds of empirical evidence to support this point of view: (1) We introduce a topological measure of polysemy based on persistent homology that correlates well with the actual number of meanings of a word. (2) We propose a simple, topologically motivated solution to the SemEval-2010 task on Word Sense Induction & Disambiguation that produces competitive results.
Co-predication is one of the most frequently used linguistic tests to tell apart shifts in polysemic sense from changes in homonymic meaning. It is increasingly coming under criticism as evidence is accumulating that it tends to mis-classify specific cases of polysemic sense alteration as homonymy. In this paper, we collect empirical data to investigate these accusations. We assess how co-predication acceptability relates to explicit ratings of polyseme word sense similarity, and how well either measure can be predicted through the distance between target words’ contextualised word embeddings. We find that sense similarity appears to be a major contributor in determining co-predication acceptability, but that co-predication judgements tend to rate especially less similar sense interpretations as unacceptable as homonym pairs, effectively mis-classifying these instances. The tested contextualised word embeddings fail to predict word sense similarity consistently, but the similarities between BERT embeddings show a significant correlation with co-predication ratings. We take this finding as evidence that BERT embeddings might be better representations of context than encodings of word meaning.
Explanation generation, introduced with the WorldTree corpus (Jansen et al., 2018), is an emerging NLP task involving multi-hop inference for explaining the correct answer in multiple-choice QA. It is a challenging task, as evidenced by the low state-of-the-art performance (below 60% in F-score) demonstrated on it. Of the state-of-the-art approaches, fine-tuned transformer-based (Vaswani et al., 2017) BERT models have shown great promise toward continued system performance improvements compared with approaches relying on surface-level cues alone, which demonstrate performance saturation. In this work, we take a novel direction by addressing a particular linguistic characteristic of the data: we introduce a novel and lightweight focus feature in the transformer-based model and examine task improvements. Our evaluations reveal a significantly positive impact of this lightweight focus feature, which achieves high scores, second only to a considerably more computationally intensive system.
Our paper offers a computational model of the semantic recoverability of verb arguments, tested in particular on direct objects and Instruments. Our fully distributional model is intended to improve on older taxonomy-based models, which require a lexicon in addition to the training corpus. We computed the selectional preferences of 99 transitive verbs and 173 Instrument verbs as the mean value of the pairwise cosines between their arguments (a weighted mean between all the arguments, or an unweighted mean with the topmost k arguments). Results show that our model can predict the recoverability of objects and Instruments, providing a similar result to that of taxonomy-based models but at a much cheaper computational cost.
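The two averaging schemes mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the weighting, so weighting each pair by the product of the two arguments' frequencies is an assumption, as are the function names and the toy vectors and counts.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def preference_topk(arg_vectors, arg_counts, k):
    """Unweighted mean of pairwise cosines among the k most frequent arguments (k >= 2)."""
    top = sorted(arg_vectors, key=lambda w: arg_counts[w], reverse=True)[:k]
    pairs = list(combinations(top, 2))
    return sum(cosine(arg_vectors[a], arg_vectors[b]) for a, b in pairs) / len(pairs)


def preference_weighted(arg_vectors, arg_counts):
    """Mean of pairwise cosines over all arguments.

    Each pair is weighted by the product of the two arguments' frequencies
    (an assumed weighting scheme, chosen only for illustration).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for a, b in combinations(arg_vectors, 2):
        weight = arg_counts[a] * arg_counts[b]
        weighted_sum += weight * cosine(arg_vectors[a], arg_vectors[b])
        total_weight += weight
    return weighted_sum / total_weight


# Hypothetical direct objects of one verb, with toy 2-d embeddings and corpus counts.
vectors = {"bread": [1.0, 0.0], "cake": [1.0, 0.0], "idea": [0.0, 1.0]}
counts = {"bread": 5, "cake": 3, "idea": 1}

topk_score = preference_topk(vectors, counts, k=2)
weighted_score = preference_weighted(vectors, counts)
```

A verb whose arguments cluster tightly in the embedding space gets a score near 1, indicating strong selectional preferences; in the toy data the rare outlier "idea" lowers the weighted score only slightly and is ignored entirely by the top-2 variant.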
We present a semi-supervised model which learns the semantics of negation purely through analysis of syntactic structure. Linguistic theory posits that the semantics of negation can be understood purely syntactically, though recent research relies on combining a variety of features including part-of-speech tags, word embeddings, and semantic representations to achieve high task performance. Our simplified model returns to syntactic theory and achieves state-of-the-art performance on the task of Negation Scope Detection while demonstrating the tight relationship between the syntax and semantics of negation.
We introduce a new dataset for training and evaluating grounded language models. Our data is collected within a virtual reality environment and is designed to emulate the quality of language data to which a pre-verbal child is likely to have access: That is, naturalistic, spontaneous speech paired with richly grounded visuospatial context. We use the collected data to compare several distributional semantics models for verb learning. We evaluate neural models based on 2D (pixel) features as well as feature-engineered models based on 3D (symbolic, spatial) features, and show that neither modeling approach achieves satisfactory performance. Our results are consistent with evidence from child language acquisition that emphasizes the difficulty of learning verbs from naive distributional data. We discuss avenues for future work on cognitively-inspired grounded language learning, and release our corpus with the intent of facilitating research on the topic.
Dialog state tracking (DST) is a core component in task-oriented dialog systems. Existing approaches for DST mainly fall into one of two categories, namely, ontology-based and ontology-free methods. An ontology-based method selects a value from a candidate-value list for each target slot, while an ontology-free method extracts spans from dialog contexts. Recent work introduced a BERT-based model to strike a balance between the two methods by pre-defining categorical and non-categorical slots. However, it is not clear enough which slots are better handled by either of the two slot types, and the way to use the pre-trained model has not been well investigated. In this paper, we propose a simple yet effective dual-strategy model for DST, by adapting a single BERT-style reading comprehension model to jointly handle both the categorical and non-categorical slots. Our experiments on the MultiWOZ datasets show that our method significantly outperforms the BERT-based counterpart, finding that the key is a deep interaction between the domain-slot and context information. When evaluated on noisy (MultiWOZ 2.0) and cleaner (MultiWOZ 2.1) settings, our method performs competitively and robustly across the two different settings. Our method sets the new state of the art in the noisy setting, while performing more robustly than the best model in the cleaner setting. We also conduct a comprehensive error analysis on the dataset, including the effects of the dual strategy for each slot, to facilitate future research.
We examine a new commonsense reasoning task: given a narrative describing a social interaction that centers on two protagonists, systems make inferences about the underlying relationship trajectory. Specifically, we propose two evaluation tasks: Relationship Outlook Prediction MCQ and Resolution Prediction MCQ. In Relationship Outlook Prediction, a system maps an interaction to a relationship outlook that captures how the interaction is expected to change the relationship. In Resolution Prediction, a system attributes a given relationship outlook to a particular resolution that explains the outcome. These two tasks parallel two real-life questions that people frequently ponder as they navigate different social situations: “where is this relationship going?” and “how did we end up here?”. To facilitate the investigation of human social relationships through these two tasks, we construct a new dataset, Social Narrative Tree, which consists of 1250 stories documenting a variety of daily social interactions. The narratives encode a multitude of social elements that interweave to give rise to rich commonsense knowledge of how relationships evolve with respect to social interactions. We establish baseline performance using language models, whose accuracies are significantly lower than human performance. The results demonstrate that models need to look beyond syntactic and semantic signals to comprehend complex human relationships.
Author obfuscation is the task of masking the author of a piece of text, with applications in privacy. Recent advances in deep neural networks have boosted author identification performance, making author obfuscation more challenging. Existing approaches to author obfuscation are largely heuristic. Obfuscation can, however, be thought of as the construction of adversarial examples to attack author identification, suggesting that the deep learning architectures used for adversarial attacks could have application here. Current architectures are proposed to construct adversarial examples against classification-based models, which in author identification would exclude the high-performing similarity-based models employed when facing a large number of authorial classes. In this paper, we propose the first deep learning architecture for constructing adversarial examples against similarity-based learners, and explore its application to author obfuscation. We analyse the output in terms of both obfuscation success and language acceptability, compare its performance with some common baselines, and show promising results in finding a balance between safety and soundness of the perturbed texts.