Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Atul Kr. Ojha, A. Seza Doğruöz, Giovanni Da San Martino, Harish Tayyar Madabushi, Ritesh Kumar, Elisa Sartori (Editors)

Anthology ID:: 2023.semeval-1
Month:: July
Year:: 2023
Address:: Toronto, Canada
Venue:: SemEval
SIG:: SIGLEX
Publisher:: Association for Computational Linguistics
URL:: https://aclanthology.org/2023.semeval-1
DOI:
Bib Export formats:: BibTeX

pdf bib abs
KnowComp at SemEval-2023 Task 7: Fine-tuning Pre-trained Language Models for Clinical Trial Entailment Identification
Weiqi Wang | Baixuan Xu | Tianqing Fang | Lirong Zhang | Yangqiu Song

In this paper, we present our system for the textual entailment identification task as a subtask of the SemEval-2023 Task 7: Multi-evidence Natural Language Inference for Clinical Trial Data. The entailment identification task aims to determine whether a medical statement affirms a valid entailment given a clinical trial premise or forms a contradiction with it. Since the task is inherently a text classification task, we propose a system that performs binary classification given a statement and its associated clinical trial. Our proposed system leverages a human-defined prompt to aggregate the information contained in the statement, section name, and clinical trials. Pre-trained language models are then finetuned on the prompted input sentences to learn to discriminate the inference relation between the statement and clinical trial. To validate our system, we conduct extensive experiments with a wide variety of pre-trained language models. Our best system is built on DeBERTa-v3-large, which achieves an F1 score of 0.764 and secures the fifth rank in the official leaderboard.Further analysis indicates that leveraging our designed prompt is effective, and our model suffers from a low recall. Our code and pre-trained models are available at [https://github.com/HKUST-KnowComp/NLI4CT](https://github.com/HKUST-KnowComp/NLI4CT).

pdf bib abs
lasigeBioTM at SemEval-2023 Task 7: Improving Natural Language Inference Baseline Systems with Domain Ontologies
Sofia I. R. Conceição | Diana F. Sousa | Pedro Silvestre | Francisco M Couto

Clinical Trials Reports (CTRs) contain highly valuable health information from which Natural Language Inference (NLI) techniques determine if a given hypothesis can be inferred from a given premise. CTRs are abundant with domain terminology with particular terms that are difficult to understand without prior knowledge. Thus, we proposed to use domain ontologies as a source of external knowledge that could help with the inference process in theSemEval-2023 Task 7: Multi-evidence Natural Language Inference for Clinical Trial Data (NLI4CT). This document describes our participation in subtask 1: Textual Entailment, where Ontologies, NLP techniques, such as tokenization and named-entity recognition, and rule-based approaches are all combined in our approach. We were able to show that inputting annotations from domain ontologies improved the baseline systems.

In SemEval-2023 Task 1, a task of applying Word Sense Disambiguation in an image retrieval system was introduced. To resolve this task, this work proposes three approaches: (1) an unsupervised approach considering similarities between word senses and image captions, (2) a supervised approach using a Siamese neural network, and (3) a self-supervised approach using a Bayesian personalized ranking framework. According to the results, both supervised and self-supervised approaches outperformed the unsupervised approach. They can effectively identify correct images of ambiguous words in the dataset provided in this task.

pdf abs
Lexicools at SemEval-2023 Task 10: Sexism Lexicon Construction via XAI
Pakawat Nakwijit | Mahmoud Samir | Matthew Purver

This paper presents our work on the SemEval-2023 Task 10 Explainable Detection of Online Sexism (EDOS) using lexicon-based models. Our approach consists of three main steps: lexicon construction based on Pointwise Mutual Information (PMI) and Shapley value, lexicon augmentation using an unannotated corpus and Large Language Models (LLMs), and, lastly, lexical incorporation for Bag-of-Word (BoW) logistic regression and fine-tuning LLMs. Our results demonstrate that our Shapley approach effectively produces a high-quality lexicon. We also show that by simply counting the presence of certain words in our lexicons and comparing the count can outperform a BoW logistic regression in task B/C and fine-tuning BERT in task C. In the end, our classifier achieved F1-scores of 53.34\% and 27.31\% on the official blind test sets for tasks B and C, respectively. We, additionally, provide in-depth analysis highlighting model limitation and bias. We also present our attempts to understand the model’s behaviour based on our constructed lexicons. Our code and the resulting lexicons are open-sourced in our GitHub repository https://github.com/SirBadr/SemEval2022-Task10.

pdf abs
Augmenters at SemEval-2023 Task 1: Enhancing CLIP in Handling Compositionality and Ambiguity for Zero-Shot Visual WSD through Prompt Augmentation and Text-To-Image Diffusion
Jie Li | Yow-Ting Shiue | Yong-Siang Shih | Jonas Geiping

This paper describes our zero-shot approachesfor the Visual Word Sense Disambiguation(VWSD) Task in English. Our preliminarystudy shows that the simple approach of match-ing candidate images with the phrase usingCLIP suffers from the many-to-many natureof image-text pairs. We find that the CLIP textencoder may have limited abilities in captur-ing the compositionality in natural language. Conversely, the descriptive focus of the phrasevaries from instance to instance. We addressthese issues in our two systems, Augment-CLIPand Stable Diffusion Sampling (SD Sampling).Augment-CLIP augments the text prompt bygenerating sentences that contain the contextphrase with the help of large language mod-els (LLMs). We further explore CLIP modelsin other languages, as the an ambiguous wordmay be translated into an unambiguous one inthe other language. SD Sampling uses text-to-image Stable Diffusion to generate multipleimages from the given phrase, increasing thelikelihood that a subset of images match theone that paired with the text.

We present the findings of SemEval-2023 Task 12, a shared task on sentiment analysis for low-resource African languages using Twitter dataset. The task featured three subtasks; subtask A is monolingual sentiment classification with 12 tracks which are all monolingual languages, subtask B is multilingual sentiment classification using the tracks in subtask A and subtask C is a zero-shot sentiment classification. We present the results and findings of subtask A, subtask B and subtask C. We also release the code on github. Our goal is to leverage low-resource tweet data using pre-trained Afro-xlmr-large, AfriBERTa-Large, Bert-base-arabic-camelbert-da-sentiment (Arabic-camelbert), Multilingual-BERT (mBERT) and BERT models for sentiment analysis of 14 African languages. The datasets for these subtasks consists of a gold standard multi-class labeled Twitter datasets from these languages. Our results demonstrate that Afro-xlmr-large model performed better compared to the other models in most of the languages datasets. Similarly, Nigerian languages: Hausa, Igbo, and Yoruba achieved better performance compared to other languages and this can be attributed to the higher volume of data present in the languages.

pdf abs
BERTastic at SemEval-2023 Task 3: Fine-Tuning Pretrained Multilingual Transformers Does Order Matter?
Tarek Mahmoud | Preslav Nakov

The naive approach for fine-tuning pretrained deep learning models on downstream tasks involves feeding them mini-batches of randomly sampled data. In this paper, we propose a more elaborate method for fine-tuning Pretrained Multilingual Transformers (PMTs) on multilingual data. Inspired by the success of curriculum learning approaches, we investigate the significance of fine-tuning PMTs on multilingual data in a sequential fashion language by language. Unlike the curriculum learning paradigm where the model is presented with increasingly complex examples, we do not adopt a notion of “easy” and “hard” samples. Instead, our experiments draw insight from psychological findings on how the human brain processes new information and the persistence of newly learned concepts. We perform our experiments on a challenging news-framing dataset that contains texts in six languages. Our proposed method outperforms the naïve approach by achieving improvements of 2.57\% in terms of F1 score. Even when we supplement the naïve approach with recency fine-tuning, we still achieve an improvement of 1.34\% with a 3.63\%$ convergence speed-up. Moreover, we are the first to observe an interesting pattern in which deep learning models exhibit a human-like primacy-recency effect.

pdf abs
Brooke-English at SemEval-2023 Task 5: Clickbait Spoiling
Shirui Tang

The task of clickbait spoiling is: generating a short text that satisfies the curiosity induced by a clickbait post. Clickbait links to a web page and advertises its contents by arousing curiosity instead of providing an informative summary. Previous studies on clickbait spoiling has shown the approach that classifing the type of spoilers is needed, then generating the appropriate spoilers is more effective on the Webis Clickbait Spoiling Corpus 2022 dataset. Our contribution focused on study of the three classes (phrase, passage and multi) and finding appropriate models to generate spoilers foreach class. Results were analysed in each type of spoilers, revealed some reasons of having diversed results in different spoiler types. “passage” type spoiler was identified as the most difficult and the most valuable type of spoiler.

pdf abs
Sea_and_Wine at SemEval-2023 Task 9: A Regression Model with Data Augmentation for Multilingual Intimacy Analysis
Yuxi Chen | Yu Chang | Yanqing Tao | Yanru Zhang

In Task 9, we are required to analyze the textual intimacy of tweets in 10 languages. We fine-tune XLM-RoBERTa (XLM-R) pre-trained model to adapt to this multilingual regression task. After tentative experiments, severe class imbalance is observed in the official released dataset, which may compromise the convergence and weaken the model effect. To tackle such challenge, we take measures in two aspects. On the one hand, we implement data augmentation through machine translation to enlarge the scale of classes with fewer samples. On the other hand, we introduce focal mean square error (MSE) loss to emphasize the contributions of hard samples to total loss, thus further mitigating the impact of class imbalance on model effect. Extensive experiments demonstrate remarkable effectiveness of our strategies, and our model achieves high performance on the Pearson’s correlation coefficient (CC) almost above 0.85 on validation dataset.

pdf abs
MarsEclipse at SemEval-2023 Task 3: Multi-lingual and Multi-label Framing Detection with Contrastive Learning
Qisheng Liao | Meiting Lai | Preslav Nakov

This paper describes our system for SemEval-2023 Task 3 Subtask 2 on Framing Detection. We used a multi-label contrastive loss for fine-tuning large pre-trained language models in a multi-lingual setting, achieving very competitive results: our system was ranked first on the official test set and on the official shared task leaderboard for five of the six languages for which we had training data and for which we could perform fine-tuning. Here, we describe our experimental setup, as well as various ablation studies. The code of our system is available at https://github.com/QishengL/SemEval2023.

pdf abs
Mr-Fosdick at SemEval-2023 Task 5: Comparing Dataset Expansion Techniques for Non-Transformer and Transformer Models: Improving Model Performance through Data Augmentation
Christian Falkenberg | Erik Schönwälder | Tom Rietzke | Chris-Andris Görner | Robert Walther | Julius Gonsior | Anja Reusch

In supervised learning, a significant amount of data is essential. To achieve this, we generated and evaluated datasets based on a provided dataset using transformer and non-transformer models. By utilizing these generated datasets during the training of new models, we attain a higher balanced accuracy during validation compared to using only the original dataset.

pdf abs
SafeWebUH at SemEval-2023 Task 11: Learning Annotator Disagreement in Derogatory Text: Comparison of Direct Training vs Aggregation
Sadat Shahriar | Thamar Solorio

Subjectivity and difference of opinion are key social phenomena, and it is crucial to take these into account in the annotation and detection process of derogatory textual content. In this paper, we use four datasets provided by SemEval-2023 Task 11 and fine-tune a BERT model to capture the disagreement in the annotation. We find individual annotator modeling and aggregation lowers the Cross-Entropy score by an average of 0.21, compared to the direct training on the soft labels. Our findings further demonstrate that annotator metadata contributes to the average 0.029 reduction in the Cross-Entropy score.

Our team focuses on the multimodal domain of images and texts, we propose a model that can learn the matching relationship between text-image pairs by contrastive learning. More specifically, We train the model from the labeled data provided by the official organizer, after pre-training, texts are used to reference learned visual concepts enabling visual word sense disambiguation tasks. In addition, the top results our teams get have been released showing the effectiveness of our solution.

pdf abs
MELODI at SemEval-2023 Task 3: In-domain Pre-training for Low-resource Classification of News Articles
Nicolas Devatine | Philippe Muller | Chloé Braud

This paper describes our approach to Subtask 1 “News Genre Categorization” of SemEval-2023 Task 3 “Detecting the Category, the Framing, and the Persuasion Techniques in Online News in a Multi-lingual Setup”, which aims to determine whether a given news article is an opinion piece, an objective report, or satirical. We fine-tuned the domain-specific language model POLITICS, which was pre-trained on a large-scale dataset of more than 3.6M English political news articles following ideology-driven pre-training objectives. In order to use it in the multilingual setup of the task, we added as a pre-processing step the translation of all documents into English. Our system ranked among the top systems overall in most language, and ranked 1st on the English dataset.

This paper describes our system for SemEval-2023 Task 2 Multilingual Complex Named EntityRecognition (MultiCoNER II). Our teamSamsung Research China - Beijing proposesan AL-R (Adjustable Loss RoBERTa) model toboost the performance of recognizing short andcomplex entities with the challenges of longtaildata distribution, out of knowledge base andnoise scenarios. We first employ an adjustabledice loss optimization objective to overcomethe issue of long-tail data distribution, which isalso proved to be noise-robusted, especially incombatting the issue of fine-grained label confusing. Besides, we develop our own knowledgeenhancement tool to provide related contextsfor the short context setting and addressthe issue of out of knowledge base. Experimentshave verified the validation of our approaches.

pdf abs
NLP-LISAC at SemEval-2023 Task 9: Multilingual Tweet Intimacy Analysis via a Transformer-based Approach and Data Augmentation
Abdessamad Benlahbib | Hamza Alami | Achraf Boumhidi | Omar Benslimane

This paper presents our system and findings for SemEval 2023 Task 9 Tweet Intimacy Analysis. The main objective of this task was to predict the intimacy of tweets in 10 languages. Our submitted model (ranked 28/45) consists of a transformer-based approach with data augmentation via machine translation.

pdf abs
Bf3R at SemEval-2023 Task 7: a text similarity model for textual entailment and evidence retrieval in clinical trials and animal studies
Mariana Neves

We describe our participation on the Multi-evidence Natural Language Inference for Clinical Trial Data (NLI4CT) of SemEval’23. The organizers provided a collection of clinical trials as training data and a set of statements, which can be related to either a single trial or to a comparison of two trials. The task consisted of two sub-tasks: (i) textual entailment (Task 1) for predicting whether the statement is supported (Entailment) or not (Contradiction) by the corresponding trial(s); and (ii) evidence retrieval (Task 2) for selecting the evidences (sentences in the trials) that support the decision made for Task 1. We built a model based on a sentence-based BERT similarity model which was pre-trained on ClinicalBERT embeddings. Our best results on the official test sets were f-scores of 0.64 and 0.67 for Tasks 1 and 2, respectively.

pdf abs
University of Hildesheim at SemEval-2023 Task 1: Combining Pre-trained Multimodal and Generative Models for Image Disambiguation
Sebastian Diem | Chan Jong Im | Thomas Mandl

Multimodal ambiguity is a challenge for understanding text and images. Large pre-trained models have reached a high level of quality already. This paper presents an implementation for solving a image disambiguation task relying solely on the knowledge captured in multimodal and language models. Within the task 1 of SemEval 2023 (Visual Word Sense Disambiguation), this approach managed to achieve an MRR of 0.738 using CLIP-Large and the OPT model for generating text. Applying a generative model to create more text given a phrase with an ambiguous word leads to an improvement of our results. The performance gain from a bigger language model is larger than the performance gain from using the lager CLIP model.

pdf abs
LRL_NC at SemEval-2023 Task 4: The Touche23-George-boole Approach for Multi-Label Classification of Human-Values behind Arguments
Kushagri Tandon | Niladri Chatterjee

The task ValueEval aims at assigning a sub- set of possible human value categories under- lying a given argument. Values behind argu- ments are often determinants to evaluate the relevance and importance of decisions in eth- ical sense, thereby making them essential for argument mining. The work presented here proposes two systems for the same. Both sys- tems use RoBERTa to encode sentences in each document. System1 makes use of features ob- tained from training models for two auxiliary tasks, whereas System2 combines RoBERTa with topic modeling to get sentence represen- tation. These features are used by a classifi- cation head to generate predictions. System1 secured the rank 22 in the official task rank- ing, achieving the macro F1-score 0.46 on the main dataset. System2 was not a part of official evaluation. Subsequent experiments achieved highest (among the proposed systems) macro F1-scores of 0.48 (System2), 0.31 (ablation on System1) and 0.33 (ablation on System1) on the main dataset, the Nahj al-Balagha dataset, and the New York Times dataset.

pdf abs
LRL_NC at SemEval-2023 Task 6: Sequential Sentence Classification for Legal Documents Using Topic Modeling Features
Kushagri Tandon | Niladri Chatterjee

Natural Language Processing techniques can be leveraged to process legal proceedings for various downstream applications, such as sum- marization of a given judgement, prediction of the judgement for a given legal case, prece- dent search, among others. These applications will benefit from legal judgement documents already segmented into topically coherent units. The current task, namely, Rhetorical Role Pre- diction, aims at categorising each sentence in the sequence of sentences in a judgement document into different labels. The system proposed in this work combines topic mod- eling and RoBERTa to encode sentences in each document. A BiLSTM layer has been utilised to get contextualised sentence repre- sentations. The Rhetorical Role predictions for each sentence in each document are gen- erated by a final CRF layer of the proposed neuro-computing system. This system secured the rank 12 in the official task ranking, achiev- ing the micro-F1 score 0.7980. The code for the proposed systems has been made available at https://github.com/KushagriT/SemEval23_ LegalEval_TeamLRL_NC

pdf abs
OPI at SemEval-2023 Task 9: A Simple But Effective Approach to Multilingual Tweet Intimacy Analysis
Slawomir Dadas

This paper describes our submission to the SemEval 2023 multilingual tweet intimacy analysis shared task. The goal of the task was to assess the level of intimacy of Twitter posts in ten languages. The proposed approach consists of several steps. First, we perform in-domain pre-training to create a language model adapted to Twitter data. In the next step, we train an ensemble of regression models to expand the training set with pseudo-labeled examples. The extended dataset is used to train the final solution. Our method was ranked first in five out of ten language subtasks, obtaining the highest average score across all languages.

pdf abs
OPI at SemEval-2023 Task 1: Image-Text Embeddings and Multimodal Information Retrieval for Visual Word Sense Disambiguation
Slawomir Dadas

The goal of visual word sense disambiguation is to find the image that best matches the provided description of the word’s meaning. It is a challenging problem, requiring approaches that combine language and image understanding. In this paper, we present our submission to SemEval 2023 visual word sense disambiguation shared task. The proposed system integrates multimodal embeddings, learning to rank methods, and knowledge-based approaches. We build a classifier based on the CLIP model, whose results are enriched with additional information retrieved from Wikipedia and lexical databases. Our solution was ranked third in the multilingual task and won in the Persian track, one of the three language subtasks.

pdf abs
RGAT at SemEval-2023 Task 2: Named Entity Recognition Using Graph Attention Network
Abir Chakraborty

In this paper, we (team RGAT) describe our approach for the SemEval 2023 Task 2: Multilingual Complex Named Entity Recognition (MultiCoNER II). The goal of this task is to locate and classify named entities in unstructured short complex texts in 12 different languages and one multilingual setup. We use the dependency tree of the input query as additional feature in a Graph Attention Network along with the token and part-of-speech features. We also experiment with additional layers like BiLSTM and Transformer in addition to the CRF layer. However, we have not included any external Knowledge base like Wikipedia to enrich our inputs. We evaluated our proposed approach on the English NER dataset that resulted in a clean-subset F1 of 61.29\% and overall F1 of 56.91\%. However, other approaches that used external knowledge base performed significantly better.

pdf abs
eevvgg at SemEval-2023 Task 11: Offensive Language Classification with Rater-based Information
Ewelina Gajewska

A standard majority-based approach to text classification is challenged with an individualised approach in the Semeval-2023 Task 11. Here, disagreements are treated as a useful source of information that could be utilised in the training pipeline. The team proposal makes use of partially disaggregated data and additional information about annotators provided by the organisers to train a BERT-based model for offensive text classification. The approach extends previous studies examining the impact of using raters’ demographic features on classification performance (Hovy, 2015) or training machine learning models on disaggregated data (Davani et al., 2022). The proposed approach was ranked 11 across all 4 datasets, scoring best for cases with a large pool of annotators (6th place in the MD-Agreement dataset) utilising features based on raters’ annotation behaviour.

pdf abs
HULAT at SemEval-2023 Task 9: Data Augmentation for Pre-trained Transformers Applied to Multilingual Tweet Intimacy Analysis
Isabel Segura-Bedmar

This paper describes our participation in SemEval-2023 Task 9, Intimacy Analysis of Multilingual Tweets. We fine-tune some of the most popular transformer models with the training dataset and synthetic data generated by different data augmentation techniques. During the development phase, our best results were obtained by using XLM-T. Data augmentation techniques provide a very slight improvement in the results. Our system ranked in the 27th position out of the 45 participating systems. Despite its modest results, our system shows promising results in languages such as Portuguese, English, and Dutch. All our code is available in the repository https://github.com/isegura/hulat_intimacy.

pdf abs
HULAT at SemEval-2023 Task 10: Data Augmentation for Pre-trained Transformers Applied to the Detection of Sexism in Social Media
Isabel Segura-Bedmar

This paper describes our participation in SemEval-2023 Task 10, whose goal is the detection of sexism in social media. We explore some of the most popular transformer models such as BERT, DistilBERT, RoBERTa, and XLNet. We also study different data augmentation techniques to increase the training dataset. During the development phase, our best results were obtained by using RoBERTa and data augmentation for tasks B and C. However, the use of synthetic data does not improve the results for task C. We participated in the three subtasks. Our approach still has much room for improvement, especially in the two fine-grained classifications. All our code is available in the repository https://github.com/isegura/hulat_edos.

pdf abs
Lauri Ingman at SemEval-2023 Task 4: A Chain Classifier for Identifying Human Values behind Arguments
Spencer Paulissen | Caroline Wendt

Identifying expressions of human values in textual data is a crucial albeit complicated challenge, not least because ethics are highly variable, often implicit, and transcend circumstance. Opinions, arguments, and the like are generally founded upon more than one guiding principle, which are not necessarily independent. As such, little is known about how to classify and predict moral undertones in natural language sequences. Here, we describe and present a solution to ValueEval, our shared contribution to SemEval 2023 Task 4. Our research design focuses on investigating chain classifier architectures with pretrained contextualized embeddings to detect 20 different human values in written arguments. We show that our best model substantially surpasses the classification performance of the baseline method established in prior work. We discuss limitations to our approach and outline promising directions for future work.

pdf abs
NLP-LISAC at SemEval-2023 Task 12: Sentiment Analysis for Tweets expressed in African languages via Transformer-based Models
Abdessamad Benlahbib | Achraf Boumhidi

This paper presents our systems and findings for SemEval-2023 Task 12: AfriSenti-SemEval: Sentiment Analysis for Low-resource African Languages. The main objective of this task was to determine the polarity of a tweet (positive, negative, or neutral). Our submitted models (highest rank is 1 and lowest rank is 21 depending on the target Track) consist of various Transformer-based approaches.

pdf abs
StFX-NLP at SemEval-2023 Task 4: Unsupervised and Supervised Approaches to Detecting Human Values in Arguments
Ethan Heavey | Milton King | James Hughes

In this paper, we discuss our models applied to Task 4: Human Value Detection of SemEval 2023, which incorporated two different embedding techniques to interpret the data. Preliminary experiments were conducted to observe important word types. Subsequently, we explored an XGBoost model, an unsupervised learning model, and two Ensemble learning models were then explored. The best performing model, an ensemble model employing a soft voting technique, secured the 34th spot out of 39 teams, on a class imbalanced dataset. We explored the inclusion of different parts of the provided knowledge resource and found that considering only specific parts assisted our models.

pdf abs
FII SMART at SemEval 2023 Task7: Multi-evidence Natural Language Inference for Clinical Trial Data
Mihai Volosincu | Cosmin Lupu | Diana Trandabat | Daniela Gifu

The “Multi-evidence Natural Language Inference forClinical Trial Data” task at SemEval 2023competition focuses on extracting essentialinformation on clinical trial data, by posing twosubtasks on textual entailment and evidence retrieval. In the context of SemEval, we present a comparisonbetween a method based on the BioBERT model anda CNN model. The task is based on a collection ofbreast cancer Clinical Trial Reports (CTRs),statements, explanations, and labels annotated bydomain expert annotators. We achieved F1 scores of0.69 for determining the inference relation(entailment vs contradiction) between CTR -statement pairs. The implementation of our system ismade available via Github - https://github.com/volosincu/FII_Smart__Semeval2023.

pdf abs
Epicurus at SemEval-2023 Task 4: Improving Prediction of Human Values behind Arguments by Leveraging Their Definitions
Christian Fang | Qixiang Fang | Dong Nguyen

We describe our experiments for SemEval-2023 Task 4 on the identification of human values behind arguments (ValueEval). Because human values are subjective concepts which require precise definitions, we hypothesize that incorporating the definitions of human values (in the form of annotation instructions and validated survey items) during model training can yield better prediction performance. We explore this idea and show that our proposed models perform better than the challenge organizers’ baselines, with improvements in macro F1 scores of up to 18%.

pdf abs
MaChAmp at SemEval-2023 tasks 2, 3, 4, 5, 7, 8, 9, 10, 11, and 12: On the Effectiveness of Intermediate Training on an Uncurated Collection of Datasets.
Rob van der Goot

To improve the ability of language models to handle Natural Language Processing(NLP) tasks and intermediate step of pre-training has recently beenintroduced. In this setup, one takes a pre-trained language model, trains it ona (set of) NLP dataset(s), and then finetunes it for a target task. It isknown that the selection of relevant transfer tasks is important, but recentlysome work has shown substantial performance gains by doing intermediatetraining on a very large set of datasets. Most previous work uses generativelanguage models or only focuses on one or a couple of tasks and uses acarefully curated setup. We compare intermediate training with one or manytasks in a setup where the choice of datasets is more arbitrary; we use allSemEval 2023 text-based tasks. We reach performance improvements for most taskswhen using intermediate training. Gains are higher when doing intermediatetraining on single tasks than all tasks if the right transfer taskis identified. Dataset smoothing and heterogeneous batching did not lead torobust gains in our setup.

pdf abs
UBC-DLNLP at SemEval-2023 Task 12: Impact of Transfer Learning on African Sentiment Analysis
Gagan Bhatia | Ife Adebara | Abdelrahim Elmadany | Muhammad Abdul-mageed

We describe our contribution to the SemEVAl 2023 AfriSenti-SemEval shared task, where we tackle the task of sentiment analysis in 14 different African languages. We develop both monolingual and multilingual models under a full supervised setting (subtasks A and B). We also develop models for the zero-shot setting (subtask C). Our approach involves experimenting with transfer learning using six language models, including further pretraining of some of these models as well as a final finetuning stage. Our best performing models achieve an F1-score of 70.36 on development data and an F1-score of 66.13 on test data. Unsurprisingly, our results demonstrate the effectiveness of transfer learning and finetuning techniques for sentiment analysis across multiple languages. Our approach can be applied to other sentiment analysis tasks in different languages and domains.

pdf abs
PAI at SemEval-2023 Task 4: A General Multi-label Classification System with Class-balanced Loss Function and Ensemble Module
Long Ma | Zeye Sun | Jiawei Jiang | Xuan Li

The Human Value Detection shared task\cite{kiesel:2023} aims to classify whether or not the argument draws on a set of 20 value categories, given a textual argument. This is a difficult task as the discrimination of human values behind arguments is often implicit. Moreover, the number of label categories can be up to 20 and the distribution of data is highly imbalanced. To address these issues, we employ a multi-label classification model and utilize a class-balanced loss function. Our system wins 5 first places, 2 second places, and 6 third places out of 20 categories of the Human Value Detection shared task, and our overall average score of 0.54 also places third. The code is publicly available at \url{https://www.github.com/diqiuzhuanzhuan/semeval2023}.

pdf abs
TüReuth Legal at SemEval-2023 Task 6: Modelling Local and Global Structure of Judgements for Rhetorical Role Prediction
Henrik Manegold | Leander Girrbach

This paper describes our system for SemEval-2023 Task 6: LegalEval: Understanding Legal Texts. We only participate in Sub-Task (A), Predicting Rhetorical Roles. Our final submission achieves 73.35 test set F1 score, ranking 17th of 27 participants. The proposed method combines global and local models of label distributions and transitions between labels. Through our analyses, we show that especially modelling the temporal distribution of labels contributes positively to performance.

pdf abs
nclu_team at SemEval-2023 Task 6: Attention-based Approaches for Large Court Judgement Prediction with Explanation
Nicolay Rusnachenko | Thanet Markchom | Huizhi Liang

Legal documents tend to be large in size. In this paper, we provide an experiment with attention-based approaches complemented by certain document processing techniques for judgment prediction. For the prediction of explanation, we consider this as an extractive text summarization problem based on an output of (1) CNN with attention mechanism and (2) self-attention of language models. Our extensive experiments show that treating document endings at first results in a 2.1% improvement in judgment prediction across all the models. Additional content peeling from non-informative sentences allows an improvement of explanation prediction performance by 4% in the case of attention-based CNN models. The best submissions achieved 8’th and 3’rd ranks on judgment prediction (C1) and prediction with explanation (C2) tasks respectively among 11 participating teams. The results of our experiments are published

pdf abs
TeamUnibo at SemEval-2023 Task 6: A transformer based approach to Rhetorical Roles prediction and NER in Legal Texts
Yuri Noviello | Enrico Pallotta | Flavio Pinzarrone | Giuseppe Tanzi

This study aims to tackle some challenges posed by legal texts in the field of NLP. The LegalEval challenge proposes three tasks, based on Indial Legal documents: Rhetorical Roles Prediction, Legal Named Entity Recognition, and Court Judgement Prediction with Explanation. Our work focuses on the first two tasks. For the first task we present a context-aware approach to enhance sentence information. With the help of this approach, the classification model utilizing InLegalBert as a transformer achieved 81.12% Micro-F1. For the second task we present a NER approach to extract and classify entities like names of petitioner, respondent, court or statute of a given document. The model utilizing XLNet as transformer and a dependency parser on top achieved 87.43% Macro-F1.

pdf abs
UMUTeam at SemEval-2023 Task 12: Ensemble Learning of LLMs applied to Sentiment Analysis for Low-resource African Languages
José Antonio García-Díaz | Camilo Caparros-laiz | Ángela Almela | Gema Alcaráz-Mármol | María José Marín-Pérez | Rafael Valencia-García

These working notes summarize the participation of the UMUTeam in the SemEval 2023 shared task: AfriSenti, focused on Sentiment Analysis in several African languages. Two subtasks are proposed, one in which each language is considered separately and another one in which all languages are merged. Our proposal to solve both subtasks is grounded on the combination of features extracted from several multilingual Large Language Models and a subset of language-independent linguistic features. Our best results are achieved with the African languages less represented in the training set: Xitsonga, a Mozambique dialect, with a weighted f1-score of 54.89\%; Algerian Arabic, with a weighted f1-score of 68.52\%; Swahili, with a weighted f1-score of 60.52\%; and Twi, with a weighted f1-score of 71.14%.

pdf abs
UMUTeam and SINAI at SemEval-2023 Task 9: Multilingual Tweet Intimacy Analysis using Multilingual Large Language Models and Data Augmentation
José Antonio García-Díaz | Ronghao Pan | Salud María Jiménez Zafra | María-Teresa Martn-Valdivia | L. Alfonso Ureña-López | Rafael Valencia-García

This work presents the participation of the UMUTeam and the SINAI research groups in the SemEval-2023 Task 9: Multilingual Tweet Intimacy Analysis. The goal of this task is to predict the intimacy of a set of tweets in 10 languages: English, Spanish, Italian, Portuguese, French, Chinese, Hindi, Arabic, Dutch and Korean, of which, the last 4 are not in the training data. Our approach to address this task is based on data augmentation and the use of three multilingual Large Language Models (multilingual BERT, XLM and mDeBERTA) by ensemble learning. Our team ranked 30th out of 45 participants. Our best results were achieved with two unseen languages: Korean (16th) and Hindi (19th).

pdf abs
Team QUST at SemEval-2023 Task 3: A Comprehensive Study of Monolingual and Multilingual Approaches for Detecting Online News Genre, Framing and Persuasion Techniques
Ye Jiang

This paper describes the participation of team QUST in the SemEval2023 task3. The monolingual models are first evaluated with the under-sampling of the majority classes in the early stage of the task. Then, the pre-trained multilingual model is fine-tuned with a combination of the class weights and the sample weights. Two different fine-tuning strategies, the task-agnostic and the task-dependent, are further investigated. All experiments are conducted under the 10-fold cross-validation, the multilingual approaches are superior to the monolingual ones. The submitted system achieves the second best in Italian and Spanish (zero-shot) in subtask-1.

pdf abs
niceNLP at SemEval-2023 Task 10: Dual Model Alternate Pseudo-labeling Improves Your Predictions
Yu Chang | Yuxi Chen | Yanru Zhang

Sexism is a growing online problem. It harms women who are targeted and makes online spaces inaccessible and unwelcoming. In this paper, we present our approach for Task A of SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS), which aims to perform binary sexism detection on textual content. To solve this task, we fine-tune the pre-trained model based on several popular natural language processing methods to improve the generalization ability in the face of different data. According to the experimental results, the effective combination of multiple methods enables our approach to achieve excellent performance gains.

pdf abs
NCUEE-NLP at SemEval-2023 Task 8: Identifying Medical Causal Claims and Extracting PIO Frames Using the Transformer Models
Lung-Hao Lee | Yuan-Hao Cheng | Jen-Hao Yang | Kao-Yuan Tien

This study describes the model design of the NCUEE-NLP system for the SemEval-2023 Task 8. We use the pre-trained transformer models and fine-tune the task datasets to identify medical causal claims and extract population, intervention, and outcome elements in a Reddit post when a claim is given. Our best system submission for the causal claim identification subtask achieved a F1-score of 70.15%. Our best submission for the PIO frame extraction subtask achieved F1-scores of 37.78% for Population class, 43.58% for Intervention class, and 30.67% for Outcome class, resulting in a macro-averaging F1-score of 37.34%. Our system evaluation results ranked second position among all participating teams.

pdf abs
Zhegu at SemEval-2023 Task 9: Exponential Penalty Mean Squared Loss for Multilingual Tweet Intimacy Analysis
Pan He | Yanru Zhang

We present the system description of our team Zhegu in SemEval-2023 Task 9 Multilingual Tweet Intimacy Analysis. We propose \textbf{EPM} (\textbf{E}xponential \textbf{P}enalty \textbf{M}ean Squared Loss) for the purpose of enhancing the ability of learning difficult samples during the training process. Meanwhile, we also apply several methods (frozen Tuning \& contrastive learning based on Language) on the XLM-R multilingual language model for fine-tuning and model ensemble. The results in our experiments provide strong faithful evidence of the effectiveness of our methods. Eventually, we achieved a Pearson score of 0.567 on the test set.

pdf abs
ABCD Team at SemEval-2023 Task 12: An Ensemble Transformer-based System for African Sentiment Analysis
Dang Thin | Dai Nguyen | Dang Qui | Duong Hao | Ngan Nguyen

This paper describes the system of the ABCD team for three main tasks in the SemEval-2023 Task 12: AfriSenti-SemEval for Low-resource African Languages using Twitter Dataset. We focus on exploring the performance of ensemble architectures based on the soft voting technique and different pre-trained transformer-based language models. The experimental results show that our system has achieved competitive performance in some Tracks in Task A: Monolingual Sentiment Analysis, where we rank the Top 3, Top 2, and Top 4 for the Hause, Igbo and Moroccan languages. Besides, our model achieved competitive results and ranked $14ˆ{th}$ place in Task B (multilingual) setting and $14ˆ{th}$ and $8ˆ{th}$ place in Track 17 and Track 18 of Task C (zero-shot) setting.

pdf abs
RIGA at SemEval-2023 Task 2: NER Enhanced with GPT-3
Eduards Mukans | Guntis Barzdins

The following is a description of the RIGA team’s submissions for the English track of the SemEval-2023 Task 2: Multilingual Complex Named Entity Recognition (MultiCoNER) II. Our approach achieves 17% boost in results by utilizing pre-existing Large-scale Language Models (LLMs), such as GPT-3, to gather additional contexts. We then fine-tune a pre-trained neural network utilizing these contexts. The final step of our approach involves meticulous model and compute resource scaling, which results in improved performance. Our results placed us 12th out of 34 teams in terms of overall ranking and 7th in terms of the noisy subset ranking. The code for our method is available on GitHub (https://github.com/emukans/multiconer2-riga).

pdf abs
SUTNLP at SemEval-2023 Task 4: LG-Transformer for Human Value Detection
Hamed Hematian Hemati | Sayed Hesam Alavian | Hossein Sameti | Hamid Beigy

When we interact with other humans, humanvalues guide us to consider the human element. As we shall see, value analysis in NLP hasbeen applied to personality profiling but not toargument mining. As part of SemEval-2023Shared Task 4, our system paper describes amulti-label classifier for identifying human val-ues. Human value detection requires multi-label classification since each argument maycontain multiple values. In this paper, we pro-pose an architecture called Label Graph Trans-former (LG-Transformer). LG-Transformeris a two-stage pipeline consisting of a trans-former jointly encoding argument and labelsand a graph module encoding and obtainingfurther interactions between labels. Using ad-versarial training, we can boost performanceeven further. Our best method scored 50.00 us-ing F1 score on the test set, which is 7.8 higherthan the best baseline method. Our code ispublicly available on Github.

pdf abs
SUTNLP at SemEval-2023 Task 10: RLAT-Transformer for explainable online sexism detection
Hamed Hematian Hemati | Sayed Hesam Alavian | Hamid Beigy | Hossein Sameti

There is no simple definition of sexism, butit can be described as prejudice, stereotyping,or discrimination, especially against women,based on their gender. In online interactions,sexism is common. One out of ten Americanadults says that they have been harassed be-cause of their gender and have been the targetof sexism, so sexism is a growing issue. TheExplainable Detection of Online Sexism sharedtask in SemEval-2023 aims at building sexismdetection systems for the English language. Inorder to address the problem, we use largelanguage models such as RoBERTa and De-BERTa. In addition, we present Random LayerAdversarial Training (RLAT) for transformers,and show its significant impact on solving allsubtasks. Moreover, we use virtual adversar-ial training and contrastive learning to improveperformance on subtask A. Upon completionof subtask A, B, and C test sets, we obtainedmacro-F1 of 84.45, 67.78, and 52.52, respec-tively outperforming proposed baselines on allsubtasks. Our code is publicly available onGithub.

pdf abs
Witcherses at SemEval-2023 Task 12: Ensemble Learning for African Sentiment Analysis
Monil Gokani | K V Aditya Srivatsa | Radhika Mamidi

This paper describes our system submission for SemEval-2023 Task 12 AfriSenti-SemEval: Sentiment Analysis for African Languages. We propose an XGBoost-based ensemble model trained on emoticon frequency-based features and the predictions of several statistical models such as SVMs, Logistic Regression, Random Forests, and BERT-based pre-trained language models such as AfriBERTa and AfroXLMR. We also report results from additional experiments not in the system. Our system achieves a mixed bag of results, achieving a best rank of 7th in three of the languages - Igbo, Twi, and Yoruba.

pdf abs
JCT at SemEval-2023 Tasks 12 A and 12B: Sentiment Analysis for Tweets Written in Low-resource African Languages using Various Machine Learning and Deep Learning Methods, Resampling, and HyperParameter Tuning
Ron Keinan | Yaakov Hacohen-Kerner

In this paper, we describe our submissions to the SemEval-2023 contest. We tackled subtask 12 - “AfriSenti-SemEval: Sentiment Analysis for Low-resource African Languages using Twitter Dataset”. We developed different models for 12 African languages and a 13th model for a multilingual dataset built from these 12 languages. We applied a wide variety of word and char n-grams based on their tf-idf values, 4 classical machine learning methods, 2 deep learning methods, and 3 oversampling methods. We used 12 sentiment lexicons and applied extensive hyperparameter tuning.

pdf abs
IXA at SemEval-2023 Task 2: Baseline Xlm-Roberta-base Approach
Edgar Andres Santamaria

IXA proposes a Sequence labeling fine-tune approach, which consists of a lightweight few-shot baseline (10e), the system takes advantage of transfer learning from pre-trained Named Entity Recognition and cross-lingual knowledge from the LM checkpoint. This technique obtains a drastic reduction in the effective training costs that works as a perfect baseline, future improvements in the baseline approach could fit: 1) Domain adequation, 2) Data augmentation, and 3) Intermediate task learning.

pdf abs
APatt at SemEval-2023 Task 3: The Sapienza NLP System for Ensemble-based Multilingual Propaganda Detection
Antonio Purificato | Roberto Navigli

In this paper, we present our approach to the task of identification of persuasion techniques in text, which is a subtask of the SemEval-2023 Task 3 on the multilingual detection of genre, framing, and persuasion techniques in online news. The subtask is multi-label at the paragraph level and the inventory considered by the organizers covers 23 persuasion techniques. Our solution is based on an ensemble of a variety of pre-trained language models (PLMs) fine-tuned on the propaganda dataset. We first describe our system, the different experimental setups we considered, and then provide the results on the dev and test sets released by the organizers. The official evaluation shows our solution ranks 1st in English and attains high scores in all the other languages, i.e. French, German, Italian, Polish, and Russian. We also perform an extensive analysis of the data and the annotations to investigate how they can influence the quality of our systems.

pdf abs
Foul at SemEval-2023 Task 12: MARBERT Language model and lexical filtering for sentiments analysis of tweets in Algerian Arabic
Faiza Belbachir

This paper describes the system we designed for our participation to SemEval2023 Task 12 Track 6 about Algerian dialect sentiment analysis. We propose a transformer language model approach combined with a lexicon mixing terms and emojis which is used in a post-processing filtering stage. The Algerian sentiment lexicons was extracted manually from tweets. We report on our experiments on Algerian dialect, where we compare the performance of marbert to the one of arabicbert and camelbert on the training and development datasets of Task 12. We also analyse the contribution of our post processing lexical filtering for sentiment analysis. Our system obtained a F1 score equal to 70%, ranking 9th among 30 participants.

pdf abs
CPIC at SemEval-2023 Task 7: GPT2-Based Model for Multi-evidence Natural Language Inference for Clinical Trial Data
Mingtong Huang | Junxiang Ren | Lang Liu | Ruilin Song | Wenbo Yin

This paper describes our system submitted for SemEval Task 7, Multi-Evidence Natural Language Inference for Clinical Trial Data. The task consists of 2 subtasks. Subtask 1 is to determine the relationships between clinical trial data (CTR) and statements. Subtask 2 is to output a set of supporting facts extracted from the premises with the input of CTR premises and statements. Through experiments, we found that our GPT2-based pre-trained models can obtain good results in Subtask 2. Therefore, we use the GPT2-based pre-trained model to fine-tune Subtask 2. We transform the evidence retrieval task into a binary class task by combining premises and statements as input, and the output is whether the premises and statements match. We obtain a top-5 score in the evaluation phase of Subtask 2.

The objective of this shared task is to gain an understanding of legal texts, and it is beset with difficulties such as the comprehension of lengthy noisy legal documents, domain specificity as well as the scarcity of annotated data. To address these challenges, we propose a system that employs a hierarchical model and integrates domain-adaptive pretraining, data augmentation, and auxiliary-task learning techniques. Moreover, to enhance generalization and robustness, we ensemble the models that utilize these diverse techniques. Our system ranked first on the RR sub-task and in the middle for the other two sub-tasks.

pdf abs
StFX NLP at SemEval-2023 Task 1: Multimodal Encoding-based Methods for Visual Word Sense Disambiguation
Yuchen Wei | Milton King

SemEval-2023’s Task 1, Visual Word Sense Disambiguation, a task about text semantics and visual semantics, selecting an image from a list of candidates, that best exhibits a given target word in a small context. We tried several methods, including the image captioning method and CLIP methods, and submitted our predictions in the competition for this task. This paper describes the methods we used and their performance and provides an analysis and discussion of the performance.

pdf abs
VTCC-NER at SemEval-2023 Task 6: An Ensemble Pre-trained Language Models for Named Entity Recognition
Quang-Minh Tran | Xuan-Dung Doan

We propose an ensemble method that combines several pre-trained language models to enhance entity recognition in legal text. Our approach achieved a 90.9873% F1 score on the private test set, ranking 2nd on the leaderboard for SemEval 2023 Task 6, Subtask B - Legal Named Entities Extraction.

pdf abs
Ginn-Khamov at SemEval-2023 Task 6, Subtask B: Legal Named Entities Extraction for Heterogenous Documents
Michael Ginn | Roman Khamov

This paper describes our submission to SemEval-2023 Task 6, Subtask B, a shared task on performing Named Entity Recognition in legal documents for specific legal entity types. Documents are divided into the preamble and judgement texts, and certain entity types should only be tagged in one of the two text sections. To address this challenge, our team proposes a token classification model that is augmented with information about the document type, which achieves greater performance than the non-augmented system.

pdf abs
Mao-Zedong at SemEval-2023 Task 4: Label Represention Multi-Head Attention Model with Contrastive Learning-Enhanced Nearest Neighbor Mechanism for Multi-Label Text Classification
Che Zhang | Ping’an Liu | Zhenyang Xiao | Haojun Fei

This is our system description paper for ValueEval task. The title is:Mao-Zedong At SemEval-2023 Task 4: Label Represention Multi-Head Attention Model With Contrastive Learning-Enhanced Nearest Neighbor Mechanism For Multi-Label Text Classification,and the author is Che Zhang and Pingan Liu and ZhenyangXiao and HaojunFei. In this paper, we propose a model that combinesthe label-specific attention network with the contrastive learning-enhanced nearest neighbor mechanism.

pdf abs
PCJ at SemEval-2023 Task 10: A Ensemble Model Based on Pre-trained Model for Sexism Detection and Classification in English
Chujun Pu | Xiaobing Zhou

This paper describes the system and the resulting model submitted by our team “PCJ” to the SemEval-2023 Task 10 sub-task A contest. In this task, we need to test the English text content in the posts to determine whether there is sexism, which involves emotional text classification. Our submission system utilizes methods based on RoBERTa, SimCSE-RoBERTa pre-training models, and model ensemble to classify and train on datasets provided by the organizers. In the final assessment, our submission achieved a macro average F1 score of 0.8449, ranking 28th out of 84 teams in Task A.

pdf abs
SRCB at SemEval-2023 Task 1: Prompt Based and Cross-Modal Retrieval Enhanced Visual Word Sense Disambiguation
Xudong Zhang | Tiange Zhen | Jing Zhang | Yujin Wang | Song Liu

The Visual Word Sense Disambiguation (VWSD) shared task aims at selecting the image among candidates that best interprets the semantics of a target word with a short-length phrase for English, Italian, and Farsi. The limited phrase context, which only contains 2-3 words, challenges the model’s understanding ability, and the visual label requires image-text matching performance across different modalities. In this paper, we propose a prompt based and multimodal retrieval enhanced VWSD system, which uses the rich potential knowledge of large-scale pretrained models by prompting and additional text-image information from knowledge bases and open datasets. Under the English situation and given an input phrase, (1) the context retrieval module predicts the correct definition from sense inventory by matching phrase and context through a biencoder architecture. (2) The image retrieval module retrieves the relevant images from an image dataset.(3) The matching module decides that either text or image is used to pair with image labels by a rule-based strategy, then ranks the candidate images according to the similarity score. Our system ranks first in the English track and second in the average of all languages (English, Italian, and Farsi).

pdf abs
JUST-KM at SemEval-2023 Task 7: Multi-evidence Natural Language Inference using Role-based Double Roberta-Large
Kefah Alissa | Malak Abdullah

In recent years, there has been a vast increase in the available clinical data. Variant Deep learning techniques are used to enhance the retrieval and interpretation of these data. This task deployed Natural language inference (NLI) in Clinical Trial Reports (CTRs) to provide individualized care that is supported by evidence. A collection of breast cancer clinical trial records, statements, annotations, and labels from experienced domain experts. NLI presents a chance to advance the widespread understanding and retrieval of medical evidence, leading to significant improvements in connecting the most recent evidence to personalized care. The primary objective is to identify the inference relationship (entailment or contradiction) between pairs of clinical trial records and statements. In this research, we used different transformer-based models, and The proposed model, “Role-based Double Roberta-Large,” achieved the best result on the testing dataset with F1-score equal to 67.0%

pdf abs
LLM-RM at SemEval-2023 Task 2: Multilingual Complex NER Using XLM-RoBERTa
Rahul Mehta | Vasudeva Varma

Named Entity Recognition(NER) is a task ofrecognizing entities at a token level in a sen-tence. This paper focuses on solving NER tasksin a multilingual setting for complex named en-tities. Our team, LLM-RM participated in therecently organized SemEval 2023 task, Task 2:MultiCoNER II,Multilingual Complex NamedEntity Recognition. We approach the problemby leveraging cross-lingual representation pro-vided by fine-tuning XLM-Roberta base modelon datasets of all of the 12 languages provided - Bangla, Chinese, English, Farsi, French,German, Hindi, Italian, Portuguese, Spanish,Swedish and Ukrainian.

pdf abs
teamPN at SemEval-2023 Task 1: Visual Word Sense Disambiguation Using Zero-Shot MultiModal Approach
Nikita Katyal | Pawan Rajpoot | Subhanandh Tamilarasu | Joy Mustafi

Visual Word Sense Disambiguation shared task at SemEval-2023 aims to identify an image corresponding to the intended meaning of a given ambiguous word (with related context) from a set of candidate images. The lack of textual description for the candidate image and the corresponding word’s ambiguity makes it a challenging problem. This paper describes teamPN’s multi-modal and modular approach to solving this in English track of the task. We efficiently used recent multi-modal pre-trained models backed by real-time multi-modal knowledge graphs to augment textual knowledge for the images and select the best matching image accordingly. We outperformed the baseline model by ~5 points and proposed a unique approach that can further work as a framework for other modular and knowledge-backed solutions.

pdf abs
LT at SemEval-2023 Task 1: Effective Zero-Shot Visual Word Sense Disambiguation Approaches using External Knowledge Sources
Florian Schneider | Chris Biemann

The objective of the SemEval-2023 Task 1: Visual Word Sense Disambiguation (VWSD) is to identify the image illustrating the indented meaning of a target word and some minimal additional context. The omnipresence of textual and visual data in the task strongly suggests the utilization of the recent advances in multi-modal machine learning, i.e., pretrained visiolinguistic models (VLMs). Often referred to as foundation models due to their strong performance on many vision-language downstream tasks, these models further demonstrate powerful zero-shot capabilities. In this work, we utilize various pertained VLMs in a zero-shot fashion for multiple approaches using external knowledge sources to enrich the contextual information. Further, we evaluate our methods on the final test data and extensively analyze the suitability of different knowledge sources, the influence of training data, model sizes, multi-linguality, and different textual prompting strategies. Although we are not among the best-performing systems (rank 20 of 56), our experiments described in this work prove competitive results. Moreover, we aim to contribute meaningful insights and propel multi-modal machine learning tasks like VWSD.

pdf abs
Coco at SemEval-2023 Task 10: Explainable Detection of Online Sexism
Kangshuai Guo | Ruipeng Ma | Shichao Luo | Yan Wang

Sexism has become a growing concern on social media platforms as it impacts the health of the internet and can have negative impacts on society. This paper describes the coco system that participated in SemEval-2023 Task 10, Explainable Detection of Online Sexism (EDOS), which aims at sexism detection in various settings of natural language understanding. We develop a novel neural framework for sexism detection and misogyny that can combine text representations obtained using pre-trained language model models such as Bidirectional Encoder Representations from Transformers and using BiLSTM architecture to obtain the local and global semantic information. Further, considering that the EDOS dataset is relatively small and extremely unbalanced, we conducted data augmentation and introduced two datasets in the field of sexism detection. Moreover, we introduced Focal Loss which is a loss function in order to improve the performance of processing imbalanced data classification. Our system achieved an F1 score of 78.95\% on Task A - binary sexism.

pdf abs
Diane Simmons at SemEval-2023 Task 5: Is it possible to make good clickbait spoilers using a Zero-Shot approach? Check it out!
Niels Krog | Manex Agirrezabal

In this paper, we present a possible solution to the SemEval23 shared task of generating spoilers for clickbait headlines. Using a Zero-Shot approach with two different Transformer architectures, BLOOM and RoBERTa, we generate three different types of spoilers: phrase, passage and multi. We found, RoBERTa pretrained for Question-Answering to perform better than BLOOM for causal language modelling, however both architectures proved promising for future attempts at such tasks.

pdf abs
OPI PIB at SemEval-2023 Task 1: A CLIP-based Solution Paired with an Additional Word Context Extension
Małgorzata Grębowiec

This article presents our solution for SemEval-2023 Task 1: Visual Word Sense Disambiguation. The aim of the task was to select the most suitable from a list of ten images for a given word, extended by a small textual context. Our solution comprises two parts. The first focuses on an attempt to further extend the textual context, based on word definitions contained in WordNet and in Open English WordNet. The second focuses on selecting the most suitable image using the CLIP model with previously developed word context and additional information obtained from the BEiT image classification model. Our solution allowed us to achieve a result of 70.84% on the official test dataset for the English language.

pdf abs
NLNDE at SemEval-2023 Task 12: Adaptive Pretraining and Source Language Selection for Low-Resource Multilingual Sentiment Analysis
Mingyang Wang | Heike Adel | Lukas Lange | Jannik Strötgen | Hinrich Schütze

This paper describes our system developed for the SemEval-2023 Task 12 “Sentiment Analysis for Low-resource African Languages using Twitter Dataset”. Sentiment analysis is one of the most widely studied applications in natural language processing. However, most prior work still focuses on a small number of high-resource languages. Building reliable sentiment analysis systems for low-resource languages remains challenging, due to the limited training data in this task. In this work, we propose to leverage language-adaptive and task-adaptive pretraining on African texts and study transfer learning with source language selection on top of an African language-centric pretrained language model. Our key findings are: (1) Adapting the pretrained model to the target language and task using a small yet relevant corpus improves performance remarkably by more than 10 F1 score points. (2) Selecting source languages with positive transfer gains during training can avoid harmful interference from dissimilar languages, leading to better results in multilingual and cross-lingual settings. In the shared task, our system wins 8 out of 15 tracks and, in particular, performs best in the multilingual evaluation.

pdf abs
IUST_NLP at SemEval-2023 Task 10: Explainable Detecting Sexism with Transformers and Task-adaptive Pretraining
Hadiseh Mahmoudi

This paper describes our system on SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS). This work aims to design an automatic system for detecting and classifying sexist content in online spaces. We propose a set of transformer-based pre-trained models with task-adaptive pretraining and ensemble learning. The main contributions of our system include analyzing the performance of different transformer-based pre-trained models and combining these models, as well as providing an efficient method using large amounts of unlabeled data for model adaptive pretraining. We have also explored several other strategies. On the test dataset, our system achieves F1-scores of 83%, 64%, and 47% on subtasks A, B, and C, respectively.

pdf abs
TAM of SCNU at SemEval-2023 Task 1: FCLL: A Fine-grained Contrastive Language-Image Learning Model for Cross-language Visual Word Sense Disambiguation
Qihao Yang | Yong Li | Xuelin Wang | Shunhao Li | Tianyong Hao

Visual Word Sense Disambiguation (WSD), as a fine-grained image-text retrieval task, aims to identify the images that are relevant to ambiguous target words or phrases. However, the difficulties of limited contextual information and cross-linguistic background knowledge in text processing make this task challenging. To alleviate this issue, we propose a Fine-grained Contrastive Language-Image Learning (FCLL) model, which learns fine-grained image-text knowledge by employing a new fine-grained contrastive learning mechanism and enriches contextual information by establishing relationship between concepts and sentences. In addition, a new multimodal-multilingual knowledge base involving ambiguous target words is constructed for visual WSD. Experiment results on the benchmark datasets from SemEval-2023 Task 1 show that our FCLL ranks at the first in overall evaluation with an average H@1 of 72.56\% and an average MRR of 82.22\%. The results demonstrate that FCLL is effective in inference on fine-grained language-vision knowledge. Source codes and the knowledge base are publicly available at https://github.com/CharlesYang030/FCLL.

pdf abs
Sefamerve at SemEval-2023 Task 12: Semantic Evaluation of Rarely Studied Languages
Selman Delil | Birol Kuyumcu

This paper describes our contribution to SemEval-23 Shared Task 12: ArfiSenti. The task consists of several sentiment classification subtasks for rarely studied African languages to predict positive, negative, or neutral classes of a given Twitter dataset. In our system we utilized three different models; FastText, MultiLang Transformers, and Language-Specific Transformers to find the best working model for the classification challenge. We experimented with mentioned models and mostly reached the best prediction scores using the Language Specific Transformers. Our best-submitted result was ranked 3rd among submissions for the Amharic language, obtaining an F1 score of 0.702 behind the second-ranked system.

pdf abs
TeamShakespeare at SemEval-2023 Task 6: Understand Legal Documents with Contextualized Large Language Models
Xin Jin | Yuchen Wang

The growth of pending legal cases in populouscountries, such as India, has become a major is-sue. Developing effective techniques to processand understand legal documents is extremelyuseful in resolving this problem. In this pa-per, we present our systems for SemEval-2023Task 6: understanding legal texts (Modi et al., 2023). Specifically, we first develop the Legal-BERT-HSLN model that considers the com-prehensive context information in both intra-and inter-sentence levels to predict rhetoricalroles (subtask A) and then train a Legal-LUKEmodel, which is legal-contextualized and entity-aware, to recognize legal entities (subtask B).Our evaluations demonstrate that our designedmodels are more accurate than baselines, e.g.,with an up to 15.0% better F1 score in subtaskB. We achieved notable performance in the taskleaderboard, e.g., 0.834 micro F1 score, andranked No.5 out of 27 teams in subtask A.

pdf abs
JUST_ONE at SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS)
Doaa Obeidat | Wala’a Shnaigat | Heba Nammas | Malak Abdullah

The problem of online sexism, which refers to offensive content targeting women based on their gender or the intersection of their gender with one or more additional identity characteristics, such as race or religion, has become a widespread phenomenon on social media. This can include sexist comments and memes. To address this issue, the SemEval-2023 international workshop introduced the “Explainable Detection of Online Sexism Challenge”, which aims to explain the classifications given by AI models for detecting sexism. In this paper, we present the contributions of our team, JUSTONE, to all three sub-tasks of the challenge: subtask A, a binary classification task; subtask B, a four-class classification task; and subtask C, a fine-grained classification task. To accomplish this, we utilized pre-trained language models, specifically BERT and RoBERTa from Hugging Face, and a selective ensemble method in task 10 of the SemEval 2023 competition. As a result, our team achieved the following rankings and scores in different tasks: 19th out of 84 with a Macro-F1 score of 0.8538 in task A, 22nd out of 69 with a Macro-F1 score of 0.6417 in task B, and 14th out of 63 with a Macro-F1 score of 0.4774 in task C.

pdf abs
Adam-Smith at SemEval-2023 Task 4: Discovering Human Values in Arguments with Ensembles of Transformer-based Models
Daniel Schroter | Daryna Dementieva | Georg Groh

This paper presents the best-performing approach alias “Adam Smith” for the SemEval-2023 Task 4: “Identification of Human Values behind Arguments”. The goal of the task was to create systems that automatically identify the values within textual arguments. We train transformer-based models until they reach their loss minimum or f1-score maximum. Ensembling the models by selecting one global decision threshold that maximizes the f1-score leads to the best-performing system in the competition. Ensembling based on stacking with logistic regressions shows the best performance on an additional dataset provided to evaluate the robustness (“Nahj al-Balagha”). Apart from outlining the submitted system, we demonstrate that the use of the large ensemble model is not necessary and that the system size can be significantly reduced.

pdf abs
Andronicus of Rhodes at SemEval-2023 Task 4: Transformer-Based Human Value Detection Using Four Different Neural Network Architectures
Georgios Papadopoulos | Marko Kokol | Maria Dagioglou | Georgios Petasis

This paper presents our participation to the “Human Value Detection shared task (Kiesel et al., 2023), as “Andronicus of Rhodes. We describe the approaches behind each entry in the official evaluation, along with the motivation behind each approach. Our best-performing approach has been based on BERT large, with 4 classification heads, implementing two different classification approaches (with different activation and loss functions), and two different partitioning of the training data, to handle class imbalance. Classification is performed through majority voting. The proposed approach outperforms the BERT baseline, ranking in the upper half of the competition.

pdf abs
FTD at SemEval-2023 Task 3: News Genre and Propaganda Detection by Comparing Mono- and Multilingual Models with Fine-tuning on Additional Data
Mikhail Lepekhin | Serge Sharoff

We report our participation in the SemEval-2023 shared task on propaganda detection and describe our solutions with pre-trained models and their ensembles. For Subtask 1 (News Genre Categorisation), we report the impact of several settings, such as the choice of the classification models (monolingual or multilingual or their ensembles), the choice of the training sets (base or additional sources), the impact of detection certainty in making a classification decision as well as the impact of other hyper-parameters. In particular, we fine-tune models on additional data for other genre classification tasks, such as FTD. We also try adding texts from genre-homogenous corpora, such as Panorama, Babylon Bee for satire and Giganews for for reporting texts. We also make prepared models for Subtasks 2 and 3 with finetuning the corresponding models first for Subtask 1.The code needed to reproduce the experiments is available.

pdf abs
MIND at SemEval-2023 Task 11: From Uncertain Predictions to Subjective Disagreement
Giulia Rizzi | Alessandro Astorino | Daniel Scalena | Paolo Rosso | Elisabetta Fersini

This paper describes the participation of the research laboratory MIND, at the University of Milano-Bicocca, in the SemEval 2023 task related to Learning With Disagreements (Le-Wi-Di). The main goal is to identify the level of agreement/disagreement from a collection of textual datasets with different characteristics in terms of style, language and task. The proposed approach is grounded on the hypothesis that the disagreement between annotators could be grasped by the uncertainty that a model, based on several linguistic characteristics, could have on the prediction of a given gold label.

pdf abs
Sartipi-Sedighin at SemEval-2023 Task 2: Fine-grained Named Entity Recognition with Pre-trained Contextual Language Models and Data Augmentation from Wikipedia
Amir Sartipi | Amirreza Sedighin | Afsaneh Fatemi | Hamidreza Baradaran Kashani

This paper presents the system developed by the Sartipi-Sedighin team for SemEval 2023 Task 2, which is a shared task focused on multilingual complex named entity recognition (NER), or MultiCoNER II. The goal of this task is to identify and classify complex named entities (NEs) in text across multiple languages. To tackle the MultiCoNER II task, we leveraged pre-trained language models (PLMs) fine-tuned for each language included in the dataset. In addition, we also applied a data augmentation technique to increase the amount of training data available to our models. Specifically, we searched for relevant NEs that already existed in the training data within Wikipedia, and we added new instances of these entities to our training corpus. Our team achieved an overall F1 score of 61.25% in the English track and 71.79% in the multilingual track across all 13 tracks of the shared task that we submitted to.

pdf abs
uOttawa at SemEval-2023 Task 6: Deep Learning for Legal Text Understanding
Intisar Almuslim | Sean Stilwell | Surya Kiran Suresh | Diana Inkpen

We describe the methods we used for legal text understanding, specifically Task 6 Legal-Eval at SemEval 2023. The outcomes could assist law practitioners and help automate the working process of judicial systems. The shared task defined three main sub-tasks: sub-task A, Rhetorical Roles Prediction (RR); sub-task B, Legal Named Entities Extraction (L-NER); and sub-task C, Court Judgement Prediction with Explanation (CJPE). Our team addressed all three sub-tasks by exploring various Deep Learning (DL) based models. Overall, our team’s approaches achieved promising results on all three sub-tasks, demonstrating the potential of deep learning-based models in the judicial domain.

pdf abs
UMUTeam at SemEval-2023 Task 10: Fine-grained detection of sexism in English
Ronghao Pan | José Antonio García-Díaz | Salud María Jiménez Zafra | Rafael Valencia-García

In this manuscript, we describe the participation of UMUTeam in the Explainable Detection of Online Sexism shared task proposed at SemEval 2023. This task concerns the precise and explainable detection of sexist content on Gab and Reddit, i.e., developing detailed classifiers that not only identify what is sexist, but also explain why it is sexism. Our participation in the three EDOS subtasks is based on extending new unlabeled sexism data in the Masked Language Model task of a pre-trained model, such as RoBERTa-large to improve its generalization capacity and its performance on classification tasks. Once the model has been pre-trained with the new data, fine-tuning of this model is performed for different specific sexism classification tasks. Our system has achieved excellent results in this competitive task, reaching top 24 (84) in Task A, top 23 (69) in Task B, and top 13 (63) in Task C.

pdf abs
NLP_CHRISTINE at SemEval-2023 Task 10: Utilizing Transformer Contextual Representations and Ensemble Learning for Sexism Detection on Social Media Texts
Christina Christodoulou

The paper describes the SemEval-2023 Task 10: “Explainable Detection of Online Sexism (EDOS)”, which investigates the detection of sexism on two social media sites, Gab and Reddit, by encouraging the development of machine learning models that perform binary and multi-class classification on English texts. The EDOS Task consisted of three hierarchical sub-tasks: binary sexism detection in sub-task A, category of sexism detection in sub-task B and fine-grained vector of sexism detection in sub-task C. My participation in EDOS comprised fine-tuning of different layer representations of Transformer-based pre-trained language models, namely BERT, AlBERT and RoBERTa, and ensemble learning via majority voting of the best performing models. Despite the low rank mainly due to a submission error, the system employed the largest version of the aforementioned Transformer models (BERT-Large, ALBERT-XXLarge-v1, ALBERT-XXLarge-v2, RoBERTa-Large), experimented with their multi-layer structure and aggregated their predictions so as to get the final result. My predictions on the test sets achieved 82.88%, 63.77% and 43.08% Macro-F1 score in sub-tasks A, B and C respectively.

pdf abs
T.M. Scanlon at SemEval-2023 Task 4: Leveraging Pretrained Language Models for Human Value Argument Mining with Contrastive Learning
Milad Molazadeh Oskuee | Mostafa Rahgouy | Hamed Babaei Giglou | Cheryl D Seals

Human values are of great concern to social sciences which refer to when people have different beliefs and priorities of what is generally worth striving for and how to do so. This paper presents an approach for human value argument mining using contrastive learning to leverage the isotropy of language models. We fine-tuned DeBERTa-Large in a multi-label classification fashion and achieved an F1 score of 49% for the task, resulting in a rank of 11. Our proposed model provides a valuable tool for analyzing arguments related to human values and highlights the significance of leveraging the isotropy of large language models for identifying human values.

pdf abs
UMUTeam at SemEval-2023 Task 3: Multilingual transformer-based model for detecting the Genre, the Framing, and the Persuasion Techniques in Online News
Ronghao Pan | José Antonio García-Díaz | Miguel Ángel Rodríguez-García | Rafael Valencia-García

In this manuscript, we describe the participation of the UMUTeam in SemEval-2023 Task 3, a shared task on detecting different aspects of news articles and other web documents, such as document category, framing dimensions, and persuasion technique in a multilingual setup. The task has been organized into three related subtasks, and we have been involved in the first two. Our approach is based on a fine-tuned multilingual transformer-based model that uses the dataset of all languages at once and a sentence transformer model to extract the most relevant chunk of a text for subtasks 1 and 2. The input data was truncated to 200 tokens with 50 overlaps using the sentence-transformer model to obtain the subset of text most related to the articles’ titles. Our system has performed good results in subtask 1 in most languages, and in some cases, such as French and German, we have archived first place in the official leader board. As for task 2, our system has also performed very well in all languages, ranking in all the top 10.

pdf abs
Appeal for Attention at SemEval-2023 Task 3: Data augmentation extension strategies for detection of online news persuasion techniques
Sergiu Amihaesei | Laura Cornei | George Stoica

In this paper, we proposed and explored the impact of four different dataset augmentation andextension strategies that we used for solving the subtask 3 of SemEval-2023 Task 3: multi-label persuasion techniques classification in a multi-lingual context. We consider two types of augmentation methods (one based on a modified version of synonym replacement and one based on translations) and two ways of extending the training dataset (using filtered data generated by GPT-3 and using a dataset from a previous competition). We studied the effects of the aforementioned techniques by using theaugmented and/or extended training dataset to fine-tune a pretrained XLM-RoBERTa-Large model. Using the augmentation methods alone, we managed to obtain 3rd place for English, 13th place for Italian and between the 5th to 9th places for the other 7 languages during the competition.

pdf abs
Chick Adams at SemEval-2023 Task 5: Using RoBERTa and DeBERTa to Extract Post and Document-based Features for Clickbait Spoiling
Ronghao Pan | José Antonio García-Díaz | Franciso García-Sánchez | Rafael Valencia-García

In this manuscript, we describe the participation of the UMUTeam in SemEval-2023 Task 5, namely, Clickbait Spoiling, a shared task on identifying spoiler type (i.e., a phrase or a passage) and generating short texts that satisfy curiosity induced by a clickbait post, i.e. generating spoilers for the clickbait post. Our participation in Task 1 is based on fine-tuning pre-trained models, which consists in taking a pre-trained model and tuning it to fit the spoiler classification task. Our system has obtained excellent results in Task 1: we outperformed all proposed baselines, being within the Top 10 for most measures. Foremost, we reached Top 3 in F1 score in the passage spoiler ranking.

pdf abs
KInITVeraAI at SemEval-2023 Task 3: Simple yet Powerful Multilingual Fine-Tuning for Persuasion Techniques Detection
Timo Hromadka | Timotej Smolen | Tomas Remis | Branislav Pecher | Ivan Srba

This paper presents the best-performing solution to the SemEval 2023 Task 3 on the subtask 3 dedicated to persuasion techniques detection. Due to a high multilingual character of the input data and a large number of 23 predicted labels (causing a lack of labelled data for some language-label combinations), we opted for fine-tuning pre-trained transformer-based language models. Conducting multiple experiments, we find the best configuration, which consists of large multilingual model (XLM-RoBERTa large) trained jointly on all input data, with carefully calibrated confidence thresholds for seen and surprise languages separately. Our final system performed the best on 6 out of 9 languages (including two surprise languages) and achieved highly competitive results on the remaining three languages.

pdf abs
jelenasteam at SemEval-2023 Task 9: Quantification of Intimacy in Multilingual Tweets using Machine Learning Algorithms: A Comparative Study on the MINT Dataset
Jelena Lazić | Sanja Vujnović

Intimacy is one of the fundamental aspects of our social life. It relates to intimate interactions with others, often including verbal self-disclosure. In this paper, we researched machine learning algorithms for quantification of the intimacy in the tweets. A new multilingual textual intimacy dataset named MINT was used. It contains tweets in 10 languages, including English, Spanish, French, Portuguese, Italian, and Chinese in both training and test datasets, and Dutch, Korean, Hindi, and Arabic in test data only. In the first experiment, linear regression models combine with the features and word embedding, and XLM-T deep learning model were compared. In the second experiment, cross-lingual learning between languanges was tested. In the third experiments, data was clustered using K-means. The results indicate that XLM-T pre-trained embedding might be a good choice for an unsupervised learning algorithm for intimacy detection.

pdf abs
UL & UM6P at SemEval-2023 Task 10: Semi-Supervised Multi-task Learning for Explainable Detection of Online Sexism
Salima Lamsiyah | Abdelkader El Mahdaouy | Hamza Alami | Ismail Berrada | Christoph Schommer

This paper introduces our participating system to the Explainable Detection of Online Sexism (EDOS) SemEval-2023 - Task 10: Explainable Detection of Online Sexism. The EDOS shared task covers three hierarchical sub-tasks for sexism detection, coarse-grained and fine-grained categorization. We have investigated both single-task and multi-task learning based on RoBERTa transformer-based language models. For improving the results, we have performed further pre-training of RoBERTa on the provided unlabeled data. Besides, we have employed a small sample of the unlabeled data for semi-supervised learning using the minimum class-confusion loss. Our system has achieved macro F1 scores of 82.25\%, 67.35\%, and 49.8\% on Tasks A, B, and C, respectively.

This paper describes the system developed by the USTC-NELSLIP team for SemEval-2023 Task 2 Multilingual Complex Named Entity Recognition (MultiCoNER II). We propose a method named Statistical Construction and Dual Adaptation of Gazetteer (SCDAG) for Multilingual Complex NER. The method first utilizes a statistics-based approach to construct a gazetteer. Secondly, the representations of gazetteer networks and language models are adapted by minimizing the KL divergence between them at the sentence-level and entity-level. Finally, these two networks are then integrated for supervised named entity recognition (NER) training. The proposed method is applied to several state-of-the-art Transformer-based NER models with a gazetteer built from Wikidata, and shows great generalization ability across them. The final predictions are derived from an ensemble of these trained models. Experimental results and detailed analysis verify the effectiveness of the proposed method. The official results show that our system ranked 1st on one track (Hindi) in this task.

pdf abs
Rudolf Christoph Eucken at SemEval-2023 Task 4: An Ensemble Approach for Identifying Human Values from Arguments
Sougata Saha | Rohini Srihari

The subtle human values we acquire through life experiences govern our thoughts and gets reflected in our speech. It plays an integral part in capturing the essence of our individuality and making it imperative to identify such values in computational systems that mimic human actions. Computational argumentation is a field that deals with the argumentation capabilities of humans and can benefit from identifying such values. Motivated by that, we present an ensemble approach for detecting human values from argument text. Our ensemble comprises three models: (i) An entailment-based model for determining the human values based on their descriptions, (ii) A Roberta-based classifier that predicts the set of human values from an argument. (iii) A Roberta-based classifier to predict a reduced set of human values from an argument. We experiment with different ways of combining the models and report our results. Furthermore, our best combination achieves an overall F1 score of 0.48 on the main test set.

pdf abs
YNU-HPCC at SemEval-2023 Task7: Multi-evidence Natural Language Inference for Clinical Trial Data Based a BioBERT Model
Chao Feng | Jin Wang | Xuejie Zhang

This paper describes the system for the YNU-HPCC team in subtask 1 of the SemEval-2023 Task 7: Multi-evidence Natural Language Inference for Clinical Trial Data (NLI4CT). This task requires judging the textual entailment relationship between the given CTR and the statement annotated by the expert annotator. This system is based on the fine-tuned Bi-directional Encoder Representation from Transformers for Biomedical Text Mining (BioBERT) model with supervised contrastive learning and back translation. Supervised contrastive learning is to enhance the classification, and back translation is to enhance the training data. Our system achieved relatively good results on the competition’s official leaderboard. The code of this paper is available at https://github.com/facanhe/SemEval-2023-Task7.

pdf abs
SRCB at SemEval-2023 Task 2: A System of Complex Named Entity Recognition with External Knowledge
Yuming Zhang | Hongyu Li | Yongwei Zhang | Shanshan Jiang | Bin Dong

The MultiCoNER II shared task aims at detecting semantically ambiguous and complex named entities in short and low-context settings for multiple languages. The lack of context makes the recognition of ambiguous named entities challenging. To alleviate this issue, our team SRCB proposes an external knowledge based system, where we utilize 3 different types of external knowledge retrieved in different ways. Given an original text, our system retrieves the possible labels and the descriptions for each potential entity detected by a mention detection model. And we also retrieve a related document as extra context from Wikipedia for each original text. We concatenate the original text with the external knowledge as the input of NER models. The informative contextual representations with external knowledge significantly improve the NER performance in both Chinese and English tracks. Our system win the 3rd place in the Chinese track and the 6th place in the English track.

This paper describes our system used in the SemEval-2023 Task12: Sentiment Analysis for Low-resource African Languages using Twit- ter Dataset (Muhammad et al., 2023c). The AfriSenti-SemEval Shared Task 12 is based on a collection of Twitter datasets in 14 African languages for sentiment classification. It con- sists of three sub-tasks. Task A is a monolin- gual sentiment classification which covered 12 African languages. Task B is a multilingual sen- timent classification which combined training data from Task A (12 African languages). Task C is a zero-shot sentiment classification. We uti- lized various strategies, including monolingual training, multilingual mixed training, and trans- lation technology, and proposed a weighted vot- ing method that combined the results of differ- ent strategies. Substantially, in the monolingual subtask, our system achieved Top-1 in two lan- guages (Yoruba and Twi) and Top-2 in four languages (Nigerian Pidgin, Algerian Arabic, and Swahili, Multilingual). In the multilingual subtask, Our system achived Top-2 in publish leaderBoard.

pdf abs
IRIT_IRIS_C at SemEval-2023 Task 6: A Multi-level Encoder-based Architecture for Judgement Prediction of Legal Cases and their Explanation
Nishchal Prasad | Mohand Boughanem | Taoufiq Dkaki

This paper describes our system used for sub-task C (1 & 2) in Task 6: LegalEval: Understanding Legal Texts. We propose a three-level encoder-based classification architecture that works by fine-tuning a BERT-based pre-trained encoder, and post-processing the embeddings extracted from its last layers, using transformer encoder layers and RNNs. We run ablation studies on the same and analyze itsperformance. To extract the explanations for the predicted class we develop an explanation extraction algorithm, exploiting the idea of a model’s occlusion sensitivity. We explored some training strategies with a detailed analysis of the dataset. Our system ranks 2nd (macro-F1 metric) for its sub-task C-1 and 7th (ROUGE-2 metric) for sub-task C-2.

pdf abs
Walter Burns at SemEval-2023 Task 5: NLP-CIMAT - Leveraging Model Ensembles for Clickbait Spoiling
Emilio Villa Cueva | Daniel Vallejo Aldana | Fernando Sánchez Vega | Adrián Pastor López Monroy

This paper describes our participation in the Clickbait challenge at SemEval 2023. In this work, we address the Clickbait classification task using transformers models in an ensemble configuration. We tackle the Spoiler Generation task using a two-level ensemble strategy of models trained for extractive QA, and selecting the best K candidates for multi-part spoilers. In the test partitions, our approaches obtained a classification accuracy of 0.716 for classification and a BLEU-4 score of 0.439 for spoiler generation.

pdf abs
Team INF-UFRGS at SemEval-2023 Task 7: Supervised Contrastive Learning for Pair-level Sentence Classification and Evidence Retrieval
Abel Corrêa Dias | Filipe Dias | Higor Moreira | Viviane Moreira | João Luiz Comba

This paper describes the EvidenceSCL system submitted by our team (INF-UFRGS) to SemEval-2023 Task 7: Multi-Evidence Natural Language Inference for Clinical Trial Data (NLI4CT). NLI4CT is divided into two tasks, one for determining the inference relation between a pair of statements in clinical trials and a second for retrieving a set of supporting facts from the premises necessary to justify the label predicted in the first task. Our approach uses pair-level supervised contrastive learning to classify pairs of sentences. We trained EvidenceSCL on two datasets created from NLI4CT and additional data from other NLI datasets. We show that our approach can address both goals of NLI4CT, and although it reached an intermediate position, there is room for improvement in the technique.

pdf abs
AU_NLP at SemEval-2023 Task 10: Explainable Detection of Online Sexism Using Fine-tuned RoBERTa
Amit Das | Nilanjana Raychawdhary | Tathagata Bhattacharya | Gerry Dozier | Cheryl D. Seals

Social media is a concept developed to link people and make the globe smaller. But it has recently developed into a center for sexist memes that target especially women. As a result, there are more events of hostile actions and harassing remarks present online. In this paper, we introduce our system for the task of online sexism detection, a part of SemEval 2023 task 10. We introduce fine-tuned RoBERTa model to address this specific problem. The efficiency of the proposed strategy is demonstrated by the experimental results reported in this research.

pdf abs
KINLP at SemEval-2023 Task 12: Kinyarwanda Tweet Sentiment Analysis
Antoine Nzeyimana

This paper describes the system entered by the author to the SemEval-2023 Task 12: Sentiment analysis for African languages. The system focuses on the Kinyarwanda language and uses a language-specific model. Kinyarwanda morphology is modeled in a two tier transformer architecture and the transformer model is pre-trained on a large text corpus using multi-task masked morphology prediction. The model is deployed on an experimental platform that allows users to experiment with the pre-trained language model fine-tuning without the need to write machine learning code. Our final submission to the shared task achieves second ranking out of 34 teams in the competition, achieving 72.50% weighted F1 score. Our analysis of the evaluation results highlights challenges in achieving high accuracy on the task and identifies areas for improvement.

pdf abs
ACSMKRHR at SemEval-2023 Task 10: Explainable Online Sexism Detection(EDOS)
Rakib Hossain Rifat | Abanti Shruti | Marufa Kamal | Farig Sadeque

People are expressing their opinions online for a lot of years now. Although these opinions and comments provide people an opportunity of expressing their views, there is a lot of hate speech that can be found online. More specifically, sexist comments are very popular affecting and creating a negative impact on a lot of women and girls online. This paper describes the approaches of the SemEval-2023 Task 10 competition for Explainable Online Sexism Detection (EDOS). The task has been divided into 3 subtasks, introducing different classes of sexist comments. We have approached these tasks using the bert-cased and uncased models which are trained on the annotated dataset that has been provided in the competition. Task A provided the best F1 score of 80% on the test set, and tasks B and C provided 58% and 40% respectively.

pdf abs
YNU-HPCC at SemEval-2023 Task 9: Pretrained Language Model for Multilingual Tweet Intimacy Analysis
Qisheng Cai | Jin Wang | Xuejie Zhang

This paper describes our fine-tuned pretrained language model for task 9 (Multilingual Tweet Intimacy Analysis, MTIA) of the SemEval 2023 competition. MTIA aims to quantitatively analyze tweets in 6 languages for intimacy, giving a score from 1 to 5. The challenge of MTIA is in semantically extracting information from code-mixed texts. To alleviate this difficulty, we suggested a solution that combines attention and memory mechanisms. The preprocessed tweets are input to the XLM-T layer to get sentence embeddings and subsequently to the bidirectional GRU layer to obtain intimacy ratings. Experimental results show an improvement in the overall performance of our model in both seen and unseen languages.

pdf abs
JCT_DM at SemEval-2023 Task 10: Detection of Online Sexism: from Classical Models to Transformers
Efrat Luzzon | Chaya Liebeskind

This paper presents the experimentation of systems for detecting online sexism relying on classical models, deep learning models, and transformer-based models. The systems aim to provide a comprehensive approach to handling the intricacies of online language, including slang and neologisms. The dataset consists of labeled and unlabeled data from Gab and Reddit, which allows for the development of unsupervised or semi-supervised models. The system utilizes TF-IDF with classical models, bidirectional models with embedding, and pre-trained transformer models. The paper discusses the experimental setup and results, demonstrating the effectiveness of the system in detecting online sexism.

The MultiCoNER II task aims to detect complex, ambiguous, and fine-grained named entities in low-context situations and noisy scenarios like the presence of spelling mistakes and typos for multiple languages. The task poses significant challenges due to the scarcity of contextual information, the high granularity of the entities(up to 33 classes), and the interference of noisy data. To address these issues, our team PAI proposes a universal Named Entity Recognition (NER) system that integrates external entity information to improve performance. Specifically, our system retrieves entities with properties from the knowledge base (i.e. Wikipedia) for a given text, then concatenates entity information with the input sentence and feeds it into Transformer-based models. Finally, our system wins 2 first places, 4 second places, and 1 third place out of 13 tracks. The code is publicly available at https://github.com/diqiuzhuanzhuan/semeval-2023.

pdf abs
NITS_Legal at SemEval-2023 Task 6: Rhetorical Roles Prediction of Indian Legal Documents via Sentence Sequence Labeling Approach
Deepali Jain | Malaya Dutta Borah | Anupam Biswas

Legal documents are notorious for their complexity and domain-specific language, making them challenging for legal practitioners as well as non-experts to comprehend. To address this issue, the LegalEval 2023 track proposed several shared tasks, including the task of Rhetorical Roles Prediction (Task A). We participated as NITS_Legal team in Task A and conducted exploratory experiments to improve our understanding of the task. Our results suggest that sequence context is crucial in performing rhetorical roles prediction. Given the lengthy nature of legal documents, we propose a BiLSTM-based sentence sequence labeling approach that uses a local context-incorporated dataset created from the original dataset. To better represent the sentences during training, we extract legal domain-specific sentence embeddings from a Legal BERT model. Our experimental findings emphasize the importance of considering local context instead of treating each sentence independently to achieve better performance in this task. Our approach has the potential to improve the accessibility and usability of legal documents.

pdf abs
I2C-Huelva at SemEval-2023 Task 9: Analysis of Intimacy in Multilingual Tweets Using Resampling Methods and Transformers
Abel Pichardo Estevez | Jacinto Mata Vázquez | Victoria Pachón Álvarez | Nordin El Balima Cordero

Nowadays, intimacy is a fundamental aspect of how we relate to other people in social settings. The most frequent way in which we can determine a high level of intimacy is in the use of certain emoticons, curse words, verbs, etc. This paper presents the approach developed to solve SemEval 2023 task 9: Multiligual Tweet Intimacy Analysis. To address the task, a transfer learning approach was conducted by fine tuning various pre-trained languagemodels. Since the dataset supplied by the organizer was highly imbalanced, our main strategy to obtain high prediction values was the implementation of different oversampling and undersampling techniques on the training set. Our final submission achieved an overall Pearson’s r of 0.497.

pdf abs
I2C-Huelva at SemEval-2023 Task 10: Ensembling Transformers Models for the Detection of Online Sexism
Lavinia Felicia Fudulu | Alberto Rodriguez Tenorio | Victoria Pachón Álvarez | Jacinto Mata Vázquez

This work details our approach for addressing Tasks A and B of the Semeval 2023 Task 10: Explainable Detection of Online Sexism (EDOS). For Task A a simple ensemble based of majority vote system was presented. To build our proposal, first a review of transformers was carried out and the 3 best performing models were selected to be part of the ensemble. Next, for these models, the best hyperpameters were searched using a reduced data set. Finally, we trained these models using more data. During the development phase, our ensemble system achieved an f1-score of 0.8403. For task B, we developed a model based on the deBERTa transformer, utilizing the hyperparameters identified for task A. During the development phase, our proposed model attained an f1-score of 0.6467. Overall, our methodology demonstrates an effective approach to the tasks, leveraging advanced machine learning techniques and hyperparameters searches to achieve high performance in detecting and classifying instances of sexism in online text.

This paper describes our system used in the SemEval-2023 Task 9 Multilingual Tweet Intimacy Analysis. There are two key challenges in this task: the complexity of multilingual and zero-shot cross-lingual learning, and the difficulty of semantic mining of tweet intimacy. To solve the above problems, our system extracts contextual representations from the pretrained language models, XLM-T, and employs various optimization methods, including adversarial training, data augmentation, ordinal regression loss and special training strategy. Our system ranked 14th out of 54 participating teams on the leaderboard and ranked 10th on predicting languages not in the training data. Our code is available on Github.

pdf abs
NCUEE-NLP at SemEval-2023 Task 7: Ensemble Biomedical LinkBERT Transformers in Multi-evidence Natural Language Inference for Clinical Trial Data
Chao-Yi Chen | Kao-Yuan Tien | Yuan-Hao Cheng | Lung-Hao Lee

This study describes the model design of the NCUEE-NLP system for the SemEval-2023 NLI4CT task that focuses on multi-evidence natural language inference for clinical trial data. We use the LinkBERT transformer in the biomedical domain (denoted as BioLinkBERT) as our main system architecture. First, a set of sentences in clinical trial reports is extracted as evidence for premise-statement inference. This identified evidence is then used to determine the inference relation (i.e., entailment or contradiction). Finally, a soft voting ensemble mechanism is applied to enhance the system performance. For Subtask 1 on textual entailment, our best submission had an F1-score of 0.7091, ranking sixth among all 30 participating teams. For Subtask 2 on evidence retrieval, our best result obtained an F1-score of 0.7940, ranking ninth of 19 submissions.

pdf abs
Tsingriver at SemEval-2023 Task 10: Labeled Data Augmentation in Consistency Training
Yehui Xu | Haiyan Ding

Semi-supervised learning has promising performance in deep learning, one of the approaches is consistency training on a large amount of unlabeled data to constrain model predictions to be invariant to input noise. However, The degree of correlation between unlabeled data and task objective directly affects model prediction performance. This paper describes our system designed for SemEval-2023 Task 10: Explainable Detection of Online Sexism. We utilize a consistency training framework and data augmentation as the main strategy to train a model. The score obtained by our method is 0.8180 in subtask A, ranking 57 in all the teams.

pdf abs
UnedMediaBiasTeam @ SemEval-2023 Task 3: Can We Detect Persuasive Techniques Transferring Knowledge From Media Bias Detection?
Francisco-Javier Rodrigo-Ginés | Laura Plaza | Jorge Carrillo-de-Albornoz

How similar is the detection of media bias to the detection of persuasive techniques? We have explored how transferring knowledge from one task to the other may help to improve the performance. This paper presents the systems developed for participating in the SemEval-2023 Task 3: Detecting the Genre, the Framing, and the Persuasion Techniques in Online News in a Multi-lingual Setup. We have participated in both the subtask 1: News Genre Categorisation, and the subtask 3: Persuasion Techniques Detection. Our solutions are based on two-stage fine-tuned multilingual models. We evaluated our approach on the 9 languages provided in the task. Our results show that the use of transfer learning from media bias detection to persuasion techniques detection is beneficial for the subtask of detecting the genre (macro F1-score of 0.523 in the English test set) as it improves previous results, but not for the detection of persuasive techniques (micro F1-score of 0.24 in the English test set).

pdf abs
NL4IA at SemEval-2023 Task 3: A Comparison of Sequence Classification and Token Classification to Detect Persuasive Techniques
Albert Pritzkau

The following system description presents our approach to the detection of persuasion techniques in online news. The given task has been framed as a multi-label classification problem. In a multi-label classification problem, each input chunkin this case paragraphis assigned one of several class labels. Span level annotations were also provided. In order to assign class labels to the given documents, we opted for RoBERTa (A Robustly Optimized BERT Pretraining Approach) for both approachessequence and token classification. Starting off with a pre-trained model for language representation, we fine-tuned this model on the given classification task with the provided annotated data in supervised training steps.

pdf abs
IITD at SemEval-2023 Task 2: A Multi-Stage Information Retrieval Approach for Fine-Grained Named Entity Recognition
Shivani Choudhary | Niladri Chatterjee | Subir Saha

MultiCoNER-II is a fine-grained Named Entity Recognition (NER) task that aims to identify ambiguous and complex named entities in multiple languages, with a small amount of contextual information available. To address this task, we propose a multi-stage information retrieval (IR) pipeline that improves the performance of language models for fine-grained NER. Our approach involves leveraging a combination of a BM25-based IR model and a language model to retrieve relevant passages from a corpus. These passages are then used to train a model that utilizes a weighted average of losses. The prediction is generated by a decoder stack that includes a projection layer and conditional random field. To demonstrate the effectiveness of our approach, we participated in the English track of the MultiCoNER-II competition. Our approach yielded promising results, which we validated through detailed analysis.

This paper summarizes the participation of the L3i laboratory of the University of La Rochelle in the SemEval-2023 Task 2, Multilingual Complex Named Entity Recognition (MultiCoNER II). Similar to MultiCoNER I, the task seeks to develop methods to detect semantic ambiguous and complex entities in short and low-context settings. However, MultiCoNER II adds a fine-grained entity taxonomy with over 30 entity types and corrupted data on the test partitions. We approach these complications following prompt-based learning as (1) a ranking problem using a seq2seq framework, and (2) an extractive question-answering task. Our findings show that even if prompting techniques have a similar recall to fine-tuned hierarchical language model-based encoder methods, precision tends to be more affected.

pdf abs
CNLP-NITS at SemEval-2023 Task 10: Online sexism prediction, PREDHATE!
Advaitha Vetagiri | Prottay Adhikary | Partha Pakray | Amitava Das

Online sexism is a rising issue that threatens women’s safety, fosters hostile situations, and upholds social inequities. We describe a task SemEval-2023 Task 10 for creating English-language models that can precisely identify and categorize sexist content on internet forums and social platforms like Gab and Reddit as well to provide an explainability in order to address this problem. The problem is divided into three hierarchically organized subtasks: binary sexism detection, sexism by category, and sexism by fine-grained vector. The dataset consists of 20,000 labelled entries. For Task A, pertained models like Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM), which is called CNN-BiLSTM and Generative Pretrained Transformer 2 (GPT-2) models were used, as well as the GPT-2 model for Task B and C, and have provided experimental configurations. According to our findings, the GPT-2 model performs better than the CNN-BiLSTM model for Task A, while GPT-2 is highly accurate for Tasks B and C on the training, validation and testing splits of the training data provided in the task. Our proposed models allow researchers to create more precise and understandable models for identifying and categorizing sexist content in online forums, thereby empowering users and moderators.

This paper presents our solution, garNER, to the SemEval-2023 MultiConer task. We propose a knowledge augmentation approach by directly querying entities from the Wikipedia API and appending the summaries of the entities to the input sentence. These entities are either retrieved from the labeled training set (Gold Entity) or from off-the-shelf entity taggers (Entity Extractor). Ensemble methods are then applied across multiple models to get the final prediction. Our analysis shows that the added contexts are beneficial only when such contexts are relevant to the target-named entities, but detrimental when the contexts are irrelevant.

pdf abs
D2KLab at SemEval-2023 Task 2: Leveraging T-NER to Develop a Fine-Tuned Multilingual Model for Complex Named Entity Recognition
Thibault Ehrhart | Julien Plu | Raphael Troncy

This paper presents D2KLab’s system used for the shared task of “Multilingual Complex Named Entity Recognition (MultiCoNER II)”, as part of SemEval 2023 Task 2. The system relies on a fine-tuned transformer based language model for extracting named entities. In addition to the architecture of the system, we discuss our results and observations.

pdf abs
LTRC at SemEval-2023 Task 6: Experiments with Ensemble Embeddings
Pavan Baswani | Hiranmai Sri Adibhatla | Manish Shrivastava

In this paper, we present our team’s involvement in Task 6: LegalEval: Understanding Legal Texts. The task comprised three subtasks, and we focus on subtask A: Rhetorical Roles prediction. Our approach included experimenting with pre-trained embeddings and refining them with statistical and neural classifiers. We provide a thorough examination ofour experiments, solutions, and analysis, culminating in our best-performing model and current progress. We achieved a micro F1 score of 0.6133 on the test data using fine-tuned LegalBERT embeddings.

pdf abs
TeamAmpa at SemEval-2023 Task 3: Exploring Multilabel and Multilingual RoBERTa Models for Persuasion and Framing Detection
Amalie Pauli | Rafael Sarabia | Leon Derczynski | Ira Assent

This paper describes our submission to theSemEval 2023 Task 3 on two subtasks: detectingpersuasion techniques and framing. Bothsubtasks are multi-label classification problems. We present a set of experiments, exploring howto get robust performance across languages usingpre-trained RoBERTa models. We test differentoversampling strategies, a strategy ofadding textual features from predictions obtainedwith related models, and present bothinconclusive and negative results. We achievea robust ranking across languages and subtaskswith our best ranking being nr. 1 for Subtask 3on Spanish.

pdf abs
UM6P at SemEval-2023 Task 3: News genre classification based on transformers, graph convolution networks and number of sentences
Hamza Alami | Abdessamad Benlahbib | Abdelkader El Mahdaouy | Ismail Berrada

This paper presents our proposed method for english documents genre classification in the context of SemEval 2023 task 3, subtask 1. Our method use ensemble technique to combine four distinct models predictions: Longformer, RoBERTa, GCN, and a sentences number-based model. Each model is optimized on simple objectives and easy to grasp. We provide snippets of code that define each model to make the reading experience better. Our method ranked 12th in documents genre classification for english texts.

pdf abs
Viettel-AI at SemEval-2023 Task 6: Legal Document Understanding with Longformer for Court Judgment Prediction with Explanation
Thanh Dat Hoang | Chi Minh Bui | Nam Bui

Court Judgement Prediction with Explanation (CJPE) is a task in the field of legal analysis and evaluation, which involves predicting the outcome of a court case based on the available legal text and providing a detailed explanation of the prediction. This is an important task in the legal system as it can aid in decision-making and improve the efficiency of the court process. In this paper, we present a new approach to understanding legal texts, which are normally long documents, based on data-oriented methods. Specifically, we first try to exploit the characteristic of data to understand the legal texts. The output is then used to train the model using the Longformer architecture. Regarding the experiment, the proposed method is evaluated on the sub-task CJPE of the SemEval-2023 Task 6. Accordingly, our method achieves top 1 and top 2 on the classification task and explanation task, respectively. Furthermore, we present several open research issues for further investigations in order to improve the performance in this research field.

pdf abs
GunadarmaXBRIN at SemEval-2023 Task 12: Utilization of SVM and AfriBERTa for Monolingual, Multilingual, and Zero-shot Sentiment Analysis in African Languages
Novitasari Arlim | Slamet Riyanto | Rodiah Rodiah | Al Hafiz Akbar Maulana Siagian

This paper describes our participation in Task 12: AfriSenti-SemEval 2023, i.e., track 12 of subtask A, track 16 of subtask B, and track 18 of subtask C. To deal with these three tracks, we utilize Support Vector Machine (SVM) + One vs Rest, SVM + One vs Rest with SMOTE, and AfriBERTa-large models. In particular, our SVM + One vs Rest with SMOTE model could obtain the highest weighted F1-Score for tracks 16 and 18 in the evaluation phase, that is, 65.14% and 33.49%, respectively. Meanwhile, our SVM + One vs Rest model could perform better than other models for track 12 in the evaluation phase.

pdf abs
MEERQAT-IRIT at SemEval-2023 Task 2: Leveraging Contextualized Tag Descriptors for Multilingual Named Entity Recognition
Jesus Lovon-Melgarejo | Jose G. Moreno | Romaric Besançon | Olivier Ferret | Lynda Lechani

This paper describes the system we submitted to the SemEval 2023 Task 2 Multilingual Complex Named Entity Recognition (MultiCoNER II) in four monolingual tracks (English, Spanish, French, and Portuguese). Considering the low context setting and the fine-grained taxonomy presented in this task, we propose a system that leverages the language model representations using hand-crafted tag descriptors. We explored how integrating the contextualized representations of tag descriptors with a language model can help improve the model performance for this task. We performed our evaluations on the development and test sets used in the task for the Practice Phase and the Evaluation Phase respectively.

This paper presents proposed solutions for addressing two subtasks in SemEval-2023 Task 3: “Detecting the Genre, the Framing, and the Persuasion techniques in online news in a multi-lingual setup. In subtask 1, “News Genre Categorisation, the goal is to classify a news article as an opinion, a report, or a satire. In subtask 3, “Detection of Persuasion Technique, the system must reveal persuasion techniques used in each news article paragraph choosing among23 defined methods. Solutions leverage the application of the eXplainable Artificial Intelligence (XAI) method, Shapley Additive Explanations (SHAP). In subtask 1, SHAP was used to understand what was driving the model to fail so that it could be improved accordingly. In contrast, in subtask 3, a re-calibration of the Attention Mechanism was realized by extracting critical tokens for each persuasion technique. The underlying idea is the exploitation of XAI for countering the overfitting of the resulting model and attempting to improve the performance when there are few samples in the training data. The achieved performance on English for subtask 1 ranked 6th with an F1-score of 58.6% (despite 78.4% of the 1st) and for subtask 3 ranked 12th with a micro-averaged F1-score of 29.8% (despite 37.6% of the 1st).

Sexism is an injustice afflicting women and has become a common form of oppression in social media. In recent years, the automatic detection of sexist instances has been utilized to combat this oppression. The Subtask A of SemEval-2023 Task 10, Explainable Detection of Online Sexism, aims to detect whether an English-language post is sexist. In this paper, we describe our system for the competition. The structure of the classification model is based on RoBERTa, and we further pre-train it on the domain corpus. For fine-tuning, we adopt Unsupervised Data Augmentation (UDA), a semi-supervised learning approach, to improve the robustness of the system. Specifically, we employ Easy Data Augmentation (EDA) method as the noising operation for consistency training. We train multiple models based on different hyperparameter settings and adopt the majority voting method to predict the labels of test entries. Our proposed system achieves a Macro-F1 score of 0.8352 and a ranking of 41/84 on the leaderboard of Subtask A.

pdf abs
NetEase.AI at SemEval-2023 Task 2: Enhancing Complex Named Entities Recognition in Noisy Scenarios via Text Error Correction and External Knowledge
Ruixuan Lu | Zihang Tang | Guanglong Hu | Dong Liu | Jiacheng Li

Complex named entities (NE), like the titles of creative works, are not simple nouns and pose challenges for NER systems. In the SemEval 2023, Task 2: MultiCoNER II was proposed, whose goal is to recognize complex entities against out of knowledge-base entities and noisy scenarios. To address the challenges posed by MultiCoNER II, our team NetEase.AI proposed an entity recognition system that integrates text error correction system and external knowledge, which can recognize entities in scenes that contain entities out of knowledge base and text with noise. Upon receiving an input sentence, our systems will correct the sentence, extract the entities in the sentence as candidate set using the entity recognition model that incorporates the gazetteer information, and then use the external knowledge to classify the candidate entities to obtain entity type features. Finally, our system fused the multi-dimensional features of the candidate entities into a stacking model, which was used to select the correct entities from the candidate set as the final output. Our system exhibited good noise resistance and excellent entity recognition performance, resulting in our team’s first place victory in the Chinese track of MultiCoNER II.

pdf abs
IRIT_IRIS_A at SemEval-2023 Task 6: Legal Rhetorical Role Labeling Supported by Dynamic-Filled Contextualized Sentence Chunks
Alexandre Gomes de Lima | Jose G. Moreno | Eduardo H. da S. Aranha

This work presents and evaluates an approach to efficiently leverage the context exploitation ability of pre-trained Transformer models as a way of boosting the performance of models tackling the Legal Rhetorical Role Labeling task. The core idea is to feed the model with sentence chunks that are assembled in a way that avoids the insertion of padding tokens and the truncation of sentences and, hence, obtain better sentence embeddings. The achieved results show that our proposal is efficient, despite its simplicity, since models based on it overcome strong baselines by 3.76% in the worst case and by 8.71% in the best case.

pdf abs
Togedemaru at SemEval-2023 Task 8: Causal Medical Claim Identification and Extraction from Social Media Posts
Andra Oica | Daniela Gifu | Diana Trandabat

The “Causal Medical Claim Identification and Extraction from Social Media Posts task at SemEval 2023 competition focuses on identifying and validating medical claims in English, by posing two subtasks on causal claim identification and PIO (Population, Intervention, Outcome) frame extraction. In the context of SemEval, we present a method for sentence classification in four categories (claim, experience, experience_based_claim or a question) based on BioBERT model with a MLP layer. The website from which the dataset was gathered, Reddit, is a social news and content discussion site. The evaluation results show the effectiveness of the solution of this study (83.68%).

pdf abs
FramingFreaks at SemEval-2023 Task 3: Detecting the Category and the Framing of Texts as Subword Units with Traditional Machine Learning
Rosina Baumann | Sabrina Deisenhofer

This paper describes our participation as team FramingFreaks in the SemEval-2023 task 3 “Category and Framing Predictions in online news in a multi-lingual setup.” We participated in subtasks 1 and 2. Our approach was to classify texts by splitting them into subwords to reduce the feature set size and then using these tokens as input in Support Vector Machine (SVM) or logistic regression classifiers. Our results are similar to the baseline results.

pdf abs
Lazybob at SemEval-2023 Task 9: Quantifying Intimacy of Multilingual Tweets with Multi-Task Learning
Mengfei Yuan | Cheng Chen

This study presents a systematic method for analyzing the level of intimacy in tweets across ten different languages, using multi-task learning for SemEval 2023 Task 9: Multilingual Tweet Intimacy Analysis. The system begins with the utilization of the official training data, and then we experiment with different fine-tuning tricks and effective strategies, such as data augmentation, multi-task learning, etc. Through additional experiments, the approach is shown to be effective for the task. To enhance the model’s robustness, different transformer-based language models and some widely-used plug-and-play priors are incorporated into our system. Our final submission achieved a Pearson R of 0.6160 for the intimacy score on the official test set, placing us at the top of the leader board among 45 teams.

pdf abs
HITSZQ at SemEval-2023 Task 10: Category-aware Sexism Detection Model with Self-training Strategy
Ziyi Yao | Heyan Chai | Jinhao Cui | Siyu Tang | Qing Liao

This paper describes our system used in the SemEval-2023 \textit{Task 10 Explainable Detection of Online Sexism (EDOS)}. Specifically, we participated in subtask B: a 4-class sexism classification task, and subtask C: a more fine-grained (11-class) sexism classification task, where it is necessary to predict the category of sexism. We treat these two subtasks as one multi-label hierarchical text classification problem, and propose an integrated sexism detection model for improving the performance of the sexism detection task. More concretely, we use the pre-trained BERT model to encode the text and class label and a hierarchy-relevant structure encoder is employed to model the relationship between classes of subtasks B and C. Additionally, a self-training strategy is designed to alleviate the imbalanced problem of distribution classes. Extensive experiments on subtasks B and C demonstrate the effectiveness of our proposed approach.

pdf abs
mCPT at SemEval-2023 Task 3: Multilingual Label-Aware Contrastive Pre-Training of Transformers for Few- and Zero-shot Framing Detection
Markus Reiter-Haas | Alexander Ertl | Kevin Innerhofer | Elisabeth Lex

This paper presents the winning system for the zero-shot Spanish framing detection task, which also achieves competitive places in eight additional languages. The challenge of the framing detection task lies in identifying a set of 14 frames when only a few or zero samples are available, i.e., a multilingual multi-label few- or zero-shot setting. Our developed solution employs a pre-training procedure based on multilingual Transformers using a label-aware contrastive loss function. In addition to describing the system, we perform an embedding space analysis and ablation study to demonstrate how our pre-training procedure supports framing detection to advance computational framing analysis.

pdf abs
SSNSheerinKavitha at SemEval-2023 Task 7: Semantic Rule Based Label Prediction Using TF-IDF and BM25 Techniques
Sheerin Sitara Noor Mohamed | Kavitha Srinivasan

The advancement in the healthcare sector assures improved diagnosis and supports appropriate decision making in medical domain. The medical domain data can be either radiology images or clinical data. The clinical data plays a major role in the healthcare sector by preventing and treating the health problem based on the evidence learned from the trials. This paper is related to multi-evidence natural language inference for clinical trial data analysis and its solution for the given subtasks (SemEval 2023 Task 7 - NLI4CT). In subtask 1 of NLI4CT, the inference relationship (entailment or contradiction) between the Clinical Trial Reports (CTRs) statement pairs with respect to the Clinical Trial Data (CTD) statement are determined. In subtask 2 of NLI4CT, predicted label (inference relationship) are defined and justified using set of supporting facts extracted from the premises. The objective of this work is to derive the conclusion from premises (CTRs statement pairs) and extracting the supporting premises using proposed Semantic Rule based Clinical Data Analysis (SRCDA) approach. From the results, the proposed model attained an highest F1-score of 0.667 and 0.716 for subtasks 1 and 2 respectively. The novelty of this proposed approach includes, creation of External Knowledge Base (EKB) along with its suitable semantic rules based on the input statements.

pdf abs
Janko at SemEval-2023 Task 2: Bidirectional LSTM Model Based on Pre-training for Chinese Named Entity Recognition
Jiankuo Li | Zhengyi Guan | Haiyan Ding

This paper describes the method we submitted as the Janko team in the SemEval-2023 Task 2,Multilingual Complex Named Entity Recognition (MultiCoNER 2). We only participated in the Chinese track. In this paper, we implement the BERT-BiLSTM-RDrop model. We use the fine-tuned BERT models, take the output of BERT as the input of the BiLSTM network, and finally use R-Drop technology to optimize the loss function. Our submission achieved a macro-averaged F1 score of 0.579 on the testset.

pdf abs
HHS at SemEval-2023 Task 10: A Comparative Analysis of Sexism Detection Based on the RoBERTa Model
Yao Zhang | Liqing Wang

This paper describes the methods and models applied by our team HHS in SubTask-A of SemEval-2023 Task 10 about sexism detection. In this task, we trained with the officially released data and analyzed the performance of five models, TextCNN, BERT, RoBERTa, XLNet, and Sup-SimCSE-RoBERTa. The experiments show that most of the models can achieve good results. Then, we tried data augmentation, model ensemble, dropout, and other operations on several of these models, and compared the results for analysis. In the end, the most effective approach that yielded the best results on the test set involved the following steps: enhancing the sexist data using dropout, feeding it as input to the Sup-SimCSE-RoBERTa model, and providing the raw data as input to the XLNet model. Then, combining the outputs of the two methods led to even better results. This method yielded a Macro-F1 score of 0.823 in the final evaluation phase of the SubTask-A of the competition.

This paper describes an approach to automat- ically close the knowledge gap of Clickbait- Posts via a transformer model trained for Question-Answering, augmented by a task- specific post-processing step. This was part of the SemEval 2023 Clickbait shared task (Frbe et al., 2023a) - specifically task 2. We devised strategies to improve the existing model to fit the task better, e.g. with different special mod- els and a post-processor tailored to different inherent challenges of the task. Furthermore, we explored the possibility of expanding the original training data by using strategies from Heuristic Labeling and Semi-Supervised Learn- ing. With those adjustments, we were able to improve the baseline by 9.8 percentage points to a BLEU-4 score of 48.0%.

pdf abs
University at Buffalo at SemEval-2023 Task 11: MASDA–Modelling Annotator Sensibilities through DisAggregation
Michael Sullivan | Mohammed Yasin | Cassandra L. Jacobs

Modeling the most likely label when an annotation task is perspective-dependent discards relevant sources of variation that come from the annotators themselves. We present three approaches to modeling the controversiality of a particular text. First, we explicitly represented annotators using annotator embeddings to predict the training signals of each annotator’s selections in addition to a majority class label. This method leads to reduction in error relative to models without these features, allowing the overall result to influence the weights of each annotator on the final prediction. In a second set of experiments, annotators were not modeled individually but instead annotator judgments were combined in a pairwise fashion that allowed us to implicitly combine annotators. Overall, we found that aggregating and explicitly comparing annotators’ responses to a static document representation produced high-quality predictions in all datasets, though some systems struggle to account for large or variable numbers of annotators.

pdf abs
SINAI at SemEval-2023 Task 10: Leveraging Emotions, Sentiments, and Irony Knowledge for Explainable Detection of Online Sexism
María Estrella Vallecillo Rodrguez | Flor Miriam Plaza Del Arco | L. Alfonso Ureña López | M. Teresa Martín Valdivia

This paper describes the participation of SINAI research team in the Explainable Detection of Online Sexism (EDOS) Shared Task at SemEval 2023. Specifically, we participate in subtask A (binary sexism detection), subtask B (category of sexism), and subtask C (fine-grained vector of sexism). For the three subtasks, we propose a system that integrates information related to emotions, sentiments, and irony in order to check whether these features help detect sexism content. Our team ranked 46th in subtask A, 37th in subtask B, and 29th in subtask C, achieving 0.8245, 0.6043, and 0.4376 of macro f1-score, respectively, among the participants.

pdf abs
Saama AI Research at SemEval-2023 Task 7: Exploring the Capabilities of Flan-T5 for Multi-evidence Natural Language Inference in Clinical Trial Data
Kamal Raj Kanakarajan | Malaikannan Sankarasubbu

The goal of the NLI4CT task is to build a Natural Language Inference system for Clinical Trial Reports that will be used for evidence interpretation and retrieval. Large Language models have demonstrated state-of-the-art performance in various natural language processing tasks across multiple domains. We suggest using an instruction-finetuned Large Language Models (LLMs) to take on this particular task in light of these developments. We have evaluated the publicly available LLMs under zeroshot setting, and finetuned the best performing Flan-T5 model for this task. On the leaderboard, our system ranked second, with an F1 Score of 0.834 on the official test set.

pdf abs
UM6P at SemEval-2023 Task 12: Out-Of-Distribution Generalization Method for African Languages Sentiment Analysis
Abdelkader El Mahdaouy | Hamza Alami | Salima Lamsiyah | Ismail Berrada

This paper presents our submitted system to AfriSenti SemEval-2023 Task 12: Sentiment Analysis for African Languages. The AfriSenti consists of three different tasks, covering monolingual, multilingual, and zero-shot sentiment analysis scenarios for African languages. To improve model generalization, we have explored the following steps: 1) further pre-training of the AfroXLM Pre-trained Language Model (PLM), 2) combining AfroXLM and MARBERT PLMs using a residual layer, and 3) studying the impact of metric learning and two out-of-distribution generalization training objectives. The overall evaluation results show that our system has achieved promising results on several sub-tasks of Task A. For Tasks B and C, our system is ranked among the top six participating systems.

pdf abs
MarSan at SemEval-2023 Task 10: Can Adversarial Training with help of a Graph Convolutional Network Detect Explainable Sexism?
Ehsan Tavan | Maryam Najafi

This paper describes SemEval-2022’s shared task “Explainable Detection of Online Sexism”. The fine-grained classification of sexist content plays a major role in building explainable frameworks for online sexism detection. We hypothesize that by encoding dependency information using Graph Convolutional Networks (GCNs) we may capture more stylistic information about sexist contents. Online sexism has the potential to cause significant harm to women who are the targets of such behavior. It not only creates unwelcoming and inaccessible spaces for women online but also perpetuates social asymmetries and injustices. We believed improving the robustness and generalization ability of neural networks during training will allow models to capture different belief distributions for sexism categories. So we proposed adversarial training with GCNs for explainable detection of online sexism. In the end, our proposed method achieved very competitive results in all subtasks and shows that adversarial training of GCNs is a promising method for the explainable detection of online sexism.

pdf abs
UZH_CLyp at SemEval-2023 Task 9: Head-First Fine-Tuning and ChatGPT Data Generation for Cross-Lingual Learning in Tweet Intimacy Prediction
Andrianos Michail | Stefanos Konstantinou | Simon Clematide

This paper describes the submission of UZH_CLyp for the SemEval 2023 Task 9 “Multilingual Tweet Intimacy Analysis. We achieved second-best results in all 10 languages according to the official Pearson’s correlation regression evaluation measure. Our cross-lingual transfer learning approach explores the benefits of using a Head-First Fine-Tuning method (HeFiT) that first updates only the regression head parameters and then also updates the pre-trained transformer encoder parameters at a reduced learning rate. Additionally, we study the impact of using a small set of automatically generated examples (in our case, from ChatGPT) for low-resource settings where no human-labeled data is available. Our study shows that HeFiT stabilizes training and consistently improves results for pre-trained models that lack domain adaptation to tweets. Our study also shows a noticeable performance increase in cross-lingual learning when synthetic data is used, confirming the usefulness of current text generation systems to improve zeroshot baseline results. Finally, we examine how possible inconsistencies in the annotated data contribute to cross-lingual interference issues.

pdf abs
CICL_DMS at SemEval-2023 Task 11: Learning With Disagreements (Le-Wi-Di)
Dennis Grötzinger | Simon Heuschkel | Matthias Drews

In this system paper, we describe our submission for the 11th task of SemEval2023: Learning with Disagreements, or Le-Wi-Di for short. In the task, the assumption that there is a single gold label in NLP tasks such as hate speech or misogyny detection is challenged, and instead the opinions of multiple annotators are considered. The goal is instead to capture the agreements/disagreements of the annotators. For our system, we utilize the capabilities of modern large-language models as our backbone and investigate various techniques built on top, such as ensemble learning, multi-task learning, or Gaussian processes. Our final submission shows promising results and we achieve an upper-half finish.

pdf abs
Aristoxenus at SemEval-2023 Task 4: A Domain-Adapted Ensemble Approach to the Identification of Human Values behind Arguments
Dimitrios Zaikis | Stefanos D. Stefanidis | Konstantinos Anagnostopoulos | Ioannis Vlahavas

This paper presents our system for the SemEval-2023 Task 4, which aims to identify human values behind arguments by classifying whether or not an argument draws on a specific category. Our approach leverages a second-phase pre-training method to adapt a RoBERTa Language Model (LM) and tackles the problem using a One-Versus-All strategy. Final predictions are determined by a majority voting module that combines the outputs of an ensemble of three sets of per-label models. We conducted experiments to evaluate the impact of different pre-trained LMs on the task, comparing their performance in both pre-trained and task-adapted settings. Our findings show that fine-tuning the RoBERTa LM on the task-specific dataset improves its performance, outperforming the best-performing baseline BERT approach. Overall, our approach achieved a macro-F1 score of 0.47 on the official test set, demonstrating its potential in identifying human values behind arguments.

pdf abs
Augustine of Hippo at SemEval-2023 Task 4: An Explainable Knowledge Extraction Method to Identify Human Values in Arguments with SuperASKE
Alfio Ferrara | Sergio Picascia | Elisabetta Rocchetti

In this paper we present and discuss the results achieved by the “Augustine of Hippo” team at SemEval-2023 Task 4 about human value detection. In particular, we provide a quantitative and qualitative reviews of the results obtained by SuperASKE, discussing respectively performance metrics and classification errors. Finally, we present our main contribution: an explainable and unsupervised approach mapping arguments to concepts, followed by a supervised classification model mapping concepts to human values.

pdf abs
UIO at SemEval-2023 Task 12: Multilingual fine-tuning for sentiment classification in low-resource Languages
Egil Rønningstad

Our contribution to the 2023 AfriSenti-SemEval shared task 12: Sentiment Analysis for African Languages, provides insight into how a multilingual large language model can be a resource for sentiment analysis in languages not seen during pretraining. The shared task provides datasets of a variety of African languages from different language families. The languages are to various degrees related to languages used during pretraining, and the language data contain various degrees of code-switching. We experiment with both monolingual and multilingual datasets for the final fine-tuning, and find that with the provided datasets that contain samples in the thousands, monolingual fine-tuning yields the best results.

pdf abs
UMUTeam at SemEval-2023 Task 11: Ensemble Learning applied to Binary Supervised Classifiers with disagreements
José Antonio García-Díaz | Ronghao Pan | Gema Alcaráz-Mármol | María José Marín-Pérez | Rafael Valencia-García

This paper describes the participation of the UMUTeam in the Learning With Disagreements (Le-Wi-Di) shared task proposed at SemEval 2023, which objective is the development of supervised automatic classifiers that consider, during training, the agreements and disagreements among the annotators of the datasets. Specifically, this edition includes a multilingual dataset. Our proposal is grounded on the development of ensemble learning classifiers that combine the outputs of several Large Language Models. Our proposal ranked position 18 of a total of 30 participants. However, our proposal did not incorporate the information about the disagreements. In contrast, we compare the performance of building several classifiers for each dataset separately with a merged dataset.

pdf abs
Matt Bai at SemEval-2023 Task 5: Clickbait spoiler classification via BERT
Nukit Tailor | Radhika Mamidi

The Clickbait Spoiling shared task aims at tackling two aspects of spoiling: classifying the spoiler type based on its length and generating the spoiler. This paper focuses on the task of classifying the spoiler type. Better classification of the spoiler type would eventually help in generating a better spoiler for the post. We use BERT-base (cased) to classify the clickbait posts. The model achieves a balanced accuracy of 0.63 as we give only the post content as the input to our model instead of the concatenation of the post title and post content to find out the differences that the post title might be bringing in.

pdf abs
shefnlp at SemEval-2023 Task 10: Compute-Efficient Category Adapters
Thomas Pickard | Tyler Loakman | Mugdha Pandya

As social media platforms grow, so too does the volume of hate speech and negative sentiment expressed towards particular social groups. In this paper, we describe our approach to SemEval-2023 Task 10, involving the detection and classification of online sexism (abuse directed towards women), with fine-grained categorisations intended to facilitate the development of a more nuanced understanding of the ideologies and processes through which online sexism is expressed. We experiment with several approaches involving language model finetuning, class-specific adapters, and pseudo-labelling. Our best-performing models involve the training of adapters specific to each subtask category (combined via fusion layers) using a weighted loss function, in addition to performing naive pseudo-labelling on a large quantity of unlabelled data. We successfully outperform the baseline models on all 3 subtasks, placing 56th (of 84) on Task A, 43rd (of 69) on Task B,and 37th (of 63) on Task C.

pdf abs
xiacui at SemEval-2023 Task 11: Learning a Model in Mixed-Annotator Datasets Using Annotator Ranking Scores as Training Weights
Xia Cui

This paper describes the development of a system for SemEval-2023 Shared Task 11 on Learning with Disagreements (Le-Wi-Di). Labelled data plays a vital role in the development of machine learning systems. The human-annotated labels are usually considered the truth for training or validation. To obtain truth labels, a traditional way is to hire domain experts to perform an expensive annotation process. Crowd-sourcing labelling is comparably cheap, whereas it raises a question on the reliability of annotators. A common strategy in a mixed-annotator dataset with various sets of annotators for each instance is to aggregate the labels among multiple groups of annotators to obtain the truth labels. However, these annotators might not reach an agreement, and there is no guarantee of the reliability of these labels either. With further problems caused by human label variation, subjective tasks usually suffer from the different opinions provided by the annotators. In this paper, we propose two simple heuristic functions to compute the annotator ranking scores, namely AnnoHard and AnnoSoft, based on the hard labels (i.e., aggregative labels) and soft labels (i.e., cross-entropy values). By introducing these scores, we adjust the weights of the training instances to improve the learning with disagreements among the annotators.

pdf abs
Team ISCL_WINTER at SemEval-2023 Task 12:AfriSenti-SemEval: Sentiment Analysis for Low-resource African Languages using Twitter Dataset
Alina Hancharova | John Wang | Mayank Kumar

This paper presents a study on the effectiveness of various approaches for addressing the challenge of multilingual sentiment analysis in low-resource African languages. . The approaches evaluated in the study include Support Vector Machines (SVM), translation, and an ensemble of pre-trained multilingual sentimental models methods. The paper provides a detailed analysis of the performance of each approach based on experimental results. In our findings, we suggest that the ensemble method is the most effective with an F1-Score of 0.68 on the final testing. This system ranked 19 out of 33 participants in the competition.

pdf abs
Jack-Ryder at SemEval-2023 Task 5: Zero-Shot Clickbait Spoiling by Rephrasing Titles as Questions
Dirk Wangsadirdja | Jan Pfister | Konstantin Kobs | Andreas Hotho

In this paper, we describe our approach to the clickbait spoiling task of SemEval 2023.The core idea behind our system is to leverage pre-trained models capable of Question Answering (QA) to extract the spoiler from article texts based on the clickbait title without any task-specific training. Since oftentimes, these titles are not phrased as questions, we automatically rephrase the clickbait titles as questions in order to better suit the pretraining task of the QA-capable models. Also, to fit as much relevant context into the model’s limited input size as possible, we propose to reorder the sentences by their relevance using a semantic similarity model. Finally, we evaluate QA as well as text generation models (via prompting) to extract the spoiler from the text. Based on the validation data, our final model selects each of these components depending on the spoiler type and achieves satisfactory zero-shot results. The ideas described in this paper can easily be applied in fine-tuning settings.

pdf abs
MLModeler5 at SemEval-2023 Task 3: Detecting the Category and the Framing Techniques in Online News in a Multi-lingual Setup
Arjun Khanchandani | Nitansh Jain | Jatin Bedi

System Description Paper for Task 3 Subtask 1 and 2 of Semeval 2023. The paper describes our approach to handling the News Genre Categorisation and Framing Detection using RoBERTa and ALBERT models.

pdf abs
DS at SemEval-2023 Task 10: Explaining Online Sexism using Transformer based Approach
Madisetty Padmavathi

In this paper, I describe the approach used in the SemEval 2023 - Task 10 Explainable Detection of Online Sexism (EDOS) competition (Kirk et al., 2023). I use different transformermodels, including BERT and RoBERTa which were fine-tuned on the EDOS dataset to classify text into different categories of sexism. I participated in three subtasks: subtask A is to classify given text as either sexist or not, while subtask B is to identify the specific category of sexism, such as (1) threats, (2) derogation, (3) animosity, (4) prejudiced discussions. Finally, subtask C involves predicting a finegrained vector representation of sexism, which included information about the severity, target and type of sexism present in the text. The use of transformer models allows the system to learn from the input data and make predictions on unseen text. By fine-tuning the models on the EDOS dataset, the system can improve its performance on the specific task of detecting online sexism. I got the following macro F1 scores: subtask A:77.16, subtask B: 46.11, and subtask C: 30.2.

pdf abs
FII_Better at SemEval-2023 Task 2: MultiCoNER II Multilingual Complex Named Entity Recognition
Viorica-Camelia Lupancu | Alexandru-Gabriel Platica | Cristian-Mihai Rosu | Daniela Gifu | Diana Trandabat

This task focuses on identifying complex named entities (NEs) in several languages. In the context of SemEval-2023 competition, our team presents an exploration of a base transformer model’s capabilities regarding the task, focused more specifically on five languages (English, Spanish, Swedish, German, Italian). We take DistilBERT and BERT as two examples of basic transformer models, using DistilBERT as a baseline and BERT as the platform to create an improved model. The dataset that we are using, MultiCoNER II, is a large multilingual dataset used for NER, that covers domains like: Wiki sentences, questions and search queries across 12 languages. This dataset contains 26M tokens and it is assembled from public resources. MultiCoNER II defines a NER tag-set with 6 classes and 67 tags. We have managed to get moderate results in the English track (we ranked 17th out of 34), while our results in the other tracks could be further improved in the future (overall third to last).

pdf abs
Brainstormers_msec at SemEval-2023 Task 10: Detection of sexism related comments in social media using deep learning
C. Jerin Mahibha | C. M Swaathi | R. Jeevitha | R. Princy Martina | Durairaj Thenmozhi

Social media is the media through which people share their thoughts and opinions. This has both its pros and cons which depends on the type of information being conveyed. If any information conveyed over social media hurts or affects a person, such information can be removed as it may disturb their mental health and may decrease their self confidence. During the last decade, hateful and sexist content towards women in being increasingly spread on social networks. The exposure to sexist speech has serious consequences to women’s life and limits their freedom of speech. Sexism is expressed in very different forms: it includes subtle stereotypes and attitudes that, although frequently unnoticed, are extremely harmful for both women and society. Sexist comments have a major impact on women being subjected to it. We as a team participated in the shared task Explainable Detection of Online Sexism (EDOS) at SemEval 2023 and have proposed a model which identifies the sexist comments and its type from English social media posts using the data set shared for the task. Different transformer model like BERT , DistilBERT and RoBERT are used by the proposed model for implementing all the three tasks shared by EDOS. On using the BERT model, macro F1 score of 0.8073, 0.5876 and 0.3729 are achieved for Task A, Task B and Task C respectively.

pdf abs
VTCC-NLP at SemEval-2023 Task 6:Long-Text Representation Based on Graph Neural Network for Rhetorical Roles Prediction
Hiep Nguyen | Hoang Ngo | Nam Bui

Rhetorical Roles (RR) prediction is to predict the label of each sentence in legal documents, which is regarded as an emergent task for legal document understanding. In this study, we present a novel method for the RR task by exploiting the long context representation. Specifically, legal documents are known as long texts, in which previous works have no ability to consider the inherent dependencies among sentences. In this paper, we propose GNNRR (Graph Neural Network for Rhetorical Roles Prediction), which is able to model the cross-information for long texts. Furthermore, we develop multitask learning by incorporating label shift prediction (LSP) for segmenting a legal document. The proposed model is evaluated on the SemEval 2023 Task 6 - Legal Eval Understanding Legal Texts for RR sub-task. Accordingly, our method achieves the top 4 in the public leaderboard of the sub-task. Our source code is available for further investigation\footnote{https://github.com/hiepnh137/SemEval2023-Task6-Rhetorical-Roles}.

pdf abs
Minanto at SemEval-2023 Task 2: Fine-tuning XLM-RoBERTa for Named Entity Recognition on English Data
Antonia Höfer | Mina Mottahedin

Within the scope of the shared task MultiCoNER II our aim was to improve the recognition of named entities in English. We as team Minanto fine-tuned a cross-lingual model for Named Entity Recognition on English data and achieved an average F1 score of 51.47\% in the final submission. We found that a monolingual model works better on English data than a cross-lingual and that the input of external data from earlier Named Entity Recognition tasks provides only minor improvements. In this paper we present our system, discuss our results and analyze the impact of external data.

pdf abs
SAB at SemEval-2023 Task 2: Does Linguistic Information Aid in Named Entity Recognition?
Siena Biales

This paper describes the submission to SemEval-2023 Task 2: Multilingual Complex Named Entity Recognition (MultiCoNER II) by team SAB. This task aims to encourage growth in the field of Named Entity Recognition (NER) by focusing on complex and difficult categories of entities, in 12 different language tracks. The task of NER has historically shown the best results when a model incorporates an external knowledge base or gazetteer, however, less research has been applied to examining the effects of incorporating linguistic information into the model. In this task, we explored combining NER, part-of-speech (POS), and dependency relation labels into a multi-task model and report on the findings. We determine that the addition of POS and dependency relation information in this manner does not improve results.

pdf abs
UniBoe’s at SemEval-2023 Task 10: Model-Agnostic Strategies for the Improvement of Hate-Tuned and Generative Models in the Classification of Sexist Posts
Arianna Muti | Francesco Fernicola | Alberto Barrón-Cedeño

We present our submission to SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS). We address all three tasks: Task A consists of identifying whether a post is sexist. If so, Task B attempts to assign it one of four categories: threats, derogation, animosity, and prejudiced discussions. Task C aims for an even more fine-grained classification, divided among 11 classes. Our team UniBoe’s experiments with fine-tuning of hate-tuned Transformer-based models and priming for generative models. In addition, we explore model-agnostic strategies, such as data augmentation techniques combined with active learning, as well as obfuscation of identity terms. Our official submissions obtain an F1_score of 0.83 for Task A, 0.58 for Task B and 0.32 for Task C.

pdf abs
NLPeople at SemEval-2023 Task 2: A Staged Approach for Multilingual Named Entity Recognition
Mohab Elkaref | Nathan Herr | Shinnosuke Tanaka | Geeth De Mel

The MultiCoNER II shared task aims at detecting complex, ambiguous named entities with fine-grained types in a low context setting. Previous winning systems incorporated external knowledge bases to retrieve helpful contexts. In our submission we additionally propose splitting the NER task into two stages, a Span Extraction Step, and an Entity Classification step. Our results show that the former does not suffer from the low context setting comparably, and in so leading to a higher overall performance for an external KB-assisted system. We achieve 3rd place on the multilingual track and an average of 6th place overall.

pdf abs
NITK_LEGAL at SemEval-2023 Task 6: A Hierarchical based system for identification of Rhetorical Roles in legal judgements
Patchipulusu Sindhu | Diya Gupta | Sanjeevi Meghana | Anand Kumar M

The ability to automatically recognise the rhetorical roles of sentences in a legal case judgement is a crucial challenge to tackle since it can be useful for a number of activities that come later, such as summarising legal judgements and doing legal searches. The task is exigent since legal case documents typically lack structure, and their rhetorical roles could be subjective. This paper describes SemEval-2023 Task 6: LegalEval: Understanding Legal Texts, Sub-task A: Rhetorical Roles Prediction (RR). We propose a system to automatically generate rhetorical roles of all the sentences in a legal case document using Hierarchical Bi-LSTM CRF model and RoBERTa transformer. We also showcase different techniques used to manipulate dataset to generate a set of varying embeddings and train the Hierarchical Bi-LSTM CRF model to achieve better performance. Among all, model trained with the sent2vec embeddings concatenated with the handcrafted features perform better with the micro f1-score of 0.74 on test data.

In this paper, we have performed sentiment analysis on three African languages (Hausa, Swahili, and Yoruba). We used various deep learning and traditional models paired with a vectorizer for classification and data -preprocessing. We have also used a few data oversampling methods to handle the imbalanced text data. Thus, we could analyze the performance of those models in all the languages by using weighted and macro F1 scores as evaluation metrics.

pdf abs
HHU at SemEval-2023 Task 3: An Adapter-based Approach for News Genre Classification
Fabian Billert | Stefan Conrad

This paper describes our approach for Subtask 1 of Task 3 at SemEval-2023. In this subtask, task participants were asked to classify multilingual news articles for one of three classes: Reporting, Opinion Piece or Satire. By training an AdapterFusion layer composing the task-adapters from different languages, we successfully combine the language-exclusive knowledge and show that this improves the results in nearly all cases, including in zero-shot scenarios.

pdf abs
GMNLP at SemEval-2023 Task 12: Sentiment Analysis with Phylogeny-Based Adapters
Md Mahfuz Ibn Alam | Ruoyu Xie | Fahim Faisal | Antonios Anastasopoulos

This report describes GMU’s sentiment analysis system for the SemEval-2023 shared task AfriSenti-SemEval. We participated in all three sub-tasks: Monolingual, Multilingual, and Zero-Shot. Our approach uses models initialized with AfroXLMR-large, a pre-trained multilingual language model trained on African languages and fine-tuned correspondingly. We also introduce augmented training data along with original training data. Alongside finetuning, we perform phylogeny-based adapter-tuning to create several models and ensemble the best models for the final submission. Our system achieves the best F1-score on track 5: Amharic, with 6.2 points higher F1-score than the second-best performing system on this track. Overall, our system ranks 5th among the 10 systems participating in all 15 tracks.

pdf abs
Silp_nlp at SemEval-2023 Task 2: Cross-lingual Knowledge Transfer for Mono-lingual Learning
Sumit Singh | Uma Tiwary

Our team silp_nlp participated in SemEval2023 Task 2: MultiCoNER II. Our work made systems for 11 mono-lingual tracks. For leveraging the advantage of all track knowledge we chose transformer-based pretrained models, which have strong cross-lingual transferability. Hence our model trained in two stages, the first stage for multi-lingual learning from all tracks and the second for fine-tuning individual tracks. Our work highlights that the knowledge of all tracks can be transferred to an individual track if the baseline language model has crosslingual features. Our system positioned itself in the top 10 for 4 tracks by scoring 0.7432 macro F1 score for the Hindi track ( 7th rank ) and 0.7322 macro F1 score for the Bangla track ( 9th rank ).

pdf abs
TechSSN at SemEval-2023 Task 12: Monolingual Sentiment Classification in Hausa Tweets
Nishaanth Ramanathan | Rajalakshmi Sivanaiah | Angel Deborah S | Mirnalinee Thanka Nadar Thanagathai

This paper elaborates on our work in designing a system for SemEval 2023 Task 12: AfriSentiSemEval, which involves sentiment analysis for low-resource African languages using the Twitter dataset. We utilised a pre-trained model to perform sentiment classification in Hausa language tweets. We used a multilingual version of the roBERTa model, which is pretrained on 100 languages, to classify sentiments in Hausa. To tokenize the text, we used the AfriBERTa model, which is specifically pretrained on African languages.

Using pre-trained language models to implement classifiers from small to modest amounts of training data is an area of active research. The ability of large language models to generalize from few-shot examples and to produce strong classifiers is extended using the engineering approach of parameter-efficient tuning. Using the Explainable Detection of Online Sexism (EDOS) training data and a small number of trainable weights to create a tuned prompt vector, a competitive model for this task was built, which was top-ranked in Subtask B.

pdf abs
Clark Kent at SemEval-2023 Task 5: SVMs, Transformers, and Pixels for Clickbait Spoiling
Dragos-stefan Mihalcea | Sergiu Nisioi

In this paper we present an analysis of our approaches for the 2023 SemEval-2023 Clickbait Challenge. We only participated in the sub-task aiming at identifying different clikcbait spoiling types comparing several machine learning and deep learning approaches. Our analysis confirms previous results on this task and show that automatic methods are able to reach approximately 70\% accuracy at predicting what type of additional content is needed to mitigate sensationalistic posts on social media. Furthermore, we provide a qualitative analysis of the results, showing that the models may do better in practice than the metric indicates since the evaluate does not depend only on the predictor, but also on the typology we choose to define clickbait spoiling.

pdf abs
Team JUSTR00 at SemEval-2023 Task 3: Transformers for News Articles Classification
Ahmed Al-Qarqaz | Malak Abdullah

The SemEval-2023 Task 3 competition offers participants a multi-lingual dataset with three schemes one for each subtask. The competition challenges participants to construct machine learning systems that can categorize news articles based on their nature and style of writing. We esperiment with many state-of-the-art transformer-based language models proposed in the natural language processing literature and report the results of the best ones. Our top performing model is based on a transformer called “Longformer” and has achieved an F1-Micro score of 0.256 on the English version of subtask-1 and F1-Macro of 0.442 on subtask-2 on the test data. We also experiment with a number of state-of-the-art multi-lingual transformer-based models and report the results of the best performing ones.

pdf abs
Sam Miller at SemEval-2023 Task 5: Classification and Type-specific Spoiler Extraction Using XLNET and Other Transformer Models
Pia Störmer | Tobias Esser | Patrick Thomasius

This paper proposes an approach to classify andan approach to generate spoilers for clickbaitarticles and posts. For the spoiler classification,XLNET was trained to fine-tune a model. Withan accuracy of 0.66, 2 out of 3 spoilers arepredicted accurately. The spoiler generationapproach involves preprocessing the clickbaittext and post-processing the output to fit thespoiler type. The approach is evaluated on atest dataset of 1000 posts, with the best resultfor spoiler generation achieved by fine-tuninga RoBERTa Large model with a small learningrate and sample size, reaching a BLEU scoreof 0.311. The paper provides an overview ofthe models and techniques used and discussesthe experimental setup.

pdf abs
DUTH at SemEval-2023 Task 9: An Ensemble Approach for Twitter Intimacy Analysis
Giorgos Arampatzis | Vasileios Perifanis | Symeon Symeonidis | Avi Arampatzis

This work presents the approach developed by the DUTH team for participating in the SemEval-2023 Task 9: Multilingual Tweet Intimacy Analysis. Our results show that pre-processing techniques do not affect the learning performance for the task of multilingual intimacy analysis. In addition, we show that fine-tuning a transformer-based model does not provide advantages over using the pre-trained model to generate text embeddings and using the resulting representations to train simpler and more efficient models such as MLP. Finally, we utilize an ensemble of classifiers, including three MLPs with different architectures and a CatBoost model, to improve the regression accuracy.

pdf abs
SSS at SemEval-2023 Task 10: Explainable Detection of Online Sexism using Majority Voted Fine-Tuned Transformers
Sriya Rallabandi | Sanchit Singhal | Pratinav Seth

This paper describes our submission to Task 10 at SemEval 2023-Explainable Detection of Online Sexism (EDOS), divided into three subtasks. The recent rise in social media platforms has seen an increase in disproportionate levels of sexism experienced by women on social media platforms. This has made detecting and explaining online sexist content more important than ever to make social media safer and more accessible for women. Our approach consists of experimenting and finetuning BERT-based models and using a Majority Voting ensemble model that outperforms individual baseline model scores. Our system achieves a macro F1 score of 0.8392 for Task A, 0.6092 for Task B, and 0.4319 for Task C.

pdf abs
QCRI at SemEval-2023 Task 3: News Genre, Framing and Persuasion Techniques Detection Using Multilingual Models
Maram Hasanain | Ahmed El-Shangiti | Rabindra Nath Nandi | Preslav Nakov | Firoj Alam

Misinformation spreading in mainstream and social media has been misleading users in different ways. Manual detection and verification efforts by journalists and fact-checkers can no longer cope with the great scale and quick spread of misleading information. This motivated research and industry efforts to develop systems for analyzing and verifying news spreading online. The SemEval-2023 Task 3 is an attempt to address several subtasks under this overarching problem, targeting writing techniques used in news articles to affect readers’ opinions. The task addressed three subtasks with six languages, in addition to three “surprise” test languages, resulting in 27 different test setups. This paper describes our participating system to this task. Our team is one of the 6 teams that successfully submitted runs for all setups. The official results show that our system is ranked among the top 3 systems for 10 out of the 27 setups.

pdf abs
ResearchTeam_HCN at SemEval-2023 Task 6: A knowledge enhanced transformers based legal NLP system
Dhanachandra Ningthoujam | Pinal Patel | Rajkamal Kareddula | Ramanand Vangipuram

This paper presents our work on LegalEval (understanding legal text), one of the tasks in SemEval-2023. It comprises of three sub-tasks namely Rhetorical Roles (RR), Legal Named Entity Recognition (L-NER), and Court Judge- ment Prediction with Explanation (CJPE). We developed different deep-learning models for each sub-tasks. For RR, we developed a multi- task learning model with contextual sequential sentence classification as the main task and non- contextual single sentence prediction as the sec- ondary task. Our model achieved an F1-score of 76.50% on the unseen test set, and we at- tained the 14th position on the leaderboard. For the L-NER problem, we have designed a hybrid model, consisting of a multi-stage knowledge transfer learning framework and a rule-based system. This model achieved an F1-score of 91.20% on the blind test set and attained the top position on the final leaderboard. Finally, for the CJPE task, we used a hierarchical ap- proach and could get around 66.67% F1-score on judgment prediction and 45.83% F1-score on the explainability of the CJPE task, and we attained 8th position on the leaderboard for this sub-task.

pdf abs
LSJSP at SemEval-2023 Task 2: FTBC: A FastText based framework with pre-trained BERT for NER
Shilpa Chatterjee | Leo Evenss | Pramit Bhattacharyya | Joydeep Mondal

This study introduces the system submitted to the SemEval 2022 Task 2: MultiCoNER II (Multilingual Complex Named Entity Recognition) by the LSJSP team. We propose FTBC, a FastText-based framework with pre-trained Bert for NER tasks with complex entities and over a noisy dataset. Our system achieves an average of 58.27% F1 score (fine-grained) and 75.79% F1 score (coarse-grained) across all languages. FTBC outperforms the baseline BERT-CRF model on all 12 monolingual tracks.

The web contains an abundance of user- generated content. While this content is useful for many applications, it poses many challenges due to the presence of offensive, biased, and overall toxic language. In this work, we present a system that identifies and classifies sexist content at different levels of granularity. Using transformer-based models, we explore the value of data augmentation, use of ensemble methods, and leverage in-context learning using foundation models to tackle the task. We evaluate the different components of our system both quantitatively and qualitatively. Our best systems achieve an F1 score of 0.84 for the binary classification task aiming to identify whether a given content is sexist or not and 0.64 and 0.47 for the two multi-class tasks that aim to identify the coarse and fine-grained types of sexism present in the given content respectively.

pdf abs
Rahul Patil at SemEval-2023 Task 1: V-WSD: Visual Word Sense Disambiguation
Rahul Patil | Pinal Patel | Charin Patel | Mangal Verma

Semeval 2023 task 1: VWSD, In this paper, we propose an ensemble of two Neural network systems that ranks 10 images given a word and limited textual context. We have used openAI Clip based models for the English language and multilingual text-to-text translation models for Farsi-to-English and Italian-to-English. Additionally, we propose a system that learns from multilingual bert-base embeddings for text and resnet101 embeddings for the image. Considering all the three languages into account this system has achieved the fourth rank.

pdf abs
PoSh at SemEval-2023 Task 10: Explainable Detection of Online Sexism
Shruti Sriram | Padma Pooja Chandran | Shrijith M R

To precisely identify the different forms of online sexism, we utilize several sentence transformer models such as ALBERT, BERT, RoBERTa, DistilBERT, and XLNet. By combining the predictions from these models, we can generate a more comprehensive and improved result. Each transformer model is trained after pre-processing the data from the training dataset, ensuring that the models are effective at detecting and classifying instances of online sexism. For Task A, the model had to classify the texts as sexist or not sexist. We implemented ALBERT, an NLP-based sentence transformer. For task B, we implemented BERT, RoBERTa, DistilBERT and XLNet and took the mode of predictions for each text as the final prediction for the given text. For task C, we implemented ALBERT, BERT, RoBERTa, DistilBERT and XLNet and took the mode of predictions as the final prediction for the given text.

pdf abs
Legal_try at SemEval-2023 Task 6: Voting Heterogeneous Models for Entities identification in Legal Documents
Junzhe Zhao | Yingxi Wang | Nicolay Rusnachenko | Huizhi Liang

Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP) that involves identifying and categorizing named entities. The result annotation makes unstructured natural texts applicable for other NLP tasks, including information retrieval, question answering, and machine translation. NER is also essential in legal as an initial stage in extracting relevant entities. However, legal texts contain domain-specific named entities, such as applicants, defendants, courts, statutes, and articles. The latter makes standard named entity recognizers incompatible with legal documents. This paper proposes an approach combining multiple models’ results via a voting mechanism for unique entity identification in legal texts. This endeavor focuses on extracting legal named entities, and the specific assignment (task B) is to create a legal NER system for unique entity annotation in legal documents. The results of our experiments and system implementation are published in https://github.com/SuperEDG/Legal_Project.

pdf abs
MDC at SemEval-2023 Task 7: Fine-tuning Transformers for Textual Entailment Prediction and Evidence Retrieval in Clinical Trials
Robert Bevan | Oisín Turbitt | Mouhamad Aboshokor

We present our entry to the Multi-evidence Natural Language Inference for Clinical Trial Datatask at SemEval 2023. We submitted entries forboth the evidence retrieval and textual entailment sub-tasks. For the evidence retrieval task,we fine-tuned the PubMedBERT transformermodel to extract relevant evidence from clinicaltrial data given a hypothesis concerning either asingle clinical trial or pair of clinical trials. Ourbest performing model achieved an F1 scoreof 0.804. For the textual entailment task, inwhich systems had to predict whether a hypothesis about either a single clinical trial or pair ofclinical trials is true or false, we fine-tuned theBioLinkBERT transformer model. We passedour evidence retrieval model’s output into ourtextual entailment model and submitted its output for the evaluation. Our best performingmodel achieved an F1 score of 0.695.

This paper describes our submission to the SemEval-2023 for Task 6 on LegalEval: Understanding Legal Texts. Our submission concentrated on three subtasks: Legal Named Entity Recognition (L-NER) for Task-B, Legal Judgment Prediction (LJP) for Task-C1, and Court Judgment Prediction with Explanation (CJPE) for Task-C2. We conducted various experiments on these subtasks and presented the results in detail, including data statistics and methodology. It is worth noting that legal tasks, such as those tackled in this research, have been gaining importance due to the increasing need to automate legal analysis and support. Our team obtained competitive rankings of 15th, 11th, and 1st in Task-B, Task-C1, and Task-C2, respectively, as reported on the leaderboard.

pdf abs
ChaPat at SemEval-2023 Task 9: Text Intimacy Analysis using Ensembles of Multilingual Transformers
Tanmay Chavan | Ved Patwardhan

Intimacy estimation of a given text has recently gained importance due to the increase in direct interaction of NLP systems with humans. Intimacy is an important aspect of natural language and has a substantial impact on our everyday communication. Thus the level of intimacy can provide us with deeper insights and richer semantics of conversations. In this paper, we present our work on the SemEval shared task 9 on predicting the level of intimacy for the given text. The dataset consists of tweets in ten languages, out of which only six are available in the training dataset. We conduct several experiments and show that an ensemble of multilingual models along with a language-specific monolingual model has the best performance. We also evaluate other data augmentation methods such as translation and present the results. Lastly, we study the results thoroughly and present some noteworthy insights into this problem.

Detecting harmful content on social media plat-forms is crucial in preventing the negative ef-fects these posts can have on social media users. This paper presents our methodology for tack-ling task 10 from SemEval23, which focuseson detecting and classifying online sexism insocial media posts. We constructed our solu-tion using an ensemble of transformer-basedmodels (that have been fine-tuned; BERTweet,RoBERTa, and DeBERTa). To alleviate the var-ious issues caused by the class imbalance inthe dataset provided and improve the general-ization of our model, our framework employsdata augmentation and semi-supervised learn-ing. Specifically, we use back-translation fordata augmentation in two scenarios: augment-ing the underrepresented class and augment-ing all classes. In this study, we analyze theimpact of these different strategies on the sys-tem’s overall performance and determine whichtechnique is the most effective. Extensive ex-periments demonstrate the efficacy of our ap-proach. For sub-task A, the system achievedan F1-score of 0.8613. The source code to re-produce the proposed solutions is available onGithub

pdf abs
tmn at SemEval-2023 Task 9: Multilingual Tweet Intimacy Detection Using XLM-T, Google Translate, and Ensemble Learning
Anna Glazkova

The paper describes a transformer-based system designed for SemEval-2023 Task 9: Multilingual Tweet Intimacy Analysis. The purpose of the task was to predict the intimacy of tweets in a range from 1 (not intimate at all) to 5 (very intimate). The official training set for the competition consisted of tweets in six languages (English, Spanish, Italian, Portuguese, French, and Chinese). The test set included the given six languages as well as external data with four languages not presented in the training set (Hindi, Arabic, Dutch, and Korean). We presented a solution based on an ensemble of XLM-T, a multilingual RoBERTa model adapted to the Twitter domain. To improve the performance on unseen languages, each tweet was supplemented by its English translation. We explored the effectiveness of translated data for the languages seen in fine-tuning compared to unseen languages and estimated strategies for using translated data in transformer-based models. Our solution ranked 4th on the leaderboard while achieving an overall Pearson’s r of 0.5989 over the test set. The proposed system improves up to 0.088 Pearson’s r over a score averaged across all 45 submissions.

pdf abs
JudithJeyafreeda at SemEval-2023 Task 10: Machine Learning for Explainable Detection of Online Sexism
Judith Jeyafreeda Andrew

The rise of the internet and social media platforms has brought about significant changes in how people interact with each another. For a lot of people, the internet have also become the only source of news and information about the world. Thus due to the increase in accessibility of information, online sexism has also increased. Efforts should be made to make the internet a safe space for everyone, irrespective of gender, both from a larger social norms perspective and legal or technical regulations to help alleviate online gender-based violence. As a part of this, this paper explores simple methods that can be easily deployed to automatically detect online sexism in textual statements.

We study the influence of different activation functions in the output layer of pre-trained transformer models for soft and hard label prediction in the learning with disagreement task. In this task, the goal is to quantify the amount of disagreement via predicting soft labels. To predict the soft labels, we use BERT-based preprocessors and encoders and vary the activation function used in the output layer, while keeping other parameters constant. The soft labels are then used for the hard label prediction. The activation functions considered are sigmoid as well as a step-function that is added to the model post-training and a sinusoidal activation function, which is introduced for the first time in this paper.

pdf abs
IXA/Cogcomp at SemEval-2023 Task 2: Context-enriched Multilingual Named Entity Recognition Using Knowledge Bases
Iker García-Ferrero | Jon Ander Campos | Oscar Sainz | Ander Salaberria | Dan Roth

Named Entity Recognition (NER) is a core natural language processing task in which pre-trained language models have shown remarkable performance. However, standard benchmarks like CoNLL 2003 do not address many of the challenges that deployed NER systems face, such as having to classify emerging or complex entities in a fine-grained way. In this paper we present a novel NER cascade approach comprising three steps: first, identifying candidate entities in the input sentence; second, linking the each candidate to an existing knowledge base; third, predicting the fine-grained category for each entity candidate. We empirically demonstrate the significance of external knowledge bases in accurately classifying fine-grained and emerging entities. Our system exhibits robust performance in the MultiCoNER2 shared task, even in the low-resource language setting where we leverage knowledge bases of high-resource languages.

pdf abs
ACCEPT at SemEval-2023 Task 3: An Ensemble-based Approach to Multilingual Framing Detection
Philipp Heinisch | Moritz Plenz | Anette Frank | Philipp Cimiano

This paper describes the system and experimental results of an ensemble-based approach tomultilingual framing detection for the submission of the ACCEPT team to the SemEval-2023 Task 3 on Framing Detection (Subtask 2). The approach is based on an ensemble that combines three different methods: a classifier based on large language models, a classifier based on static word embeddings, and an approach that uses external commonsense knowledge graphs, in particular, ConceptNet. The results of the three classification heads are aggregated into an overall prediction for each frame class. Our best submission yielded a micro F1-score of 50.69% (rank 10) and a macro F1-score of 50.20% (rank 3) for English articles. Our experimental results show that static word embeddings and knowledge graphs are useful components for frame detection, while the ensemble of all three methods combines the strengths of our three proposed methods. Through system ablations, we show that the commonsenseguided knowledge graphs are the outperforming method for many languages.

pdf abs
Noam Chomsky at SemEval-2023 Task 4: Hierarchical Similarity-aware Model for Human Value Detection
Sumire Honda | Sebastian Wilharm

This paper presents a hierarchical similarity-aware approach for the SemEval-2023 task 4 human value detection behind arguments using SBERT. The approach takes similarity score as an additional source of information between the input arguments and the lower level of labels in a human value hierarchical dataset. Our similarity-aware model improved the similarity-agnostic baseline model, especially showing a significant increase in or the value categories with lowest scores by the baseline model.

pdf abs
NLP-Titan at SemEval-2023 Task 6: Identification of Rhetorical Roles Using Sequential Sentence Classification
Harsh Kataria | Ambuje Gupta

The analysis of legal cases poses a considerable challenge for researchers, practitioners, and academicians due to the lengthy and intricate nature of these documents. Developing countries such as India are experiencing a significant increase in the number of pending legal cases, which are often unstructured and difficult to process using conventional methods. To address this issue, the authors have implemented a sequential sentence classification process, which categorizes legal documents into 13 segments, known as Rhetorical Roles. This approach enables the extraction of valuable insights from the various classes of the structured document. The performance of this approach was evaluated using the F1 score, which measures the model’s precision and recall. The authors’ approach achieved an F1 score of 0.83, which surpasses the baseline score of 0.79 established by the task organizers. The authors have combined sequential sentence classification and the SetFit method in a hierarchical manner by combining similar classes to achieve this score.

pdf abs
AdamR at SemEval-2023 Task 10: Solving the Class Imbalance Problem in Sexism Detection with Ensemble Learning
Adam Rydelek | Daryna Dementieva | Georg Groh

The Explainable Detection of Online Sexism task presents the problem of explainable sexism detection through fine-grained categorisation of sexist cases with three subtasks. Our team experimented with different ways to combat class imbalance throughout the tasks using data augmentation and loss alteration techniques. We tackled the challenge by utilising ensembles of Transformer models trained on different datasets, which are tested to find the balance between performance and interpretability. This solution ranked us in the top 40% of teams for each of the tracks.

pdf abs
I2C Huelva at SemEval-2023 Task 4: A Resampling and Transformers Approach to Identify Human Values behind Arguments
Nordin El Balima Cordero | Jacinto Mata Vázquez | Victoria Pachón Álvarez | Abel Pichardo Estevez

This paper presents the approaches proposedfor I2C Group to address the SemEval-2023Task 4: Identification of Human Values behindArguments (ValueEval)”, whose goal is to classify 20 different categories of human valuesgiven a textual argument. The dataset of thistask consists of one argument per line, including its unique argument ID, conclusion, stanceof the premise towards the conclusion and thepremise text. To indicate whether the argumentdraws or not on that category a binary indication (1 or 0) is included. Participants can submit approaches that detect one, multiple, or allof these values in arguments. The task providesan opportunity for researchers to explore theuse of automated techniques to identify humanvalues in text and has potential applications invarious domains such as social science, politics,and marketing. To deal with the imbalancedclass distribution given, our approach undersamples the data. Additionally, the three components of the argument (conclusion, stanceand premise) are used for training. The systemoutperformed the BERT baseline according toofficial evaluation metrics, achieving a f1 scoreof 0.46.

pdf abs
MLlab4CS at SemEval-2023 Task 2: Named Entity Recognition in Low-resource Language Bangla Using Multilingual Language Models
Shrimon Mukherjee | Madhusudan Ghosh | Girish | Partha Basuchowdhuri

Extracting of NERs from low-resource languages and recognizing their types is one of the important tasks in the entity extraction domain. Recently many studies have been conducted in this area of research. In our study, we introduce a system for identifying complex entities and recognizing their types from low-resource language Bangla, which was published in SemEval Task 2 MulitCoNER II 2023. For this sequence labeling task, we use a pre-trained language model built on a natural language processing framework. Our team name in this competition is MLlab4CS. Our model Muril produces a macro average F-score of 76.27%, which is a comparable result for this competition.

pdf abs
Kb at SemEval-2023 Task 3: On Multitask Hierarchical BERT Base Neural Network for Multi-label Persuasion Techniques Detection
Katarzyna Baraniak | M Sydow

This paper presents a solution for Semeval 2023 subtask3 of task 3: persuasion techniques in paragraphs detection. The aim of this task is to identify all persuasion techniques in each paragraph of a given news article. We use hierarchical multitask neural networks combined with transformers. Span detection is an auxiliary task that helps in the main task: identifying propaganda techniques. Our experiments show that if we change the index of BERT embedding from the first token of the whole input to the first token of the identified span, it can improve performance. Span and label detection can be performed using one network, so we save data and, when data is limited, we can use more of it for training.

The use of Natural Language Processing techniques in the legal domain has become established for supporting attorneys and domain experts in content retrieval and decision-making. However, understanding the legal text poses relevant challenges in the recognition of domain-specific entities and the adaptation and explanation of predictive models. This paper addresses the Legal Entity Name Recognition (L-NER) and Court judgment Prediction (CPJ) and Explanation (CJPE) tasks. The L-NER solution explores the use of various transformer-based models, including an entity-aware method attending domain-specific entities. The CJPE proposed method relies on hierarchical BERT-based classifiers combined with local input attribution explainers. We propose a broad comparison of eXplainable AI methodologies along with a novel approach based on NER. For the L-NER task, the experimental results remark on the importance of domain-specific pre-training. For CJP our lightweight solution shows performance in line with existing approaches, and our NER-boosted explanations show promising CJPE results in terms of the conciseness of the prediction explanations.

pdf abs
UO-LouTAL at SemEval-2023 Task 6: Lightweight Systems for Legal Processing
Sébastien Bosch | Louis Estève | Joanne Loo | Anne-Lyse Minard

This paper presents the work produced by students of the University of Orlans Masters in Natural Language Processing program by way of participating in SemEval Task 6, LegalEval, which aims to enhance the capabilities of legal professionals through automated systems. Two out of the three sub-tasks available – Rhetorical Role prediction (RR) and Legal Named Entity Recognition (L-NER) – were tackled, with the express intent of developing lightweight and interpretable systems. For the L-NER sub-task, a CRF model was trained, augmented with post-processing rules for some named entity types. A macro F1 score of 0.74 was obtained on the DEV set, and 0.64 on the evaluation set. As for the RR sub-task, two sentence classification systems were built: one based on the Bag-of-Words technique with L-NER system output integrated, the other using a sentence-transformer approach. Rule-based post-processing then converted the results of the sentence classification systems into RR predictions. The better-performing Bag-of-Words system obtained a macro F1 score of 0.49 on the DEV set and 0.57 on the evaluation set.

pdf abs
NLP-LTU at SemEval-2023 Task 10: The Impact of Data Augmentation and Semi-Supervised Learning Techniques on Text Classification Performance on an Imbalanced Dataset
Sana Al-Azzawi | György Kovács | Filip Nilsson | Tosin Adewumi | Marcus Liwicki

In this paper, we propose a methodology fortask 10 of SemEval23, focusing on detectingand classifying online sexism in social me-dia posts. The task is tackling a serious is-sue, as detecting harmful content on socialmedia platforms is crucial for mitigating theharm of these posts on users. Our solutionfor this task is based on an ensemble of fine-tuned transformer-based models (BERTweet,RoBERTa, and DeBERTa). To alleviate prob-lems related to class imbalance, and to improvethe generalization capability of our model, wealso experiment with data augmentation andsemi-supervised learning. In particular, fordata augmentation, we use back-translation, ei-ther on all classes, or on the underrepresentedclasses only. We analyze the impact of thesestrategies on the overall performance of thepipeline through extensive experiments. whilefor semi-supervised learning, we found thatwith a substantial amount of unlabelled, in-domain data available, semi-supervised learn-ing can enhance the performance of certainmodels. Our proposed method (for which thesource code is available on Github12) attainsan F 1-score of 0.8613 for sub-taskA, whichranked us 10th in the competition.

pdf abs
John-Arthur at SemEval-2023 Task 4: Fine-Tuning Large Language Models for Arguments Classification
Georgios Balikas

This paper presents the system submissions of the John-Arthur team to the SemEval Task 4 “ValueEval: Identification of Human Values behind Arguments”. The best system of the team was ranked 3rd and the overall rank of the team was 2nd (the first team had the two best systems). John-Arthur team models the ValueEval problem as a multi-class, multi-label text classification problem. The solutions leverage recently proposed large language models that are fine-tuned on the provided datasets. To boost the achieved performance we employ different best practises whose impact on the model performance we evaluate here. The code ispublicly available at github and the model onHuggingface hub.

pdf abs
NAP at SemEval-2023 Task 3: Is Less Really More? (Back-)Translation as Data Augmentation Strategies for Detecting Persuasion Techniques
Neele Falk | Annerose Eichel | Prisca Piccirilli

Persuasion techniques detection in news in a multi-lingual setup is non-trivial and comes with challenges, including little training data. Our system successfully leverages (back-)translation as data augmentation strategies with multi-lingual transformer models for the task of detecting persuasion techniques. The automatic and human evaluation of our augmented data allows us to explore whether (back-)translation aid or hinder performance. Our in-depth analyses indicate that both data augmentation strategies boost performance; however, balancing human-produced and machine-generated data seems to be crucial.

pdf abs
PoliTo at SemEval-2023 Task 1: CLIP-based Visual-Word Sense Disambiguation Based on Back-Translation
Lorenzo Vaiani | Luca Cagliero | Paolo Garza

Visual-Word Sense Disambiguation (V-WSD) entails resolving the linguistic ambiguity in a text by selecting a clarifying image from a set of (potentially misleading) candidates. In this paper, we address V-WSD using a state-of-the-art Image-Text Retrieval system, namely CLIP. We propose to alleviate the linguistic ambiguity across multiple domains and languages via text and image augmentation. To augment the textual content we rely on back-translation with the aid of a variety of auxiliary languages. The approach based on finetuning CLIP on the full phrases is effective in accurately disambiguating words and incorporating back-translation enhance the system’s robustness and performance on the test samples written in Indo-European languages.

pdf abs
FMI-SU at SemEval-2023 Task 7: Two-level Entailment Classification of Clinical Trials Enhanced by Contextual Data Augmentation
Sylvia Vassileva | Georgi Grazhdanski | Svetla Boytcheva | Ivan Koychev

The paper presents an approach for solving SemEval 2023 Task 7 - identifying the inference relation in a clinical trials dataset. The system has two levels for retrieving relevant clinical trial evidence for a statement and then classifying the inference relation based on the relevant sentences. In the first level, the system classifies the evidence-statement pairs as relevant or not using a BERT-based classifier and contextual data augmentation (subtask 2). Using the relevant parts of the clinical trial from the first level, the system uses an additional BERT-based classifier to determine whether the relation is entailment or contradiction (subtask 1). In both levels, the contextual data augmentation is showing a significant improvement in the F1 score on the test set of 3.7% for subtask 2 and 7.6% for subtask 1, achieving final F1 scores of 82.7% for subtask 2 and 64.4% for subtask 1.

pdf abs
ML Mob at SemEval-2023 Task 1: Probing CLIP on Visual Word-Sense Disambiguation
Clifton Poth | Martin Hentschel | Tobias Werner | Hannah Sterz | Leonard Bongard

Successful word sense disambiguation (WSD)is a fundamental element of natural languageunderstanding. As part of SemEval-2023 Task1, we investigate WSD in a multimodal setting,where ambiguous words are to be matched withcandidate images representing word senses. Wecompare multiple systems based on pre-trainedCLIP models. In our experiments, we findCLIP to have solid zero-shot performance onmonolingual and multilingual data. By em-ploying different fine-tuning techniques, we areable to further enhance performance. However,transferring knowledge between data distribu-tions proves to be more challenging.

pdf abs
Alexander Knox at SemEval-2023 Task 5: The comparison of prompting and standard fine-tuning techniques for selecting the type of spoiler needed to neutralize a clickbait
Mateusz Woźny | Mateusz Lango

Clickbait posts are a common problem on social media platforms, as they often deceive users by providing misleading or sensational headlines that do not match the content of the linked web page. The aim of this study is to create a technique for identifying the specific type of suitable spoiler - be it a phrase, a passage, or a multipart spoiler - needed to neutralize clickbait posts. This is achieved by developing a machine learning classifier analyzing both the clickbait post and the linked web page. Modern approaches for constructing a text classifier usually rely on fine-tuning a transformer-based model pre-trained on large unsupervised corpora. However, recent advances in the development of large-scale language models have led to the emergence of a new transfer learning paradigm based on prompt engineering. In this work, we study these two transfer learning techniques and compare their effectiveness for clickbait spoiler-type detection task. Our experimental results show that for this task, using the standard fine-tuning method gives better results than using prompting. The best model can achieve a similar performance to that presented by Hagen et al. (2022).

pdf abs
hhuEDOS at SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS) Binary Sexism Detection (Subtask A)
Wiebke Petersen | Diem-Ly Tran | Marion Wroblewitz

In this paper, we describe SemEval-2023 Task 10, a shared task on detecting and predicting sexist language. The dataset consists of labeled sexist and non-sexist data targeted towards women acquired from both Reddit and Gab. We present and compare several approaches we experimented with and our final submitted model. Additional error analysis is given to recognize challenges we dealt with in our process. A total of 84 teams participated. Our model ranks 55th overall in Subtask A of the shared task.

pdf abs
Rutgers Multimedia Image Processing Lab at SemEval-2023 Task-1: Text-Augmentation-based Approach for Visual Word Sense Disambiguation
Keyi Li | Sen Yang | Chenyang Gao | Ivan Marsic

This paper describes our system used in SemEval-2023 Task-1: Visual Word Sense Disambiguation (VWSD). The VWSD task is to identify the correct image that corresponds to an ambiguous target word given limited textual context. To reduce word ambiguity and enhance image selection, we proposed several text augmentation techniques, such as prompting, WordNet synonyms, and text generation. We experimented with different vision-language pre-trained models to capture the joint features of the augmented text and image. Our approach achieved the best performance using a combination of GPT-3 text generation and the CLIP model. On the multilingual test sets, our system achieved an average hit rate (at top-1) of 51.11 and a mean reciprocal rank of 65.69.

pdf abs
Uppsala University at SemEval-2023 Task12: Zero-shot Sentiment Classification for Nigerian Pidgin Tweets
Annika Kniele | Meriem Beloucif

While sentiment classification has been considered a practically solved task for high-resource languages such as English, the scarcity of data for many languages still makes it a challenging task. The AfriSenti-SemEval shared task aims to classify sentiment on Twitter data for 14 low-resource African languages. In our participation, we focus on Nigerian Pidgin as the target language. We have investigated the effect of English monolingual and multilingual pre-trained models on the sentiment classification task for Nigerian Pidgin. Our setup includes zero-shot models (using English, Igbo and Hausa data) and a Nigerian Pidgin fine-tuned model. Our results show that English fine-tuned models perform slightly better than models fine-tuned on other Nigerian languages, which could be explained by the lexical and structural closeness between Nigerian Pidgin and English. The best results were reported on the monolingual Nigerian Pidgin data. The model pre-trained on English and fine-tuned on Nigerian Pidgin was submitted to Task A Track 4 of the AfriSenti-SemEval Shared Task 12, and scored 25 out of 32 in the ranking.

pdf abs
KDDIE at SemEval-2023 Task 2: External Knowledge Injection for Named Entity Recognition
Caleb Martin | Huichen Yang | William Hsu

This paper introduces our system for the SemEval 2023 Task 2: Multilingual Complex Named Entity Recognition (MultiCoNER II) competition. Our team focused on the sub-task of Named Entity Recognition (NER) for the language of English in the challenge and reported our results. To achieve our goal, we utilized transfer learning by fine-tuning pre-trained language models (PLMs) on the competition dataset. Our approach involved combining a BERT-based PLM with external knowledge to provide additional context to the model. In this report, we present our findings and results.

pdf abs
Bhattacharya_Lab at SemEval-2023 Task 12: A Transformer-based Language Model for Sentiment Classification for Low Resource African Languages: Nigerian Pidgin and Yoruba
Nathaniel Hughes | Kevan Baker | Aditya Singh | Aryavardhan Singh | Tharalillah Dauda | Sutanu Bhattacharya

Sentiment Analysis is an aspect of natural languageprocessing (NLP) that has been a topicof research. While most studies focus on highresourcelanguages with an extensive amountof available data, the study on low-resource languageswith insufficient data needs attention. To address this issue, we propose a transformerbasedmethod for sentiment analysis for lowresourcesAfrican languages, Nigerian Pidginand Yoruba. To evaluate the effectiveness ofour multilingual language models for monolingualsentiment classification, we participated inthe AfriSenti SemEval shared task 2023 competition. On the official e valuation s et, ourgroup (named as Bhattacharya_Lab) ranked1 out of 33 participating groups in the MonolingualSentiment Classification task (i.e., TaskA) for Nigerian Pidgin (i.e., Track 4), and inthe Top 5 among 33 participating groups inthe Monolingual Sentiment Classification taskfor Yoruba (i.e., Track 2) respectively, demonstratingthe potential for our transformer-basedlanguage models to improve sentiment analysisin low-resource languages. Overall, ourstudy highlights the importance of exploringthe potential of NLP in low-resource languagesand the impact of transformer-based multilinguallanguage models in sentiment analysis forthe low-resource African languages, NigerianPidgin and Yoruba.

pdf abs
Seals_Lab at SemEval-2023 Task 12: Sentiment Analysis for Low-resource African Languages, Hausa and Igbo
Nilanjana Raychawdhary | Amit Das | Gerry Dozier | Cheryl D. Seals

One of the most extensively researched applications in natural language processing (NLP) is sentiment analysis. While the majority of the study focuses on high-resource languages (e.g., English), this research will focus on low-resource African languages namely Igbo and Hausa. The annotated tweets of both languages have a significant number of code-mixed tweets. The curated datasets necessary to build complex AI applications are not available for the majority of African languages. To optimize the use of such datasets, research is needed to determine the viability of present NLP procedures as well as the development of novel techniques. This paper outlines our efforts to develop a sentiment analysis (for positive and negative as well as neutral) system for tweets from the Hausa, and Igbo languages. Sentiment analysis can computationally analyze and discover sentiments in a text or document. We worked on the first thorough compilation of AfriSenti-SemEval 2023 Shared Task 12 Twitter datasets that are human-annotated for the most widely spoken languages in Nigeria, such as Hausa and Igbo. Here we trained the modern pre-trained language model AfriBERTa large on the AfriSenti-SemEval Shared Task 12 Twitter dataset to create sentiment classification. In particular, the results demonstrate that our model trained on AfriSenti-SemEval Shared Task 12 datasets and produced with an F1 score of 80.85% for Hausa and 80.82% for Igbo languages on the sentiment analysis test. In AfriSenti-SemEval 2023 shared task 12 (Task A), we consistently ranked top 10 by achieving a mean F1 score of more than 80% for both the Hausa and Igbo languages.

pdf abs
FIT BUT at SemEval-2023 Task 12: Sentiment Without Borders - Multilingual Domain Adaptation for Low-Resource Sentiment Classification
Maksim Aparovich | Santosh Kesiraju | Aneta Dufkova | Pavel Smrz

This paper presents our proposed method for SemEval-2023 Task 12, which focuses on sentiment analysis for low-resource African languages. Our method utilizes a language-centric domain adaptation approach which is based on adversarial training, where a small version of Afro-XLM-Roberta serves as a generator model and a feed-forward network as a discriminator. We participated in all three subtasks: monolingual (12 tracks), multilingual (1 track), and zero-shot (2 tracks). Our results show an improvement in weighted F1 for 13 out of 15 tracks with a maximum increase of 4.3 points for Moroccan Arabic compared to the baseline. We observed that using language family-based labels along with sequence-level input representations for the discriminator model improves the quality of the cross-lingual sentiment analysis for the languages unseen during the training. Additionally, our experimental results suggest that training the system on languages that are close in a language families tree enhances the quality of sentiment analysis for low-resource languages. Lastly, the computational complexity of the prediction step was kept at the same level which makes the approach to be interesting from a practical perspective. The code of the approach can be found in our repository.

pdf abs
WKU_NLP at SemEval-2023 Task 9: Translation Augmented Multilingual Tweet Intimacy Analysis
Qinyuan Zheng

This paper describes a system for the SemEval 2023 Task 9: Multilingual Tweet Intimacy Analysis. This system consists of a pretrained multilingual masked language model as a text encoder and a neural network as a regression model. Data augmentation based on neural machine translation models is adopted to improve model performance under the low-resource scenario. This system is further improved through the ensemble of multiple models with the best performance in each language. This system ranks 4th in languages unseen in the training data and 16th in languages seen in the training data. The code and data can be found in this link: https://github.com/Cloudy0219/Multilingual.

pdf abs
PanwarJayant at SemEval-2023 Task 10: Exploring the Effectiveness of Conventional Machine Learning Techniques for Online Sexism Detection
Jayant Panwar | Radhika Mamidi

The rapid growth of online communication using social media platforms has led to an increase in the presence of hate speech, especially in terms of sexist language online. The proliferation of such hate speech has a significant impact on the mental health and well-being of the users and hence the need for automated systems to detect and filter such texts. In this study, we explore the effectiveness of conventional machine learning techniques for detecting sexist text. We explore five conventional classifiers, namely, Logistic Regression, Decision Tree, XGBoost, Support Vector Machines, and Random Forest. The results show that different classifiers perform differently on each task due to their different inherent architectures which may be suited to a certain problem more. These models are trained on the shared task dataset, which includes both sexist and non-sexist texts. All in all, this study explores the potential of conventional machine learning techniques in detecting online sexist content. The results of this study highlight the strengths and weaknesses of all classifiers with respect to all subtasks. The results of this study will be useful for researchers and practitioners interested in developing systems for detecting or filtering online hate speech.

pdf abs
DN at SemEval-2023 Task 12: Low-Resource Language Text Classification via Multilingual Pretrained Language Model Fine-tuning
Daniil Homskiy | Narek Maloyan

In our work, a model is implemented that solves the task, based on multilingual pre-trained models. We also consider various methods of data preprocessing

pdf abs
Billie-Newman at SemEval-2023 Task 5: Clickbait Classification and Question Answering with Pre-Trained Language Models, Named Entity Recognition and Rule-Based Approaches
Andreas Kruff | Anh Huy Tran

In this paper, we describe the implementations of our systems for the SemEval-2023 Task 5 ‘Clickbait Spoiling’, which involves the classification of clickbait posts in sub-task 1 and the spoiler generation and question answering of clickbait posts in sub-task 2, ultimately achieving a balanced accuracy of 0.593 and a BLEU score of 0.322 on the test datasets in sub-task 1 and sub-task 2 respectively. For this, we propose the usage of RoBERTa transformer models and modify them for each specific downstream task. In sub-task 1, we use the pre-trained RoBERTa model and use it in conjunction with NER, a spoiler-title ratio, a regex check for enumerations and lists and the use of input reformulation. In sub-task 2, we propose the usage of the RoBERTa-SQuAD2.0 model for extractive question answering in combination with a contextual rule-based approach for multi-type spoilers in order to generate spoiler answers.

Nowadays, persuasive messages are more and more frequent in social networks, which generates great concern in several communities, given that persuasion seeks to guide others towards the adoption of ideas, attitudes or actions that they consider to be beneficial to themselves. The efficient detection of news genre categories, detection of framing and detection of persuasion techniques requires several scientific disciplines, such as computational linguistics and sociology. Here we illustrate how we use lexical features given a news article, determine whether it is an opinion piece, aims to report factual news, or is satire. This paper presents a novel strategy for news based on Lexical Weirdness. The results are part of our participation in subtasks 1 and 2 in SemEval 2023 Task 3.

pdf abs
CLaC at SemEval-2023 Task 2: Comparing Span-Prediction and Sequence-Labeling Approaches for NER
Harsh Verma | Sabine Bergler

This paper summarizes the CLaC submission for the MultiCoNER 2 task which concerns the recognition of complex, fine-grained named entities. We compare two popular approaches for NER, namely SequenceLabeling and Span Prediction. We find that our best Span Prediction system performs slightly better than our best Sequence Labeling system on test data. Moreover, we find that using the larger version of XLM RoBERTa significantly improves performance. Post-competition experiments show that Span Prediction and Sequence Labeling approaches improve when they use special input tokens ([s] and [/s]) of XLM-RoBERTa. The code for training all models, preprocessing, and post-processing is available at https://github.com/harshshredding/semeval2023-multiconer-paper.

pdf abs
CL-UZH at SemEval-2023 Task 10: Sexism Detection through Incremental Fine-Tuning and Multi-Task Learning with Label Descriptions
Janis Goldzycher

The widespread popularity of social media has led to an increase in hateful, abusive, and sexist language, motivating methods for the automatic detection of such phenomena. The goal of the SemEval shared task Towards Explainable Detection of Online Sexism (EDOS 2023) is to detect sexism in English social media posts (subtask A), and to categorize such posts into four coarse-grained sexism categories (subtask B), and eleven fine-grained subcategories (subtask C). In this paper, we present our submitted systems for all three subtasks, based on a multi-task model that has been fine-tuned on a range of related tasks and datasets before being fine-tuned on the specific EDOS subtasks. We implement multi-task learning by formulating each task as binary pairwise text classification, where the dataset and label descriptions are given along with the input text. The results show clear improvements over a fine-tuned DeBERTa-V3 serving as a baseline leading to F1-scores of 85.9% in subtask A (rank 13/84), 64.8% in subtask B (rank 19/69), and 44.9% in subtask C (26/63).

pdf abs
LCT-1 at SemEval-2023 Task 10: Pre-training and Multi-task Learning for Sexism Detection and Classification
Konstantin Chernyshev | Ekaterina Garanina | Duygu Bayram | Qiankun Zheng | Lukas Edman

Misogyny and sexism are growing problems in social media. Advances have been made in online sexism detection but the systems are often uninterpretable. SemEval-2023 Task 10 on Explainable Detection of Online Sexism aims at increasing explainability of the sexism detection, and our team participated in all the proposed subtasks. Our system is based on further domain-adaptive pre-training. Building on the Transformer-based models with the domain adaptation, we compare fine-tuning with multi-task learning and show that each subtask requires a different system configuration. In our experiments, multi-task learning performs on par with standard fine-tuning for sexism detection and noticeably better for coarse-grained sexism classification, while fine-tuning is preferable for fine-grained classification.

pdf abs
DSHacker at SemEval-2023 Task 3: Genres and Persuasion Techniques Detection with Multilingual Data Augmentation through Machine Translation and Text Generation
Arkadiusz Modzelewski | Witold Sosnowski | Magdalena Wilczynska | Adam Wierzbicki

In our article, we present the systems developed for SemEval-2023 Task 3, which aimed to evaluate the ability of Natural Language Processing (NLP) systems to detect genres and persuasion techniques in multiple languages. We experimented with several data augmentation techniques, including machine translation (MT) and text generation. For genre detection, synthetic texts for each class were created using the OpenAI GPT-3 Davinci language model. In contrast, to detect persuasion techniques, we relied on augmenting the dataset through text translation using the DeepL translator. Fine-tuning the models using augmented data resulted in a top-ten ranking across all languages, indicating the effectiveness of the approach. The models for genre detection demonstrated excellent performance, securing the first, second, and third positions in Spanish, German, and Italian, respectively. Moreover, one of the models for persuasion techniques’ detection secured the third position in Polish. Our contribution constitutes the system architecture that utilizes DeepL and GPT-3 for data augmentation for the purpose of detecting both genre and persuasion techniques.

pdf abs
GPL at SemEval-2023 Task 1: WordNet and CLIP to Disambiguate Images
Shibingfeng Zhang | Shantanu Nath | Davide Mazzaccara

Given a word in context, the task of VisualWord Sense Disambiguation consists of select-ing the correct image among a set of candidates. To select the correct image, we propose a so-lution blending text augmentation and multi-modal models. Text augmentation leverages thefine-grained semantic annotation from Word-Net to get a better representation of the tex-tual component. We then compare this sense-augmented text to the set of image using pre-trained multimodal models CLIP and ViLT. Oursystem has been ranked 16th for the Englishlanguage, achieving 68.5 points for hit rate and79.2 for mean reciprocal rank.

pdf abs
Clemson NLP at SemEval-2023 Task 7: Applying GatorTron to Multi-Evidence Clinical NLI
Ahamed Alameldin | Ashton Williamson

This paper presents our system descriptions for SemEval 2023-Task 7: Multi-evidence Natural Language Inference for Clinical Trial Data sub-tasks one and two. Provided with a collection of Clinical Trial Reports (CTRs) and corresponding expert-annotated claim statements, sub-task one involves determining an inferential relationship between the statement and CTR premise: contradiction or entailment. Sub-task two involves retrieving evidence from the CTR which is necessary to determine the entailment in sub-task one. For sub-task two we employ a recent transformer-based language model pretrained on biomedical literature, which we domain-adapt on a set of clinical trial reports. For sub-task one, we take an ensemble approach in which we leverage the evidence retrieval model from sub-task two to extract relevant sections, which are then passed to a second model of equivalent architecture to determine entailment. Our system achieves a ranking of seventh on sub-task one with an F1-score of 0.705 and sixth on sub-task two with an F1-score of 0.806. In addition, we find that the high rate of success of language models on this dataset may be partially attributable to the existence of annotation artifacts.

In this paper, we describe the multi strategy system for SemEval-2022 Task 7, This task aims to determine whether a given statement is supported by one or two Clinical Trial reports, and to identify evidence that supports the statement. This is a task that requires high natural language inference capabilities. In Subtask 1, we compare our strategy based on prompt learning and ChatGPT with a baseline constructed using BERT in zero-shot setting, and validate the effectiveness of our strategy. In Subtask 2, we fine-tune DeBERTaV3 for classification without relying on the results from Subtask 1, and we observe that early stopping can effectively prevent model overfitting, which performs well in Subtask 2. In addition, we did not use any ensemble strategies. Ultimately, we achieved the 10th place in Subtask 1 and the 2nd place in Subtask 2.

pdf abs
Quintilian at SemEval-2023 Task 4: Grouped BERT for Multi-Label Classification
Ajay Narasimha Mopidevi | Hemanth Chenna

In this paper, we initially discuss about the ValueEval task and the challenges involved in multi-label classification tasks. We tried to approach this task using Natural Language Inference and proposed a Grouped-BERT architecture which leverages commonality between the classes for a multi-label classification tasks.

pdf abs
CLaC at SemEval-2023 Task 3: Language Potluck RoBERTa Detects Online Persuasion Techniques in a Multilingual Setup
Nelson Filipe Costa | Bryce Hamilton | Leila Kosseim

This paper presents our approach to the SemEval-2023 Task 3 to detect online persuasion techniques in a multilingual setup. Our classification system is based on the RoBERTa-base model trained predominantly on English to label the persuasion techniques across 9 different languages. Our system was able to significantly surpass the baseline performance in 3 of the 9 languages: English, Georgian and Greek. However, our wrong assumption that a single classification system trained predominantly on English could generalize well to other languages, negatively impacted our scores on the other 6 languages. In this paper, we provide a description of the reasoning behind the development of our final model and what conclusions may be drawn from its performance for future work.

pdf abs
YNUNLP at SemEval-2023 Task 2: The Pseudo Twin Tower Pre-training Model for Chinese Named Entity Recognition
Jing Li | Xiaobing Zhou

This paper introduces our method in the system for SemEval 2023 Task 2: MultiCoNER II Multilingual Complex Named Entity Recognition, Track 9-Chinese. This task focuses on detecting fine-grained named entities whose data set has a fine-grained taxonomy of 36 NE classes, representing a realistic challenge for NER. In this task, we need to identify entity boundaries and category labels for the six identified categories. We use BERT embedding to represent each character in the original sentence and train CRF-Rdrop to predict named entity categories using the data set provided by the organizer. Our best submission, with a macro average F1 score of 0.5657, ranked 15th out of 22 teams.

pdf abs
Mr-wallace at SemEval-2023 Task 5: Novel Clickbait Spoiling Algorithm Using Natural Language Processing
Vineet Saravanan | Steven Wilson

This paper presents a model for clickbait spoiling,which aims at generating short texts that satisfy thecuriosity induced by a clickbait post. The modelis split into two tasks: identifying the clickbaittype and spoiling the clickbait. The first task isto classify the spoiler type that the clickbait postwarrants, and the second task is to generate thespoiler for the clickbait post. The model utilizesthe Distilbert-base-uncased model for the first taskand the Bert-base-uncased model for the secondtask. The trained model is optimized through trialand error on different model selections, and hyper-parameters and results are presented in a confusionmatrix. The main reason we utilized Distilbert-base-uncased is that it analyzes words in the con-text of what’s around it. The objective of this modelis to save readers time and spoil the clickbait of dif-ferent articles they may see on different platformslike Twitter and Reddit

pdf abs
I2R at SemEval-2023 Task 7: Explanations-driven Ensemble Approach for Natural Language Inference over Clinical Trial Data
Saravanan Rajamanickam | Kanagasabai Rajaraman

In this paper, we describe our system for SemEval-2023 Task 7: Multi-evidence Natural Language Inference for Clinical Trial Data. Given a CTR premise, and a statement, this task involves 2 sub-tasks (i) identifying the inference relation between CTR - statement pairs (Task 1: Textual Entailment), and (ii) extracting a set of supporting facts, from the premise, to justify the label predicted in Task 1 (Task 2: Evidence Retrieval). We adopt an explanations driven NLI approach to tackle the tasks. Given a statement to verify, the idea is to first identify relevant evidence from the target CTR(s), perform evidence level inferences and then ensemble them to arrive at the final inference. We have experimented with various BERT based models and T5 models. Our final model uses T5 base that achieved better performance compared to BERT models. In summary, our system achieves F1 score of 70.1% for Task 1 and 80.2% for Task 2. We ranked 8th respectively under both the tasks. Moreover, ours was one of the 5 systems that ranked within the Top 10 under both tasks.

pdf abs
NLUBot101 at SemEval-2023 Task 3: An Augmented Multilingual NLI Approach Towards Online News Persuasion Techniques Detection
Genglin Liu | Yi Fung | Heng Ji

We describe our submission to SemEval 2023 Task 3, specifically the subtask on persuasion technique detection. In this work, our team NLUBot101 tackled a novel task of classifying persuasion techniques in online news articles at a paragraph level. The low-resource multilingual datasets, along with the imbalanced label distribution, make this task challenging. Our team presented a cross-lingual data augmentation approach and leveraged a recently proposed multilingual natural language inference model to address these challenges. Our solution achieves the highest macro-F1 score for the English task, and top 5 micro-F1 scores on both the English and Russian leaderboards.

pdf abs
Alexa at SemEval-2023 Task 10: Ensemble Modeling of DeBERTa and BERT Variations for Identifying Sexist Text
Mutaz Younes | Ali Kharabsheh | Mohammad Bani Younes

This study presents an ensemble approach for detecting sexist text in the context of the Semeval-2023 task 10. Our approach leverages 18 models, including DeBERTa-v3-base models with different input sequence lengths, a BERT-based model trained on identifying hate speech, and three more models pre-trained on the task’s unlabeled data with varying input lengths. The results of our framework on the development set show an f1-score of 84.92% and on the testing set 84.55%, effectively demonstrating the strength of the ensemble approach in getting accurate results.

pdf abs
Gallagher at SemEval-2023 Task 5: Tackling Clickbait with Seq2Seq Models
Tugay Bilgis | Nimet Beyza Bozdag | Steven Bethard

This paper presents the systems and approaches of the Gallagher team for the SemEval-2023 Task 5: Clickbait Spoiling. We propose a method to classify the type of spoiler (phrase, passage, multi) and a question-answering method to generate spoilers that satisfy the curiosity caused by clickbait posts. We experiment with the state-of-the-art Seq2Seq model T5. To identify the spoiler types we used a fine-tuned T5 classifier (Subtask 1). A mixture of T5 and Flan-T5 was used to generate the spoilers for clickbait posts (Subtask 2). Our system officially ranks first in generating phrase type spoilers in Subtask 2, and achieves the highest precision score for passage type spoilers in Subtask 1.

pdf abs
Arizonans at SemEval-2023 Task 9: Multilingual Tweet Intimacy Analysis with XLM-T
Nimet Beyza Bozdag | Tugay Bilgis | Steven Bethard

This paper presents the systems and approaches of the Arizonans team for the SemEval 2023 Task 9: Multilingual Tweet Intimacy Analysis. We finetune the Multilingual RoBERTa model trained with about 200M tweets, XLM-T. Our final model ranked 9th out of 45 overall, 13th in seen languages, and 8th in unseen languages.

There are two competing approaches for modelling annotator disagreement: distributional soft-labelling approaches (which aim to capture the level of disagreement) or modelling perspectives of individual annotators or groups thereof. We adapt a multi-task architecture which has previously shown success in modelling perspectives to evaluate its performance on the SEMEVAL Task 11. We do so by combining both approaches, i.e. predicting individual annotator perspectives as an interim step towards predicting annotator disagreement. Despite its previous success, we found that a multi-task approach performed poorly on datasets which contained distinct annotator opinions, suggesting that this approach may not always be suitable when modelling perspectives. Furthermore, our results explain that while strongly perspectivist approaches might not achieve state-of-the-art performance according to evaluation metrics used by distributional approaches, our approach allows for a more nuanced understanding of individual perspectives present in the data. We argue that perspectivist approaches are preferable because they enable decision makers to amplify minority views, and that it is important to re-evaluate metrics to reflect this goal.

pdf abs
Chride at SemEval-2023 Task 10: Fine-tuned Deberta-V3 on Detection of Online Sexism with Hierarchical Loss
Letian Peng | Bosung Kim

Sexism is one of the most concerning problems in the internet society. By detecting sexist expressions, we can reduce the offense toward females and provide useful information to understand how sexism occurs. Our work focuses on a newly-published dataset, EDOS, which annotates English sexist expressions from Reddit and categorizes their specific types. Our method is to train a DeBERTaV3 classifier with all three kinds of labels provided by the dataset, including sexist, category, and granular vectors. Our classifier predicts the probability distribution on vector labels and further applies it to represent category and sexist distributions. Our classifier uses its label and finer-grained labels for each classification to calculate the hierarchical loss for optimization. Our experiments and analyses show that using a combination of loss with finer-grained labels generally achieves better performance on sexism detection and categorization. Codes for our implementation can be found at https://github.com/KomeijiForce/SemEval2023_Task10.

pdf abs
ODA_SRIB at SemEval-2023 Task 9: A Multimodal Approach for Improved Intimacy Analysis
Priyanshu Kumar | Amit Kumar | Jiban Prakash | Prabhat Lamba | Irfan Abdul

We experiment with XLM-Twitter and XLM-RoBERTa models to predict the intimacy scores in Tweets i.e. the extent to which a Tweet contains intimate content. We propose a Transformer-TabNet based multimodal architecture using text data and statistical features from the text, which performs better than the vanilla Transformer based model. We further experiment with Adversarial Weight Perturbation to make our models generalized and robust. The ensemble of four of our best models achieve an over-all Pearson Coefficient of 0.5893 on the test dataset.

The NLI4CT task aims to entail hypotheses based on Clinical Trial Reports (CTRs) and retrieve the corresponding evidence supporting the justification. This task poses a significant challenge, as verifying hypotheses in the NLI4CT task requires the integration of multiple pieces of evidence from one or two CTR(s) and the application of diverse levels of reasoning, including textual and numerical. To address these problems, we present a multi-granularity system for CTR-based textual entailment and evidence retrieval in this paper. Specifically, we construct a Multi-granularity Inference Network (MGNet) that exploits sentence-level and token-level encoding to handle both textual entailment and evidence retrieval tasks. Moreover, we enhance the numerical inference capability of the system by leveraging a T5-based model, SciFive, which is pre-trained on the medical corpus. Model ensembling and a joint inference method are further utilized in the system to increase the stability and consistency of inference. The system achieves f1-scores of 0.856 and 0.853 on textual entailment and evidence retrieval tasks, resulting in the best performance on both subtasks. The experimental results corroborate the effectiveness of our proposed method.

pdf abs
iREL at SemEval-2023 Task 10: Multi-level Training for Explainable Detection of Online Sexism
Nirmal Manoj | Sagar Joshi | Ankita Maity | Vasudeva Varma

This paper describes our approach for SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS). The task deals with identification and categorization of sexist content into fine-grained categories for explainability in sexism classification. The explainable categorization is proposed through a set of three hierarchical tasks that constitute a taxonomy of sexist content, each task being more granular than the former for categorization of the content. Our team (iREL) participated in all three hierarchical subtasks. Considering the inter-connected task structure, we study multilevel training to study the transfer learning from coarser to finer tasks. Our experiments based on pretrained transformer architectures also make use of additional strategies such as domain-adaptive pretraining to adapt our models to the nature of the content dealt with, and use of the focal loss objective for handling class imbalances. Our best-performing systems on the three tasks achieve macro-F1 scores of 85.93, 69.96 and 54.62 on their respective validation sets.

pdf abs
DuluthNLP at SemEval-2023 Task 12: AfriSenti-SemEval: Sentiment Analysis for Low-resource African Languages using Twitter Dataset
Samuel Akrah | Ted Pedersen

This paper describes the DuluthNLP system that participated in Task 12 of SemEval-2023 on AfriSenti-SemEval: Sentiment Analysis for Low-resource African Languages using Twitter Dataset. Given a set of tweets, the task requires participating systems to classify each tweet as negative, positive or neutral. We evaluate a range of monolingual and multilingual pretrained models on the Twi language dataset, one among the 14 African languages included in the SemEval task. We introduce TwiBERT, a new pretrained model trained from scratch. We show that TwiBERT, along with mBERT, generally perform best when trained on the Twi dataset, achieving an F1 score of 64.29% on the official evaluation test data, which ranks 14 out of 30 of the total submissions for Track 10. The TwiBERT model is released at https://huggingface.co/sakrah/TwiBERT

This paper explains the participation of team Hitachi to SemEval-2023 Task 3 “Detecting the genre, the framing, and the persuasion techniques in online news in a multi-lingual setup.” Based on the multilingual, multi-task nature of the task and the low-resource setting, we investigated different cross-lingual and multi-task strategies for training the pretrained language models. Through extensive experiments, we found that (a) cross-lingual/multi-task training, and (b) collecting an external balanced dataset, can benefit the genre and framing detection. We constructed ensemble models from the results and achieved the highest macro-averaged F1 scores in Italian and Russian genre categorization subtasks.

pdf abs
nancy-hicks-gribble at SemEval-2023 Task 5: Classifying and generating clickbait spoilers with RoBERTa
Jüri Keller | Nicolas Rehbach | Ibrahim Zafar

Clickbait spoiling and spoiler type classification in the setting of the SemEval2023 shared task five was used to explore transformer based text classification in comparison to conventional, shallow learned classifying models. Additionally, an initial model for spoiler creation was explored. The task was to classify or create spoilers for clickbait social media posts. The classification task was addressed by comparing different classifiers trained on hand crafted features to pre-trained and fine-tuned RoBERTa transformer models. The spoiler generation task was formulated as a question answering task, using the clickbait posts as questions and the articles as foundation to retrieve the answer from. The results show that even of the shelve transformer models outperform shallow learned models in the classification task. The spoiler generation task is more complex and needs an advanced system.

pdf abs
Sakura at SemEval-2023 Task 2: Data Augmentation via Translation
Alberto Poncelas | Maksim Tkachenko | Ohnmar Htun

We demonstrate a simple yet effective approach to augmenting training data for multilingual named entity recognition using translations. The named entity spans from the original sentences are transferred to translations via word alignment and then filtered with the baseline recognizer. The proposed approach outperforms the baseline XLM-Roberta on the multilingual dataset.

pdf abs
Hitachi at SemEval-2023 Task 4: Exploring Various Task Formulations Reveals the Importance of Description Texts on Human Values
Masaya Tsunokake | Atsuki Yamaguchi | Yuta Koreeda | Hiroaki Ozaki | Yasuhiro Sogawa

This paper describes our participation in SemEval-2023 Task 4, ValueEval: Identification of Human Values behind Arguments. The aim of this task is to identify whether or not an input text supports each of the 20 pre-defined human values. Previous work on human value detection has shown the effectiveness of a sequence classification approach using BERT. However, little is known about what type of task formulation is suitable for the task. To this end, this paper explores various task formulations, including sequence classification, question answering, and question answering with chain-of-thought prompting and evaluates their performances on the shared task dataset. Experiments show that a zero-shot approach is not as effective as other methods, and there is no one approach that is optimal in every scenario. Our analysis also reveals that utilizing the descriptions of human values can help to improve performance.

pdf abs
DCU at SemEval-2023 Task 10: A Comparative Analysis of Encoder-only and Decoder-only Language Models with Insights into Interpretability
Kanishk Verma | Kolawole Adebayo | Joachim Wagner | Brian Davis

We conduct a comparison of pre-trained encoder-only and decoder-only language models with and without continued pre-training, to detect online sexism. Our fine-tuning-based classifier system achieved the 16th rank in the SemEval 2023 Shared Task 10 Subtask A that asks to distinguish sexist and non-sexist texts. Additionally, we conduct experiments aimed at enhancing the interpretability of systems designed to detect online sexism. Our findings provide insights into the features and decision-making processes underlying our classifier system, thereby contributing to a broader effort to develop explainable AI models to detect online sexism.

pdf abs
PMCoders at SemEval-2023 Task 1: RAltCLIP: Use Relative AltCLIP Features to Rank
Mohammad Javad Pirhadi | Motahhare Mirzaei | Mohammad Reza Mohammadi | Sauleh Eetemadi

Visual Word Sense Disambiguation (VWSD) task aims to find the most related image among 10 images to an ambiguous word in some limited textual context. In this work, we use AltCLIP features and a 3-layer standard transformer encoder to compare the cosine similarity between the given phrase and different images. Also, we improve our model’s generalization by using a subset of LAION-5B. The best official baseline achieves 37.20% and 54.39% macro-averaged hit rate and MRR (Mean Reciprocal Rank) respectively. Our best configuration reaches 39.61% and 56.78% macro-averaged hit rate and MRR respectively. The code will be made publicly available on GitHub.

This paper describes our system submitted to SemEval-2023 Task 5: Clickbait Spoiling. We work on spoiler generation of the subtask 2 and develop a system which comprises two parts: 1) simple seq2seq spoiler generation and 2) post-hoc model ensembling. Using this simple method, we address the challenge of generating multipart spoiler. In the test set, our submitted system outperformed the baseline by a large margin (approximately 10 points above on the BLEU score) for mixed types of spoilers. We also found that our system successfully handled the challenge of the multipart spoiler, confirming the effectiveness of our approach.

pdf abs
Tübingen at SemEval-2023 Task 4: What Can Stance Tell? A Computational Study on Detecting Human Values behind Arguments
Fidan Can

This paper describes the performance of a system which uses stance as an output instead of taking it as an input to identify 20 human values behind given arguments, based on two datasets for SemEval-2023 Task 4. The rationale was to draw a conclusion on whether predicting stance would help predict the given human values better. For this setup—predicting 21 labels—a pre-trained language model, RoBERTa-Large was used. The system had an F$_1$-score of 0.50 for predicting these human values for the main test set while this score was 0.35 on the secondary test set, and through further analysis, this paper aims to give insight into the problem of human value identification.

We present a system for natural language inference in breast cancer clinical trial reports, as framed by SemEval 2023 Task 7: Multi-evidence Natural Language Inference for Clinical Trial Data. In particular, we propose a suite of techniques for two related inference subtasks: entailment and evidence retrieval. The purpose of the textual entailment identification subtask is to determine the inference relation (either entailment or contradiction) between given statement pairs, while the goal of the evidence retrieval task is to identify a set of sentences that support this inference relation. To this end, we propose fine-tuning Bio+Clinical BERT, a BERT-based model pre-trained on clinical data. Along with presenting our system, we analyze our architectural decisions in the context of our model’s accuracy and conduct an error analysis. Overall, our system ranked 20 / 30 on the entailment subtask.

pdf abs
HEVS-TUW at SemEval-2023 Task 8: Ensemble of Language Models and Rule-based Classifiers for Claims Identification and PICO Extraction
Anjani Dhrangadhariya | Wojciech Kusa | Henning Müller | Allan Hanbury

This paper describes the HEVS-TUW team submission to the SemEval-2023 Task 8: Causal Claims. We participated in two subtasks: (1) causal claims detection and (2) PIO identification. For subtask 1, we experimented with an ensemble of weakly supervised question detection and fine-tuned Transformer-based models. For subtask 2 of PIO frame extraction, we used a combination of deep representation learning and a rule-based approach. Our best model for subtask 1 ranks fourth with an F1-score of 65.77%. It shows moderate benefit from ensembling models pre-trained on independent categories. The results for subtask 2 warrant further investigation for improvement.

pdf abs
Jus Mundi at SemEval-2023 Task 6: Using a Frustratingly Easy Domain Adaption for a Legal Named Entity Recognition System
Luis Adrián Cabrera-Diego | Akshita Gheewala

In this work, we present a Named Entity Recognition (NER) system that was trained using a Frustratingly Easy Domain Adaptation (FEDA) over multiple legal corpora. The goal was to create a NER capable of detecting 14 types of legal named entities in Indian judgments. Besides the FEDA architecture, we explored a method based on overlapping context and averaging tensors to process long input texts, which can be beneficial when processing legal documents. The proposed NER reached an F1-score of 0.9007 in the sub-task B of Semeval-2023 Task 6, Understanding Legal Texts.

In this paper, we discuss the methods we applied at SemEval-2023 Task 10: Towards the Explainable Detection of Online Sexism. Given an input text, we perform three classification tasks to predict whether the text is sexist and classify the sexist text into subcategories in order to provide an additional explanation as to why the text is sexist. We explored many different types of models, including GloVe embeddings as the baseline approach, transformer-based deep learning models like BERT, RoBERTa, and DeBERTa, ensemble models, and model blending. We explored various data cleaning and augmentation methods to improve model performance. Pre-training transformer models yielded significant improvements in performance, and ensembles and blending slightly improved robustness in the F1 score.

pdf abs
CodeNLP at SemEval-2023 Task 2: Data Augmentation for Named Entity Recognition by Combination of Sequence Generation Strategies
Micha Marcińczuk | Wiktor Walentynowicz

In the article, we present the CodeNLP submission to the SemEval-2023 Task 2: MultiCoNER II Multilingual Complex Named Entity Recognition. Our approach is based on data augmentation by combining various strategies of sequence generation for training. We show that the extended procedure of fine-tuning a pre-trained language model can bring improvements compared to any single strategy. On the development subsets, the improvements were 1.7 pp and 3.1 pp of F-measure, for English and multilingual datasets, respectively. On the test subsets our models achieved 63.51% and 73.22% of Macro F1, respectively.

pdf abs
SKAM at SemEval-2023 Task 10: Linguistic Feature Integration and Continuous Pretraining for Online Sexism Detection and Classification
Murali Manohar Kondragunta | Amber Chen | Karlo Slot | Sanne Weering | Tommaso Caselli

Sexism has been prevalent online. In this paper, we explored the effect of explicit linguistic features and continuous pretraining on the performance of pretrained language models in sexism detection. While adding linguistic features did not improve the performance of the model, continuous pretraining did slightly boost the performance of the model in Task B from a mean macro-F1 score of 0.6156 to 0.6246. The best mean macro-F1 score in Task A was achieved by a finetuned HateBERT model using regular pretraining (0.8331). We observed that the linguistic features did not improve the model’s performance. At the same time, continuous pretraining proved beneficial only for nuanced downstream tasks like Task-B.

pdf abs
ML Mob at SemEval-2023 Task 5: “Breaking News: Our Semi-Supervised and Multi-Task Learning Approach Spoils Clickbait”
Hannah Sterz | Leonard Bongard | Tobias Werner | Clifton Poth | Martin Hentschel

Online articles using striking headlines that promise intriguing information are often used to attract readers. Most of the time, the information provided in the text is disappointing to the reader after the headline promised exciting news. As part of the SemEval-2023 challenge, we propose a system to generate a spoiler for these headlines. The spoiler provides the information promised by the headline and eliminates the need to read the full article. We consider Multi-Task Learning and generating more data using a distillation approach in our system. With this, we achieve an F1 score up to 51.48% on extracting the spoiler from the articles.

pdf abs
FiRC at SemEval-2023 Task 10: Fine-grained Classification of Online Sexism Content Using DeBERTa
Fadi Hassan | Abdessalam Bouchekif | Walid Aransa

The SemEval 2023 shared task 10 “Explainable Detection of Online Sexism” focuses on detecting and identifying comments and tweets containing sexist expressions and also explaining why it is sexist. This paper describes our system that we used to participate in this shared task. Our model is an ensemble of different variants of fine tuned DeBERTa models that employs a k-fold cross-validation. We have participated in the three tasks A, B and C. Our model ranked 2 nd position in tasks A, 7 th in task B and 4 th in task C.

pdf abs
VBD_NLP at SemEval-2023 Task 2: Named Entity Recognition Systems Enhanced by BabelNet and Wikipedia
Phu Gia Hoang | Le Thanh | Hai-Long Trieu

We describe our systems participated in the SemEval-2023 shared task for Named Entity Recognition (NER) in English and Bangla. In order to address the challenges of the task, where a large number of fine-grained named entity types need to be detected with only a small amount of training data, we use a method to augment the training data based on BabelNet conceptsand Wikipedia redirections to automatically annotate named entities from Wikipedia articles. We build our NER systems based on the powerful mDeBERTa pretrained language model and trained on the augmented data. Our approach significantly enhances the performance of the fine-grained NER task in both English and Bangla subtracks, outperforming the baseline models. Specifically, our augmented systems achieve macro-f1 scores of 52.64% and 64.31%, representing improvements of 2.38% and 11.33% over the English and Bangla baselines, respectively.

pdf abs
Stephen Colbert at SemEval-2023 Task 5: Using Markup for Classifying Clickbait
Sabrina Spreitzer | Hoai Nam Tran

For SemEval-2023 Task 5, we have submitted three DeBERTaV3[LARGE] models to tackle the first subtask, classifying spoiler types (passage, phrase, multi) of clickbait web articles. The choice of basic parameters like sequence length with BERT[BASE] uncased and further approaches were then tested with DeBERTaV3[BASE] only moving the most promising ones to DeBERTaV3[LARGE]. Our research showed that information-placement on webpages is often optimized regarding e.g. ad-placement Those informations are usually described within the webpages markup which is why we conducted an approach that takes this into account. Overall we could not manage to beat the baseline, which we lead down to three reasons: First we only crawled markup for Huffington Post articles, extracting only p- and a-tags which will not cover enough aspects of a webpages design. Second Huffington Post articles are overrepresented in the given dataset, which, third, shows an imbalance towards the spoiler tags. We highly suggest re-annotating the given dataset to use markup-optimized models like MarkupLM or TIE and to clear it from embedded articles like “Yahoo” or archives like “archive.is” or “web.archive” to avoid noise. Also, the imbalance should be tackled by adding articles from sources other than Huffington Post, considering that also multi-tagged entries should be balanced towards passage- and phrase-tagged ones.

pdf abs
UCAS-IIE-NLP at SemEval-2023 Task 12: Enhancing Generalization of Multilingual BERT for Low-resource Sentiment Analysis
Dou Hu | Lingwei Wei | Yaxin Liu | Wei Zhou | Songlin Hu

This paper describes our system designed for SemEval-2023 Task 12: Sentiment analysis for African languages. The challenge faced by this task is the scarcity of labeled data and linguistic resources in low-resource settings. To alleviate these, we propose a generalized multilingual system SACL-XLMR for sentiment analysis on low-resource languages. Specifically, we design a lexicon-based multilingual BERT to facilitate language adaptation and sentiment-aware representation learning. Besides, we apply a supervised adversarial contrastive learning technique to learn sentiment-spread structured representations and enhance model generalization. Our system achieved competitive results, largely outperforming baselines on both multilingual and zero-shot sentiment classification subtasks. Notably, the system obtained the 1st rank on the zero-shot classification subtask in the official ranking. Extensive experiments demonstrate the effectiveness of our system.

pdf abs
Steno AI at SemEval-2023 Task 6: Rhetorical Role Labelling of Legal Documents using Transformers and Graph Neural Networks
Anshika Gupta | Shaz Furniturewala | Vijay Kumari | Yashvardhan Sharma

A legal document is usually long and dense requiring human effort to parse it. It also contains significant amounts of jargon which make deriving insights from it using existing models a poor approach. This paper presents the approaches undertaken to perform the task of rhetorical role labelling on Indian Court Judgements. We experiment with graph based approaches like Graph Convolutional Networks and Label Propagation Algorithm, and transformer-based approaches including variants of BERT to improve accuracy scores on text classification of complex legal documents.

pdf abs
Sebis at SemEval-2023 Task 7: A Joint System for Natural Language Inference and Evidence Retrieval from Clinical Trial Reports
Juraj Vladika | Florian Matthes

With the increasing number of clinical trial reports generated every day, it is becoming hard to keep up with novel discoveries that inform evidence-based healthcare recommendations. To help automate this process and assist medical experts, NLP solutions are being developed. This motivated the SemEval-2023 Task 7, where the goal was to develop an NLP system for two tasks: evidence retrieval and natural language inference from clinical trial data. In this paper, we describe our two developed systems. The first one is a pipeline system that models the two tasks separately, while the second one is a joint system that learns the two tasks simultaneously with a shared representation and a multi-task learning approach. The final system combines their outputs in an ensemble system. We formalize the models, present their characteristics and challenges, and provide an analysis of achieved results. Our system ranked 3rd out of 40 participants with a final submission.

pdf abs
Sren Kierkegaard at SemEval-2023 Task 4: Label-aware text classification using Natural Language Inference
Ignacio Talavera Cepeda | Amalie Pauli | Ira Assent

In this paper, we describe our approach to Task 4 in SemEval 2023. Our pipeline tries to solve the problem of multi-label text classification of human values in English-written arguments. We propose a label-aware system where we reframe the multi-label task into a binary task resembling an NLI task. We propose to include the semantic description of the human values by comparing each description to each argument and ask whether there is entailment or not.

pdf abs
Billy-Batson at SemEval-2023 Task 5: An Information Condensation based System for Clickbait Spoiling
Anubhav Sharma | Sagar Joshi | Tushar Abhishek | Radhika Mamidi | Vasudeva Varma

The Clickbait Challenge targets spoiling the clickbaits using short pieces of information known as spoilers to satisfy the curiosity induced by a clickbait post. The large context of the article associated with the clickbait and differences in the spoiler forms, make the task challenging. Hence, to tackle the large context, we propose an Information Condensation-based approach, which prunes down the unnecessary context. Given an article, our filtering module optimised with a contrastive learning objective first selects the parapraphs that are the most relevant to the corresponding clickbait.The resulting condensed article is then fed to the two downstream tasks of spoiler type classification and spoiler generation. We demonstrate and analyze the gains from this approach on both the tasks. Overall, we win the task of spoiler type classification and achieve competitive results on spoiler generation.

pdf abs
Francis Wilde at SemEval-2023 Task 5: Clickbait Spoiler Type Identification with Transformers
Vijayasaradhi Indurthi | Vasudeva Varma

Clickbait is the text or a thumbnail image that entices the user to click the accompanying link. Clickbaits employ strategies while deliberately hiding the critical elements of the article and revealing partial information in the title, which arouses sufficient curiosity and motivates the user to click the link. In this work, we identify the kind of spoiler given a clickbait title. We formulate this as a text classification problem. We finetune pretrained transformer models on the title of the post and build models for theclickbait-spoiler classification. We achieve a balanced accuracy of 0.70 which is close to the baseline.

pdf abs
DH-FBK at SemEval-2023 Task 10: Multi-Task Learning with Classifier Ensemble Agreement for Sexism Detection
Elisa Leonardelli | Camilla Casula

This paper presents the submissions of the DH-FBK team for the three tasks of Task 10 at SemEval 2023. The Explainable Detection of Online Sexism (EDOS) task aims at detecting sexism in English text in an accurate and explainable way, thanks to a fine-grained annotation that follows a three-level schema: sexist or not (Task A), category of sexism (Task B) and vector of sexism (Task C) exhibited. We use a multi-task learning approach in which models share representations from all three tasks, allowing for knowledge to be shared across them. Notably, with our approach a single model can solve all three tasks. In addition, motivated by the subjective nature of the task, we incorporate inter-annotator agreement information in our multi-task architecture. Although disaggregated annotations are not available, we artificially estimate them using a 5-classifier ensemble, and show that ensemble agreement can be a good approximation of crowd agreement. Our approach achieves competitive results, ranking 32nd out of 84, 24th out of 69 and 11th out of 63 for Tasks A, B and C respectively. We finally show that low inter-annotator agreement levels are associated with more challenging examples for models, making agreement information use ful for this kind of task.

pdf abs
Jack-flood at SemEval-2023 Task 5:Hierarchical Encoding and Reciprocal Rank Fusion-Based System for Spoiler Classification and Generation
Sujit Kumar | Aditya Sinha | Soumyadeep Jana | Rahul Mishra | Sanasam Ranbir Singh

The rise of social media has exponentially witnessed the use of clickbait posts that grab users’ attention. Although work has been done to detect clickbait posts, this is the first task focused on generating appropriate spoilers for these potential clickbaits. This paper presents our approach in this direction. We use different encoding techniques that capture the context of the post text and the target paragraph. We propose hierarchical encoding with count and document length feature-based model for spoiler type classification which uses Recurrence over Pretrained Encoding. We also propose combining multiple ranking with reciprocal rank fusion for passage spoiler retrieval and question-answering approach for phrase spoiler retrieval. For multipart spoiler retrieval, we combine the above two spoiler retrieval methods. Experimental results over the benchmark suggest that our proposed spoiler retrieval methods are able to retrieve spoilers that are semantically very close to the ground truth spoilers.

pdf abs
KingsmanTrio at SemEval-2023 Task 10: Analyzing the Effectiveness of Transfer Learning Models for Explainable Online Sexism Detection
Fareen Tasneem | Tashin Hossain | Jannatun Naim

Online social platforms are now propagating sexist content endangering the involvement and inclusion of women on these platforms. Sexism refers to hostility, bigotry, or discrimination based on gender, typically against women. The proliferation of such notions deters women from engaging in social media spontaneously. Hence, detecting sexist content is critical to ensure a safe online platform where women can participate without the fear of being a target of sexism. This paper describes our participation in subtask A of SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS). This subtask requires classifying textual content as sexist or not sexist. We incorporate a RoBERTa-based architecture and further finetune the hyperparameters to entail better performance. The procured results depict the competitive performance of our approach among the other participants.

pdf abs
SLT at SemEval-2023 Task 1: Enhancing Visual Word Sense Disambiguation through Image Text Retrieval using BLIP
Mohammadreza Molavi | Hossein Zeinali

Based on recent progress in image-text retrieval techniques, this paper presents a fine-tuned model for the Visual Word Sense Disambiguation (VWSD) task. The proposed system fine-tunes a pre-trained model using ITC and ITM losses and employs a candidate selection approach for faster inference. The system was trained on the VWSD task dataset and evaluated on a separate test set using Mean Reciprocal Rank (MRR) metric. Additionally, the system was tested on the provided test set which contained Persian and Italian languages, and the results were evaluated on each language separately. Our proposed system demonstrates the potential of fine-tuning pre-trained models for complex language tasks and provides insights for further research in the field of image text retrieval.

pdf abs
CAIR-NLP at SemEval-2023 Task 2: A Multi-Objective Joint Learning System for Named Entity Recognition
Sangeeth N | Biswajit Paul | Chandramani Chaudhary

This paper describes the NER system designed by the CAIR-NLP team for submission to Multilingual Complex Named Entity Recognition (MultiCoNER II) shared task, which presents a novel challenge of recognizing complex, ambiguous, and fine-grained entities in low-context, multi-lingual, multi-domain dataset and evaluation on the noisy subset. We propose a Multi-Objective Joint Learning System (MOJLS) for NER, which aims to enhance the representation of entities and improve label predictions through the joint implementation of a set of learning objectives. Our official submission MOJLS implements four objectives. These include the representation of the named entities should be close to its entity type definition, low-context inputs should have representation close to their augmented context, and also minimization of two label prediction errors, one based on CRF and another biaffine-based predictions, where both are producing similar output label distributions. The official results ranked our system 2nd in five tracks (Multilingual, Spanish, Swedish, Ukrainian, and Farsi) and 3 rd in three (French, Italian, and Portuguese) out of 13 tracks. Also evaluation of the noisy subset, our model achieved relatively better ranks. Official results indicate the effectiveness of the proposed MOJLS in dealing with the contemporary challenges of NER.

pdf abs
BpHigh at SemEval-2023 Task 7: Can Fine-tuned Cross-encoders Outperform GPT-3.5 in NLI Tasks on Clinical Trial Data?
Bhavish Pahwa | Bhavika Pahwa

Many nations and organizations have begun collecting and storing clinical trial records for storage and analytical purposes so that medical and clinical practitioners can refer to them on a centralized database over the internet and stay updated with the current clinical information. The amount of clinical trial records have gone through the roof, making it difficult for many medical and clinical practitioners to stay updated with the latest information. To help and support medical and clinical practitioners, there is a need to build intelligent systems that can update them with the latest information in a byte-sized condensed format and, at the same time, leverage their understanding capabilities to help them make decisions. This paper describes our contribution to SemEval 2023 Task 7: Multi-evidence Natural Language Inference for Clinical Trial Data (NLI4CT). Our results show that there is still a need to build domain-specific models as smaller transformer-based models can be finetuned on that data and outperform foundational large language models like GPT-3.5. We also demonstrate how the performance of GPT-3.5 can be increased using few-shot prompting by leveraging the semantic similarity of the text samples and the few-shot train snippets. We will also release our code and our models on open source hosting platforms, GitHub and HuggingFace.

pdf abs
WADER at SemEval-2023 Task 9: A Weak-labelling framework for Data augmentation in tExt Regression Tasks
Manan Suri | Aaryak Garg | Divya Chaudhary | Ian Gorton | Bijendra Kumar

Intimacy is an essential element of human relationships and language is a crucial means of conveying it. Textual intimacy analysis can reveal social norms in different contexts and serve as a benchmark for testing computational models’ ability to understand social information. In this paper, we propose a novel weak-labeling strategy for data augmentation in text regression tasks called WADER. WADER uses data augmentation to address the problems of data imbalance and data scarcity and provides a method for data augmentation in cross-lingual, zero-shot tasks. We benchmark the performance of State-of-the-Art pre-trained multilingual language models using WADER and analyze the use of sampling techniques to mitigate bias in data and optimally select augmentation candidates. Our results show that WADER outperforms the baseline model and provides a direction for mitigating data imbalance and scarcity in text regression tasks.

The computational identification of human values is a novel and challenging research that holds the potential to offer valuable insights into the nature of human behavior and cognition. This paper presents the methodology adopted by the Arthur-Caplan research team for the SemEval-2023 Task 4, which entailed the detection of human values behind arguments. The proposed system integrates BERT, ERNIE2.0, RoBERTA and XLNet models with fine tuning. Experimental results show that the macro F1 score of our system achieved 0.512, which overperformed baseline methods by 9.2% on the test set.

This paper presents an approach to tackle the task of Visual Word Sense Disambiguation (Visual-WSD), which involves determining the most appropriate image to represent a given polysemous word in one of its particular senses. The proposed approach leverages the CLIP model, prompt engineering, and text-to-image models such as GLIDE and DALL-E 2 for both image retrieval and generation. To evaluate our approach, we participated in the SemEval 2023 shared task on “Visual Word Sense Disambiguation (Visual-WSD)” using a zero-shot learning setting, where we compared the accuracy of different combinations of tools, including “Simple prompt-based” methods and “Generated prompt-based” methods for prompt engineering using completion models, and text-to-image models for changing input modality from text to image. Moreover, we explored the benefits of cross-modality evaluation between text and candidate images using CLIP. Our experimental results demonstrate that the proposed approach reaches better results than cross-modality approaches, highlighting the potential of prompt engineering and text-to-image models to improve accuracy in Visual-WSD tasks. We assessed our approach in a zero-shot learning scenario and attained an accuracy of 68.75\% in our best attempt.

pdf abs
SzegedAI at SemEval-2023 Task 1: Applying Quasi-Symbolic Representations in Visual Word Sense Disambiguation
Gábor Berend

In this paper, we introduce our submission in the task of visual word sense disambiguation (vWSD). Our proposed solution operates by deriving quasi-symbolic semantic categories from the hidden representations of multi-modal text-image encoders. Our results are mixed, as we manage to achieve a substantial boost in performance when evaluating on a validation set, however, we experienced detrimental effects during evaluation on the actual test set. Our positive results on the validation set confirms the validity of the quasi-symbolic features, whereas our results on the test set revealed that the proposed technique was not able to cope with the sufficiently different distribution of the test data.

pdf abs
Attention at SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS)
Debashish Roy | Manish Shrivastava

In this paper, we have worked on explainability and understanding of the decisions made by models in the form of classification tasks. The task is divided into 3 subtasks. The first task consists of determining Binary Sexism Detection. The second task describes the Category of Sexism. The third task describes a more Fine-grained Category of Sexism. Our work explores solving these tasks as a classification problem by fine-tuning transformer-based architecture. We have performed several experiments with our architecture, including combining multiple transformers, using domain adaptive pretraining on the unlabelled dataset provided by Reddit and Gab, Joint learning, and taking different layers of transformers as input to a classification head. Our system (with the team name Attention’) was able to achieve a macro F1 score of 0.839 for task A, 0.5835 macro F1 score for task B and 0.3356 macro F1 score for task C at the Codalab SemEval Competition. Later we improved the accuracy of Task B to 0.6228 and Task C to 0.3693 in the test set.

pdf abs
Dragonfly_captain at SemEval-2023 Task 11: Unpacking Disagreement with Investigation of Annotator Demographics and Task Difficulty
Ruyuan Wan | Karla Badillo-Urquiola

This study investigates learning with disagreement in NLP tasks and evaluates its performance on four datasets. The results suggest that the model performs best on the experimental dataset and faces challenges in minority languages. Furthermore, the analysis indicates that annotator demographics play a significant role in the interpretation of such tasks. This study suggests the need for greater consideration of demographic differences in annotators and more comprehensive evaluation metrics for NLP models.

We present the findings of our participation in the SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS) task, a shared task on offensive language (sexism) detection on English Gab and Reddit dataset. We investigated the effects of transferring two language models: XLM-T (sentiment classification) and HateBERT (same domain - Reddit) for multilevel classification into Sexist or not Sexist, and other subsequent sub-classifications of the sexist data. We also use synthetic classification of unlabelled dataset and intermediary class information to maximize the performance of our models. We submitted a system in Task A, and it ranked 49th with F1-score of 0.82. This result showed to be competitive as it only under-performed the best system by 0.052%F1-score.

pdf abs
CSECU-DSG at SemEval-2023 Task 4: Fine-tuning DeBERTa Transformer Model with Cross-fold Training and Multi-sample Dropout for Human Values Identification
Abdul Aziz | Md. Akram Hossain | Abu Nowshed Chy

Human values identification from a set of argument is becoming a prominent area of research in argument mining. Among some options, values convey what may be the most desirable and widely accepted answer. The diversity of human beliefs, random texture and implicit meaning within the arguments makes it more difficult to identify human values from the arguments. To address these challenges, SemEval-2023 Task 4 introduced a shared task ValueEval focusing on identifying human values categories based on given arguments. This paper presents our participation in this task where we propose a finetuned DeBERTa transformers-based classification approach to identify the desire human value category. We utilize different training strategy with the finetuned DeBERTa model to enhance contextual representation on this downstream task. Our proposed method achieved competitive performance among the participants’ methods.

This paper describes our approach for SemEval- 2023 Task 3: Detecting the category, the fram- ing, and the persuasion techniques in online news in a multilingual setup. For Subtask 1 (News Genre), we propose an ensemble of fully trained and adapter mBERT models which was ranked joint-first for German, and had the high- est mean rank of multi-language teams. For Subtask 2 (Framing), we achieved first place in 3 languages, and the best average rank across all the languages, by using two separate ensem- bles: a monolingual RoBERTa-MUPPETLARGE and an ensemble of XLM-RoBERTaLARGE with adapters and task adaptive pretraining. For Sub- task 3 (Persuasion Techniques), we trained a monolingual RoBERTa-Base model for English and a multilingual mBERT model for the re- maining languages, which achieved top 10 for all languages, including 2nd for English. For each subtask, we compared monolingual and multilingual approaches, and considered class imbalance techniques.

pdf abs
CKingCoder at SemEval-2023 Task 9: Multilingual Tweet Intimacy Analysis
Harish B | Naveen D | Prem Balasubramanian | Aarthi S

The SemEval 2023 Task 9 Multilingual Tweet Intimacy Analysis, is a shared task for analysing the intimacy in the tweets posted on Twitter. The dataset was provided by Pei and Jurgens, who are part of the task organisers, for this task consists of tweets in various languages, such as Chinese, English, French, Italian, Portuguese, and Spanish. The testing dataset also had unseen languages such as Hindi, Arabic, Dutch and Korean. The tweets may or may not be related to intimacy. The task of our team was to score the intimacy in tweets and place it in the range of 05 based on the level of intimacy in the tweet using the dataset provided which consisted of tweets along with its scores. The intimacy score is used to indicate whether a tweet is intimate or not. Our team participated in the task and proposed the ROBERTa model to analyse the intimacy of the tweets.

The MultiCoNER II shared task aims to tackle multilingual named entity recognition (NER) in fine-grained and noisy scenarios, and it inherits the semantic ambiguity and low-context setting of the MultiCoNER I task. To cope with these problems, the previous top systems in the MultiCoNER I either incorporate the knowledge bases or gazetteers. However, they still suffer from insufficient knowledge, limited context length, single retrieval strategy. In this paper, our team DAMO-NLP proposes a unified retrieval-augmented system (U-RaNER) for fine-grained multilingual NER. We perform error analysis on the previous top systems and reveal that their performance bottleneck lies in insufficient knowledge. Also, we discover that the limited context length causes the retrieval knowledge to be invisible to the model. To enhance the retrieval context, we incorporate the entity-centric Wikidata knowledge base, while utilizing the infusion approach to broaden the contextual scope of the model. Also, we explore various search strategies and refine the quality of retrieval knowledge. Our system wins 9 out of 13 tracks in the MultiCoNER II shared task. Additionally, we compared our system with ChatGPT, one of the large language models which have unlocked strong capabilities on many tasks. The results show that there is still much room for improvement for ChatGPT on the extraction task.

pdf abs
ROZAM at SemEval 2023 Task 9: Multilingual Tweet Intimacy Analysis
Mohammadmostafa Rostamkhani | Ghazal Zamaninejad | Sauleh Eetemadi

We build a model using large multilingual pretrained language model XLM-T for regression task and fine-tune it on the MINT (Multilingual INTmacy) analysis dataset which covers 6 languages for training and 4 languages for testing zero-shot performance of the model. The dataset was annotated and the annotations are intimacy scores. We experiment with several deep learning architectures to predict intimacy score. To achieve optimal performance we modify several model settings including loss function, number and type of layers. In total, we ran 16 end-to-end experiments. Our best system achieved a Pearson Correlation score of 0.52.

pdf abs
Prodicus at SemEval-2023 Task 4: Enhancing Human Value Detection with Data Augmentation and Fine-Tuned Language Models
Erfan Moosavi Monazzah | Sauleh Eetemadi

This paper introduces a data augmentation technique for the task of detecting human values. Our approach involves generating additional examples using metadata that describes the labels in the datasets. We evaluated the effectiveness of our method by fine-tuning BERT and RoBERTa models on our augmented dataset and comparing their F1 -scores to those of the non-augmented dataset. We obtained competitive results on both the Main test set and the Nahj al-Balagha test set, ranking 14th and 7th respectively among the participants. We also demonstrate that by incorporating our augmentation technique, the classification performance of BERT and RoBERTa is improved, resulting in an increase of up to 10.1% in their F1-score.

In this paper, we discuss our efforts on SemEval-2023 Task4, a task to classify the human value categoriesthat an argument draws on. Arguments consist of a premise, conclusion,and the premise’s stance on the conclusion. Our team experimented with GloVe embeddings and fine-tuning BERT. We found that an ensembling of BERT and GloVe with RidgeRegression worked the best.

pdf abs
UAlberta at SemEval-2023 Task 1: Context Augmentation and Translation for Multilingual Visual Word Sense Disambiguation
Michael Ogezi | Bradley Hauer | Talgat Omarov | Ning Shi | Grzegorz Kondrak

We describe the systems of the University of Alberta team for the SemEval-2023 Visual Word Sense Disambiguation (V-WSD) Task. We present a novel algorithm that leverages glosses retrieved from BabelNet, in combination with text and image encoders. Furthermore, we compare language-specific encoders against the application of English encoders to translated texts. As the contexts given in the task datasets are extremely short, we also experiment with augmenting these contexts with descriptions generated by a language model. This yields substantial improvements in accuracy. We describe and evaluate additional V-WSD methods which use image generation and text-conditioned image segmentation. Some of our experimental results exceed those of our official submissions on the test set. Our code is publicly available at https://github.com/UAlberta-NLP/v-wsd.

pdf abs
iREL at SemEval-2023 Task 9: Improving understanding of multilingual Tweets using Translation-Based Augmentation and Domain Adapted Pre-Trained Models
Bhavyajeet Singh | Ankita Maity | Pavan Kandru | Aditya Hari | Vasudeva Varma

This paper describes our system (iREL) for Tweet intimacy analysis sharedtask of the SemEval 2023 workshop at ACL 2023. Oursystem achieved an overall Pearson’s r score of 0.5924 and ranked 10th on the overall leaderboard. For the unseen languages, we ranked third on the leaderboard and achieved a Pearson’s r score of 0.485. We used a single multilingual model for all languages, as discussed in this paper. We provide a detailed description of our pipeline along with multiple ablation experiments to further analyse each component of the pipeline. We demonstrate how translation-based augmentation, domain-specific features, and domain-adapted pre-trained models improve the understanding of intimacy in tweets. The codecan be found at \href{https://github.com/bhavyajeet/Multilingual-tweet-intimacy}{https://github.com/bhavyajeet/Multilingual-tweet-intimacy}

pdf abs
Team TheSyllogist at SemEval-2023 Task 3: Language-Agnostic Framing Detection in Multi-Lingual Online News: A Zero-Shot Transfer Approach
Osama Mohammed Afzal | Preslav Nakov

We describe our system for SemEval-2022 Task 3 subtask 2 which on detecting the frames used in a news article in a multi-lingual setup. We propose a multi-lingual approach based on machine translation of the input, followed by an English prediction model. Our system demonstrated good zero-shot transfer capability, achieving micro-F1 scores of 53% for Greek (4th on the leaderboard) and 56.1% for Georgian (3rd on the leaderboard), without any prior training on translated data for these languages. Moreover, our system achieved comparable performance on seven other languages, including German, English, French, Russian, Italian, Polish, and Spanish. Our results demonstrate the feasibility of creating a language-agnostic model for automatic framing detection in online news.

pdf abs
Tenzin-Gyatso at SemEval-2023 Task 4: Identifying Human Values behind Arguments Using DeBERTa
Pavan Kandru | Bhavyajeet Singh | Ankita Maity | Kancharla Aditya Hari | Vasudeva Varma

Identifying human values behind arguments isa complex task which requires understandingof premise, stance and conclusion together. Wepropose a method that uses a pre-trained lan-guage model, DeBERTa, to tokenize and con-catenate the text before feeding it into a fullyconnected neural network. We also show thatleveraging the hierarchy in values improves theperformance by .14 F1 score.

pdf abs
MilaNLP at SemEval-2023 Task 10: Ensembling Domain-Adapted and Regularized Pretrained Language Models for Robust Sexism Detection
Amanda Cercas Curry | Giuseppe Attanasio | Debora Nozza | Dirk Hovy

We present the system proposed by the MilaNLP team for the Explainable Detection of Online Sexism (EDOS) shared task. We propose an ensemble modeling approach to combine different classifiers trained with domain adaptation objectives and standard fine-tuning. Our results show that the ensemble is more robust than individual models and that regularized models generate more “conservative” predictions, mitigating the effects of lexical overfitting.However, our error analysis also finds that many of the misclassified instances are debatable, raising questions about the objective annotatability of hate speech data.

pdf abs
YNU-HPCC at SemEval-2023 Task 6: LEGAL-BERT Based Hierarchical BiLSTM with CRF for Rhetorical Roles Prediction
Yu Chen | You Zhang | Jin Wang | Xuejie Zhang

To understand a legal document for real-world applications, SemEval-2023 Task 6 proposes a shared Subtask A, rhetorical roles (RRs) prediction, which requires a system to automatically assign a RR label for each semantical segment in a legal text. In this paper, we propose a LEGAL-BERT based hierarchical BiLSTM model with conditional random field (CRF) for RR prediction, which primarily consists of two parts: word-level and sentence-level encoders. The word-level encoder first adopts a legal-domain pre-trained language model, LEGAL-BERT, initially word-embedding words in each sentence in a document and a word-level BiLSTM further encoding such sentence representation. The sentence-level encoder then uses an attentive pooling method for sentence embedding and a sentence-level BiLSTM for document modeling. Finally, a CRF is utilized to predict RRs for each sentence. The officially released results show that our method outperformed the baseline systems. Our team won 7th rank out of 27 participants in Subtask A.

Under the umbrella of anonymous social networks, many women have suffered from abuse, discrimination, and other sexist expressions online. However, exsiting methods based on keyword filtering and matching performed poorly on online sexism detection, which lacked the capability to identify implicit stereotypes and discrimination. Therefore, this paper proposes a System of Ensembling Fine-tuning Models (SEFM) at SemEval-2023 Task 10: Explainable Detection of Online Sexism. We firstly use four task-adaptive pre-trained language models to flag all texts. Secondly, we alleviate the data imbalance from two perspectives: over-sampling the labelled data and adjusting the loss function. Thirdly, we add indicators and feedback modules to enhance the overall performance. Our system attained macro F1 scores of 0.8538, 0.6619, and 0.4641 for Subtask A, B, and C, respectively. Our system exhibited strong performance across multiple tasks, with particularly noteworthy performance in Subtask B. Comparison experiments and ablation studies demonstrate the effectiveness of our system.

pdf abs
CSECU-DSG at SemEval-2023 Task 10: Exploiting Transformers with Stacked LSTM for the Explainable Detection of Online Sexism
Afrin Sultana | Radiathun Tasnia | Nabila Ayman | Abu Nowshed Chy

Sexism is a harmful phenomenon that provokes gender inequalities and social imbalances. The expanding application of sexist content on social media platforms creates an unwelcoming and discomforting environment for many users. The implication of sexism is a multi-faceted subject as it can be integrated with other categories of discrimination. Binary classification tools are frequently employed to identify sexist content, but most of them provide extensive, generic categories with no further insights. SemEval-2023 introduced the Explainable Detection of Online Sexism (EDOS) task that emphasizes detecting and explaining the category of sexist content. The content of this paper details our involvement in this task where we present a neural network architecture employing document embeddings from a fine-tuned transformer-based model into stacked long short-term memory (LSTM) and a fully connected linear (FCL) layer . Our proposed methodology obtained an F1 score of 0.8218 (ranked 51st) in Task A. It achieved an F1 score of 0.5986 (ranked 40th) and 0.4419 (ranked 28th) in Tasks B and C, respectively.

pdf abs
John Boy Walton at SemEval-2023 Task 5: An Ensemble Approach to Spoiler Classification and Retrieval for Clickbait Spoiling
Maksim Shmalts

Clickbait spoiling is a task of generating or retrieving a fairly short text with a purpose to satisfy curiosity of a content consumer without their addressing to the document linked to a clickbait post or headline. In this paper we introduce an ensemble approach to clickbait spoiling task at SemEval-2023. The tasks consists of spoiler classification and retrieval on Webis-Clickbait-22 dataset. We show that such an ensemble solution is quite successful at classification, whereas it might perform poorly at retrieval with no additional features. In conclusion we outline our thoughts on possible directions to improving the approach and shape a set of suggestions to the said features.

pdf abs
TeamEC at SemEval-2023 Task 4: Transformers vs. Low-Resource Dictionaries, Expert Dictionary vs. Learned Dictionary
Nicolas Stefanovitch | Bertrand De Longueville | Mario Scharfbillig

This paper describes the system we used to participate in the shared task, as well as additional experiments beyond the scope of the shared task, but using its data. Our primary goal is to compare the effectiveness of transformers model compared to low-resource dictionaries. Secondly, we compare the difference in performance of a learned dictionary and of a dictionary designed by experts in the field of values. Our findings surprisingly show that transformers perform on par with a dictionary containing less than 1k words, when evaluated with 19 fine-grained categories, and only outperform a dictionary-based approach in a coarse setting with 10 categories. Interestingly, the expert dictionary has a precision on par with the learned one, while its recall is clearly lower, potentially an indication of overfitting of topics to values in the shared task’s dataset. Our findings should be of interest to both the NLP and Value scientific communities on the use of automated approaches for value classification

pdf abs
CSECU-DSG at SemEval-2023 Task 6: Segmenting Legal Documents into Rhetorical Roles via Fine-tuned Transformer Architecture
Fareen Tasneem | Tashin Hossain | Jannatun Naim | Abu Nowshed Chy

Automated processing of legal documents is essential to manage the enormous volume of legal corpus and to make it easily accessible to a broad spectrum of people. But due to the amorphous and variable nature of legal documents, it is very challenging to directly proceed with complicated processes such as summarization, analysis, and query. Segmenting the documents as per the rhetorical roles can aid and accelerate such procedures. This paper describes our participation in SemEval-2023 task 6: Sub-task A: Rhetorical Roles Prediction. We utilize a finetuned Legal-BERT to address this task. We also conduct an error analysis to illustrate the shortcomings of our deployed approach.

pdf abs
CAISA at SemEval-2023 Task 8: Counterfactual Data Augmentation for Mitigating Class Imbalance in Causal Claim Identification
Akbar Karimi | Lucie Flek

Class imbalance problem can cause machine learning models to produce an undesirable performance on the minority class as well as the whole dataset. Using data augmentation techniques to increase the number of samples is one way to tackle this problem. We introduce a novel counterfactual data augmentation by verb replacement for the identification of medical claims. In addition, we investigate the impact of this method and compare it with 3 other data augmentation techniques, showing that the proposed method can result in significant (relative) improvement on the minority class.

pdf abs
ReDASPersuasion at SemEval-2023 Task 3: Persuasion Detection using Multilingual Transformers and Language Agnostic Features
Fatima Zahra Qachfar | Rakesh Verma

This paper describes a multilingual persuasion detection system that incorporates persuasion technique attributes for a multi-label classification task. The proposed method has two advantages. First, it combines persuasion features with a sequence classification transformer to classify persuasion techniques. Second, it is a language agnostic approach that supports a total of 100 languages, guaranteed by the multilingual transformer module and the Google translator interface. We found that our persuasion system outperformed the SemEval baseline in all languages except zero shot prediction languages, which did not constitute the main focus of our research. With the highest F1-Micro score of 0.45, Italian achieved the eighth position on the leaderboard.

pdf abs
IREL at SemEval-2023 Task 11: User Conditioned Modelling for Toxicity Detection in Subjective Tasks
Ankita Maity | Pavan Kandru | Bhavyajeet Singh | Kancharla Aditya Hari | Vasudeva Varma

This paper describes our system used in the SemEval-2023 Task 11 Learning With Disagreements (Le-Wi-Di). This is a subjective task since it deals with detecting hate speech, misogyny and offensive language. Thus, disagreement among annotators is expected. We experiment with different settings like loss functions specific for subjective tasks and include anonymized annotator-specific information to help us understand the level of disagreement. We perform an in-depth analysis of the performance discrepancy of these different modelling choices. Our system achieves a cross-entropy of 0.58, 4.01 and 3.70 on the test sets of HS-Brexit, ArMIS and MD-Agreement, respectively. Our code implementation is publicly available.

pdf abs
Arguably at SemEval-2023 Task 11: Learning the disagreements using unsupervised behavioral clustering and language models
Guneet Singh Kohli | Vinayak Tiwari

We describe SemEval-2023 Task 11 on behavioral segregation of annotations to find the similarities and contextual thinking of a group of annotators. We have utilized a behavioral segmentation analysis on the annotators to model them independently and combine the results to yield soft and hard scores. Our team focused on experimenting with hierarchical clustering with various distance metrics for similarity, dissimilarity, and reliability. We modeled the clusters and assigned weightage to find the soft and hard scores. Our team was able to find out hidden behavioral patterns among the judgments of annotators after rigorous experiments. The proposed system is made available.

pdf abs
MasonNLP+ at SemEval-2023 Task 8: Extracting Medical Questions, Experiences and Claims from Social Media using Knowledge-Augmented Pre-trained Language Models
Giridhar Kaushik Ramachandran | Haritha Gangavarapu | Kevin Lybarger | Ozlem Uzuner

In online forums like Reddit, users share their experiences with medical conditions and treatments, including making claims, asking questions, and discussing the effects of treatments on their health. Building systems to understand this information can effectively monitor the spread of misinformation and verify user claims. The Task-8 of the 2023 International Workshop on Semantic Evaluation focused on medical applications, specifically extracting patient experience- and medical condition-related entities from user posts on social media. The Reddit Health Online Talk (RedHot) corpus contains posts from medical condition-related subreddits with annotations characterizing the patient experience and medical conditions. In Subtask-1, patient experience is characterized by personal experience, questions, and claims. In Subtask-2, medical conditions are characterized by population, intervention, and outcome. For the automatic extraction of patient experiences and medical condition information, as a part of the challenge, we proposed language-model-based extraction systems that ranked $3ˆ{rd}$ on both subtasks’ leaderboards. In this work, we describe our approach and, in addition, explore the automatic extraction of this information using domain-specific language models and the inclusion of external knowledge.

pdf abs
Howard University Computer Science at SemEval-2023 Task 12: A 2-Step System Design for Multilingual Sentiment Classification with Language Identification
Saurav Aryal | Howard Prioleau

The recent release of the AfriSenti-SemEval shared Task 12 has made available 14 new datasets annotated for sentiment analysis on African Languages. We proposed and evaluated two approaches to this task, Delta TF-IDF, and a proposed Language-Specific Model Fusion Algorithm using Language Identification, both of which produced comparable or better classification performance than the current state-of-art models on this task: AfriBERTa, AfroXLMR, and AfroLM.

pdf abs
SUT at SemEval-2023 Task 1: Prompt Generation for Visual Word Sense Disambiguation
Omid Ghahroodi | Seyed Arshan Dalili | Sahel Mesforoush | Ehsaneddin Asgari

Visual Word Sense Disambiguation (V-WSD) identifies the correct visual sense of a multi-sense word in a specific context. This can be challenging as images may need to provide additional context and words may have multiple senses. A proper V-WSD system can benefit applications like image retrieval and captioning. This paper proposes a Prompt Generation approach to solve this challenge. This approach improves the robustness of language-image models like CLIP to contextual ambiguities and helps them better correlate between textual and visual contexts of different senses of words.

The human values expressed in argumentative texts can provide valuable insights into the culture of a society. They can be helpful in various applications such as value-based profiling and ethical analysis. However, one of the first steps in achieving this goal is to detect the category of human value from an argument accurately. This task is challenging due to the lack of data and the need for philosophical inference. It also can be challenging for humans to classify arguments according to their underlying human values. This paper elaborates on our model for the SemEval 2023 Task 4 on human value detection. We propose a class-token attention-based model and evaluate it against baseline models, including finetuned BERT language model and a keyword-based approach.

pdf abs
SinaAI at SemEval-2023 Task 3: A Multilingual Transformer Language Model-based Approach for the Detection of News Genre, Framing and Persuasion Techniques
Aryan Sadeghi | Reza Alipour | Kamyar Taeb | Parimehr Morassafar | Nima Salemahim | Ehsaneddin Asgari

This paper describes SinaAI’s participation in SemEval-2023 Task 3, which involves detecting propaganda in news articles across multiple languages. The task comprises three sub-tasks: (i) genre detection, (ii) news framing,and (iii) persuasion technique identification. The employed dataset includes news articles in nine languages and domains, including English, French, Italian, German, Polish, Russian, Georgian, Greek, and Spanish, with labeled instances of news framing, genre, and persuasion techniques. Our approach combines fine-tuning multilingual language models such as XLM, LaBSE, and mBERT with data augmentation techniques. Our experimental results show that XLM outperforms other models in terms of F1-Micro in and F1-Macro, and the ensemble of XLM and LaBSE achieved the best performance. Our study highlights the effectiveness of multilingual sentence embedding models in multilingual propaganda detection. Our models achieved highest score for two languages (greek and italy) in sub-task 1 and one language (Russian) for sub-task 2.

pdf abs
RCLN at SemEval-2023 Task 1: Leveraging Stable Diffusion and Image Captions for Visual WSD
Antonina Mijatovic | Davide Buscaldi | Ekaterina Borisova

This paper describes the participation of the RCLN team at the Visual Word Sense Disambiguation task at SemEval 2023. The participation was focused on the use of CLIP as a base model for the matching between text and images with additional information coming from captions generated from images and the generation of images from the prompt text using Stable Diffusion. The results we obtained are not particularly good, but interestingly enough, we were able to improve over the CLIP baseline in Italian by recurring simply to the generated images.

pdf abs
Friedrich Nietzsche at SemEval-2023 Task 4: Detection of Human Values from Text Using Machine Learning
Abdul Jawad Mohammed | Sruthi Sundharram | Sanidhya Sharma

Literature permeates through almost every facet of our lives, whether through books, magazines, or internet articles. Moreover, every piece of written work contains ideas and opinions that we tend to relate to, accept or disregard, debate over, or enlighten ourselves with. However, the existence of subtle themes that are difficult to discern had inspired us to utilize four machine learning algorithms: Decision Trees, Random Forest, Logistic Regression, and Support Vec- tor Classifier (SVC) to aid in their detection. Trained on the ValueEval data set as a multi- label classification problem, the supervised ma- chine learning models did not perform as well as expected, with F1 metrics hovering from 0.0 to 0.04 for each value. Noting this, the lim- itations and weaknesses of our approach are discussed in our paper.

pdf abs
azaad@BND at SemEval-2023 Task 2: How to Go from a Simple Transformer Model to a Better Model to Get Better Results in Natural Language Processing
Reza Ahmadi | Shiva Arefi | Mohammad Jafarabad

In this article, which was prepared for the sameval2023 competition (task number 2), information about the implementation techniques of the transformer model and the use of the pre-trained BERT model in order to identify the named entity (NER) in the English language, has been collected and also the implementation method is explained. Finally, it led to an F1 score of about 57% for Fine-grained and 72% for Coarse-grained in the dev data. In the final test data, F1 score reached 50%.

pdf abs
PingAnLifeInsurance at SemEval-2023 Task 10: Using Multi-Task Learning to Better Detect Online Sexism
Mengyuan Zhou

This paper describes our system used in the SemEval-2023 Task 10: Towards ExplainableDetection of Online Sexism (Kirk et al., 2023). The harmful effects of sexism on the internet have impacted both men and women, yet current research lacks a fine-grained classification of sexist content. The task involves three hierarchical sub-tasks, which we addressed by employing a multitask-learning framework. To further enhance our system’s performance, we pre-trained the roberta-large (Liu et al., 2019b) and deberta-v3-large (He et al., 2021) models on two million unlabeled data, resulting in significant improvements on sub-tasks A and C. In addition, the multitask-learning approach boosted the performance of our models on subtasks A and B. Our system exhibits promising results in achieving explainable detection of online sexism, attaining a test f1-score of 0.8746 on sub-task A (ranking 1st on the leaderboard), and ranking 5th on sub-tasks B and C.

pdf abs
SemEval-2023 Task 10: Explainable Detection of Online Sexism
Hannah Kirk | Wenjie Yin | Bertie Vidgen | Paul Röttger

Online sexism is a widespread and harmful phenomenon. Automated tools can assist the detection of sexism at scale. Binary detection, however, disregards the diversity of sexist content, and fails to provide clear explanations for why something is sexist. To address this issue, we introduce SemEval Task 10 on the Explainable Detection of Online Sexism (EDOS). We make three main contributions: i) a novel hierarchical taxonomy of sexist content, which includes granular vectors of sexism to aid explainability; ii) a new dataset of 20,000 social media comments with fine-grained labels, along with larger unlabelled datasets for model adaptation; and iii) baseline models as well as an analysis of the methods, results and errors for participant submissions to our task.

pdf abs
Ertim at SemEval-2023 Task 2: Fine-tuning of Transformer Language Models and External Knowledge Leveraging for NER in Farsi, English, French and Chinese
Kevin Deturck | Pierre Magistry | Bénédicte Diot-Parvaz Ahmad | Ilaine Wang | Damien Nouvel | Hugo Lafayette

Transformer language models are now a solid baseline for Named Entity Recognition and can be significantly improved by leveraging complementary resources, either by integrating external knowledge or by annotating additional data. In a preliminary step, this work presents experiments on fine-tuning transformer models. Then, a set of experiments has been conducted with a Wikipedia-based reclassification system. Additionally, we conducted a small annotation campaign on the Farsi language to evaluate the impact of additional data. These two methods with complementary resources showed improvements compared to fine-tuning only.

This paper describes the results of SemEval 2023 task 7 – Multi-Evidence Natural Language Inference for Clinical Trial Data (NLI4CT) – consisting of 2 tasks, a Natural Language Inference (NLI) task, and an evidence selection task on clinical trial data. The proposed challenges require multi-hop biomedical and numerical reasoning, which are of significant importance to the development of systems capable of large-scale interpretation and retrieval of medical evidence, to provide personalized evidence-based care. Task 1, the entailment task, received 643 submissions from 40 participants, and Task 2, the evidence selection task, received 364 submissions from 23 participants. The tasks are challenging, with the majority of submitted systems failing to significantly outperform the majority class baseline on the entailment task, and we observe significantly better performance on the evidence selection task than on the entailment task. Increasing the number of model parameters leads to a direct increase in performance, far more significant than the effect of biomedical pre-training. Future works could explore the limitations of large models for generalization and numerical inference, and investigate methods to augment clinical datasets to allow for more rigorous testing and to facilitate fine-tuning. We envisage that the dataset, models, and results of this task will be useful to the biomedical NLI and evidence retrieval communities. The dataset, competition leaderboard, and website are publicly available.

pdf abs
SemEval-2023 Task 1: Visual Word Sense Disambiguation
Alessandro Raganato | Iacer Calixto | Asahi Ushio | Jose Camacho-Collados | Mohammad Taher Pilehvar

This paper presents the Visual Word Sense Disambiguation (Visual-WSD) task. The objective of Visual-WSD is to identify among a set of ten images the one that corresponds to the intended meaning of a given ambiguous word which is accompanied with minimal context. The task provides datasets for three different languages: English, Italian, and Farsi.We received a total of 96 different submissions. Out of these, 40 systems outperformed a strong zero-shot CLIP-based baseline. Participating systems proposed different zero- and few-shot approaches, often involving generative models and data augmentation. More information can be found on the task’s website: \url{https://raganato.github.io/vwsd/}.

Intimacy is an important social aspect of language. Computational modeling of intimacy in language could help many downstream applications like dialogue systems and offensiveness detection. Despite its importance, resources and approaches on modeling textual intimacy remain rare. To address this gap, we introduce MINT, a new Multilingual intimacy analysis dataset covering 13,372 tweets in 10 languages including English, French, Spanish, Italian, Portuguese, Korean, Dutch, Chinese, Hindi, and Arabic along with SemEval 2023 Task 9: Multilingual Tweet Intimacy Analysis. Our task attracted 45 participants from around the world. While the participants are able to achieve overall good performance on languages in the training set, zero-shot prediction of intimacy in unseen languages remains challenging. Here we provide an overview of the task, summaries of the common approaches, and potential future directions on modeling intimacy across languages. All the relevant resources are available at https: //sites.google.com/umich.edu/ semeval-2023-tweet-intimacy.

pdf abs
SemEval-2023 Task 2: Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2)
Besnik Fetahu | Sudipta Kar | Zhiyu Chen | Oleg Rokhlenko | Shervin Malmasi

We present the findings of SemEval-2023 Task 2 on Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2). Divided into 13 tracks, the task focused on methods to identify complex fine-grained named entities (like WRITTENWORK, VEHICLE, MUSICALGRP) across 12 languages, in both monolingual and multilingual scenarios, as well as noisy settings. The task used the MultiCoNER V2 dataset, composed of 2.2 million instances in Bangla, Chinese, English, Farsi, French, German, Hindi, Italian., Portuguese, Spanish, Swedish, and Ukrainian. MultiCoNER 2 was one of the most popular tasks of SemEval-2023. It attracted 842 submissions from 47 teams, and 34 teams submitted system papers. Results showed that complex entity types such as media titles and product names were the most challenging. Methods fusing external knowledge into transformer models achieved the best performance, and the largest gains were on the Creative Work and Group classes, which are still challenging even with external knowledge. Some fine-grained classes proved to be more challenging than others, such as SCIENTIST, ARTWORK, and PRIVATECORP. We also observed that noisy data has a significant impact on model performance, with an average drop of 10% on the noisy subset. The task highlights the need for future research on improving NER robustness on noisy data containing complex entities.

pdf abs
SemEval-2023 Task 8: Causal Medical Claim Identification and Related PIO Frame Extraction from Social Media Posts
Vivek Khetan | Somin Wadhwa | Byron Wallace | Silvio Amir

Identification of medical claims from user-generated text data is an onerous but essential step for various tasks including content moderation, and hypothesis generation. SemEval-2023 Task 8 is an effort towards building those capabilities and motivating further research in this direction. This paper summarizes the details and results of shared task 8 at SemEval-2023 which involved identifying causal medical claims and extracting related Populations, Interventions, and Outcomes (“PIO”) frames from social media (Reddit) text. This shared task comprised two subtasks: (1) Causal claim identification; and (2) PIO frame extraction. In total, seven teams participated in the task. Of the seven, six provided system descriptions which we summarize here. For the first subtask, the best approach yielded a macro-averaged F-1 score of 78.40, and for the second subtask, the best approach achieved token-level F-1 scores of 40.55 for Populations, 49.71 for Interventions, and 30.08 for Outcome frames.

pdf abs
SemEval-2023 Task 5: Clickbait Spoiling
Maik Fröbe | Benno Stein | Tim Gollub | Matthias Hagen | Martin Potthast

In this overview paper, we report on the second PAN~Clickbait Challenge hosted as Task~5 at SemEval~2023. The challenge’s focus is to better support social media users by automatically generating short spoilers that close the curiosity gap induced by a clickbait post. We organized two subtasks: (1) spoiler type classification to assess what kind of spoiler a clickbait post warrants (e.g., a phrase), and (2) spoiler generation to generate an actual spoiler for a clickbait post.

Argumentation is ubiquitous in natural language communication, from politics and media to everyday work and private life. Many arguments derive their persuasive power from human values, such as self-directed thought or tolerance, albeit often implicitly. These values are key to understanding the semantics of arguments, as they are generally accepted as justifications for why a particular option is ethically desirable. Can automated systems uncover the values on which an argument draws? To answer this question, 39 teams submitted runs to ValueEval’23. Using a multi-sourced dataset of over 9K arguments, the systems achieved F1-scores up to 0.87 (nature) and over 0.70 for three more of 20 universal value categories. However, many challenges remain, as evidenced by the low peak F1-score of 0.39 for stimulation, hedonism, face, and humility.

NLP datasets annotated with human judgments are rife with disagreements between the judges. This is especially true for tasks depending on subjective judgments such as sentiment analysis or offensive language detection. Particularly in these latter cases, the NLP community has come to realize that the common approach of reconciling’ these different subjective interpretations risks misrepresenting the evidence. Many NLP researchers have therefore concluded that rather than eliminating disagreements from annotated corpora, we should preserve themindeed, some argue that corpora should aim to preserve all interpretations produced by annotators. But this approach to corpus creation for NLP has not yet been widely accepted. The objective of the Le-Wi-Di series of shared tasks is to promote this approach to developing NLP models by providing a unified framework for training and evaluating with such datasets. We report on the second such shared task, which differs from the first edition in three crucial respects: (i) it focuses entirely on NLP, instead of both NLP and computer vision tasks in its first edition; (ii) it focuses on subjective tasks, instead of covering different types of disagreements as training with aggregated labels for subjective NLP tasks is in effect a misrepresentation of the data; and (iii) for the evaluation, we concentrated on soft approaches to evaluation. This second edition of Le-Wi-Di attracted a wide array of partici- pants resulting in 13 shared task submission papers.

We present the first Africentric SemEval Shared task, Sentiment Analysis for African Languages (AfriSenti-SemEval) - The dataset is available at https://github.com/afrisenti-semeval/afrisent-semeval-2023. AfriSenti-SemEval is a sentiment classification challenge in 14 African languages: Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorb (Muhammad et al., 2023), using data labeled with 3 sentiment classes. We present three subtasks: (1) Task A: monolingual classification, which received 44 submissions; (2) Task B: multilingual classification, which received 32 submissions; and (3) Task C: zero-shot classification, which received 34 submissions. The best performance for tasks A and B was achieved by NLNDE team with 71.31 and 75.06 weighted F1, respectively. UCAS-IIE-NLP achieved the best average score for task C with 58.15 weighted F1. We describe the various approaches adopted by the top 10 systems and their approaches.

pdf abs
ITTC at SemEval 2023-Task 7: Document Retrieval and Sentence Similarity for Evidence Retrieval in Clinical Trial Data
Rahmad Mahendra | Damiano Spina | Karin Verspoor

This paper describes the submissions of the Natural Language Processing (NLP) team from the Australian Research Council Industrial Transformation Training Centre (ITTC) for Cognitive Computing in Medical Technologies to the SemEval 2023 Task 7, i.e., multi-evidence natural language inference for clinical trial data (NLI4CT). More specifically, we were working on subtask 2 whose objective is to identify the relevant parts of the premise from clinical trial report that justify the truth of information in the statement. We approach the evidence retrieval problem as a document retrieval and sentence similarity task. Our results show that the task poses some challenges which involve dealing with complex sentences and implicit evidences.

pdf abs
SemEval-2023 Task 3: Detecting the Category, the Framing, and the Persuasion Techniques in Online News in a Multi-lingual Setup
Jakub Piskorski | Nicolas Stefanovitch | Giovanni Da San Martino | Preslav Nakov

We describe SemEval-2023 task 3 on Detecting the Category, the Framing, and the Persuasion Techniques in Online News in a Multilingual Setup: the dataset, the task organization process, the evaluation setup, the results, and the participating systems. The task focused on news articles in nine languages (six known to the participants upfront: English, French, German, Italian, Polish, and Russian), and three additional ones revealed to the participants at the testing phase: Spanish, Greek, and Georgian). The task featured three subtasks: (1) determining the genre of the article (opinion, reporting, or satire), (2) identifying one or more frames used in an article from a pool of 14 generic frames, and (3) identify the persuasion techniques used in each paragraph of the article, using a taxonomy of 23 persuasion techniques. This was a very popular task: a total of 181 teams registered to participate, and 41 eventually made an official submission on the test set.

In populous countries, pending legal cases have been growing exponentially. There is a need for developing NLP-based techniques for processing and automatically understanding legal documents. To promote research in the area of Legal NLP we organized the shared task LegalEval - Understanding Legal Texts at SemEval 2023. LegalEval task has three sub-tasks: Task-A (Rhetorical Roles Labeling) is about automatically structuring legal documents into semantically coherent units, Task-B (Legal Named Entity Recognition) deals with identifying relevant entities in a legal document and Task-C (Court Judgement Prediction with Explanation) explores the possibility of automatically predicting the outcome of a legal case along with providing an explanation for the prediction. In total 26 teams (approx. 100 participants spread across the world) submitted systems paper. In each of the sub-tasks, the proposed systems outperformed the baselines; however, there is a lot of scope for improvement. This paper describes the tasks, and analyzes techniques proposed by various teams.