Erhan Sezerer


2024

MultiPICo: Multilingual Perspectivist Irony Corpus
Silvia Casola | Simona Frenda | Soda Lo | Erhan Sezerer | Antonio Uva | Valerio Basile | Cristina Bosco | Alessandro Pedrani | Chiara Rubagotti | Viviana Patti | Davide Bernardi
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recently, several scholars have contributed to the growth of a new theoretical framework in NLP called perspectivism. This approach aims to leverage data annotated by different individuals to model diverse perspectives that affect their opinions on subjective phenomena such as irony. In this context, we propose MultiPICo, a multilingual perspectivist corpus of ironic short conversations in different languages and linguistic varieties extracted from Twitter and Reddit. The corpus includes sociodemographic information about its annotators. Our analysis of the annotated corpus shows how different demographic cohorts may significantly disagree on their annotation of irony and how certain cultural factors influence the perception of the phenomenon and the agreement on the annotation. Moreover, we show how disaggregated annotations and rich annotator metadata can be exploited to benchmark the ability of large language models to recognize irony, their positionality with respect to sociodemographic groups, and the efficacy of perspective-taking prompting for irony detection in multiple languages.

2019

A Turkish Dataset for Gender Identification of Twitter Users
Erhan Sezerer | Ozan Polatbilek | Selma Tekir
Proceedings of the 13th Linguistic Annotation Workshop

Author profiling is the identification of an author’s gender, age, and language from his/her texts. With the increasing trend of using Twitter as a means to express thought, profiling the gender of an author from his/her tweets has become a challenge. Although several datasets in different languages have been released on this problem, there is still a need for multilingualism. In this work, we propose a dataset of tweets from Turkish Twitter users, each labeled with the user’s gender. The dataset has 3368 users in the training set and 1924 users in the test set, with 100 tweets per user. The dataset is publicly available.