Mehrub Awan

2025

Slur and Emoji Aware Models for Hate and Sentiment Detection in Roman Urdu Transgender Discourse
Muhammad Owais Raza | Aqsa Umar | Mehrub Awan
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages

The rise of social media has amplified both the visibility and vulnerability of marginalized communities, particularly the transgender population in South Asia. While hate speech detection has seen considerable progress in high-resource languages like English, under-resourced and code-mixed languages such as Roman Urdu remain significantly understudied. This paper presents a novel Roman Urdu dataset derived from Instagram comments on transgender-related content, capturing the intricacies of multilingual, code-mixed, and emoji-laden social discourse. We introduce a transphobic slur lexicon specific to Roman Urdu and a semantic emoji taxonomy grounded in contextual usage. These resources are utilized to perform fine-grained classification of sentiment and hate speech using both traditional machine learning models and transformer-based architectures. The findings show that our custom-trained BERT-based models, Senti-RU-Bert and Hate-RU-Bert, achieve the best performance, with F1 scores of 80.39% for sentiment classification and 77.34% for hate speech classification. Ablation studies reveal consistent performance gains when slur and emoji features are included.

Co-authors

Muhammad Owais Raza 1
Aqsa Umar 1

Venues

lowresnlp 1
ws 1
