Mehrub Awan
2025
Slur and Emoji Aware Models for Hate and Sentiment Detection in Roman Urdu Transgender Discourse
Muhammad Owais Raza | Aqsa Umar | Mehrub Awan
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages
The rise of social media has amplified both the visibility and vulnerability of marginalized communities, particularly the transgender population in South Asia. While hate speech detection has seen considerable progress in high-resource languages like English, under-resourced and code-mixed languages such as Roman Urdu remain significantly understudied. This paper presents a novel Roman Urdu dataset derived from Instagram comments on transgender-related content, capturing the intricacies of multilingual, code-mixed, and emoji-laden social discourse. We introduce a transphobic slur lexicon specific to Roman Urdu and a semantic emoji taxonomy grounded in contextual usage. These resources are used to perform fine-grained classification of sentiment and hate speech using both traditional machine learning models and transformer-based architectures. The findings show that our custom-trained BERT-based models, Senti-RU-Bert and Hate-RU-Bert, achieve the best performance, with F1 scores of 80.39% for sentiment classification and 77.34% for hate speech classification. Ablation studies reveal consistent performance gains when slur and emoji features are included.