Multimodal Transformer Framework for Multilingual Harmful Meme Classification

Charmathi Rajkumar, Malliga Subramanian, Bharathi Raja Chakravarthi


Abstract
The rapid expansion of social media platforms has led to a significant increase in the spread of harmful content, including misogynistic, homophobic, and transphobic memes. Detecting such content is challenging because memes often combine textual and visual elements and frequently appear in multilingual and culturally diverse contexts. This study proposes a multimodal transformer-based framework for multilingual harmful meme classification that integrates textual and visual representations to improve detection performance. The proposed architecture employs XLM-RoBERTa for multilingual text encoding and the Swin Transformer for hierarchical visual feature extraction. A cross-attention fusion mechanism is introduced to enable meaningful interaction between textual and visual modalities. The fused representation is then processed through a classification layer to perform multi-class prediction. Experiments are conducted across multiple datasets covering eight languages and three harmful content categories: misogyny, homophobia/transphobia, and hate speech. The model is evaluated using the macro-F1 score and demonstrates consistent improvements over baseline multimodal systems across both high-resource and low-resource languages. The results highlight the effectiveness of transformer-based multimodal architectures in capturing implicit and contextual harmful signals present in memes. The study contributes to the development of robust multilingual systems for harmful content detection and supports efforts toward creating safer and more inclusive online environments.
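The macro-F1 score named in the abstract is the unweighted mean of the per-class F1 scores, so a rare harmful class counts as much as a frequent one. The following is a minimal sketch of that metric only, not the authors' evaluation code; the function name `macro_f1` and the example labels are hypothetical and chosen for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Per-class true positives, false positives, and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the gold label was different
            fn[t] += 1  # gold class t was missed

    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)

    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical gold and predicted labels (not taken from the paper's datasets).
y_true = ["misogyny", "none", "none", "hate", "none", "misogyny"]
y_pred = ["misogyny", "none", "hate", "hate", "none", "none"]
score = macro_f1(y_true, y_pred)
print(f"macro-F1 = {score:.4f}")
```

Because every class contributes equally to the average, a model that ignores a small class is penalized even when its overall accuracy is high, which is why the metric is commonly used for imbalanced classification tasks.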
Anthology ID:
2026.ltedi-1.9
Volume:
Proceedings of the Sixth Workshop on Language Technology for Equality, Diversity, Inclusion
Month:
July
Year:
2026
Address:
Virtual (Online)
Editors:
Bharathi Raja Chakravarthi, Bharathi B, Paul Buitelaar, Durairaj Thenmozhi, Miguel Ángel García Cumbreras, Salud María Jiménez Zafra
Venues:
LTEDI | WS
Publisher:
Association for Computational Linguistics
Pages:
99–107
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.ltedi-1.9/
Cite (ACL):
Charmathi Rajkumar, Malliga Subramanian, and Bharathi Raja Chakravarthi. 2026. Multimodal Transformer Framework for Multilingual Harmful Meme Classification. In Proceedings of the Sixth Workshop on Language Technology for Equality, Diversity, Inclusion, pages 99–107, Virtual (Online). Association for Computational Linguistics.
Cite (Informal):
Multimodal Transformer Framework for Multilingual Harmful Meme Classification (Rajkumar et al., LTEDI 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.ltedi-1.9.pdf