Mouleeshuwarapprabu R


2026

The automated detection of LGBTQ+ phobia in social media memes is essential for fostering inclusive digital environments, yet it remains challenging because of the complex interplay of visual metaphors and multilingual text. We participated in the "Homophobia and Transphobia Meme Classification" shared task at LT-EDI 2026, evaluating a multimodal architecture on the English, Hindi, and Chinese tracks. Our system employs a late-fusion strategy: XLM-RoBERTa encodes the OCR-extracted text into a representation $h_t \in \mathbb{R}^{768}$, while CLIP extracts visual features $h_v \in \mathbb{R}^{512}$. These are concatenated into a joint vector $z = [h_t \oplus h_v] \in \mathbb{R}^{1280}$, which is passed through a non-linear multilayer perceptron to capture cross-modal interactions. The system was competitive in the high-resource settings, ranking 3rd in the Chinese track (Macro F1: 0.7371) and 4th in the English track (Macro F1: 0.6121). In contrast, the Hindi track result (Macro F1: 0.1616) revealed significant challenges related to script complexity and class imbalance. These findings underscore the effectiveness of general-purpose pretrained transformer models for multimodal reasoning, while highlighting the ongoing need for specialized linguistic refinement in low-resource and diverse-script environments.
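
The late-fusion step described above can be illustrated with a minimal, dependency-free sketch. It is not the submitted system: the embeddings are random stand-ins for the XLM-RoBERTa (768-dimensional) and CLIP (512-dimensional) outputs, and the hidden width, number of classes, single hidden layer, ReLU activation, and all function names are assumptions chosen for illustration, since the abstract does not specify them.

```python
import math
import random

TEXT_DIM, IMAGE_DIM = 768, 512   # embedding sizes stated in the abstract
HIDDEN_DIM, NUM_CLASSES = 64, 2  # assumed values; not given in the abstract


def init_linear(n_in, n_out, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply y = Wx + b to a single input vector."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion_forward(h_t, h_v, hidden_layer, output_layer):
    """Concatenate text and image features, then apply a one-hidden-layer MLP."""
    z = h_t + h_v  # list concatenation, i.e. z = [h_t ⊕ h_v] with 1280 entries
    return softmax(linear(relu(linear(z, hidden_layer)), output_layer))


rng = random.Random(0)
hidden_layer = init_linear(TEXT_DIM + IMAGE_DIM, HIDDEN_DIM, rng)
output_layer = init_linear(HIDDEN_DIM, NUM_CLASSES, rng)

# Random stand-ins for the text and image embeddings of one meme.
h_t = [rng.gauss(0.0, 1.0) for _ in range(TEXT_DIM)]
h_v = [rng.gauss(0.0, 1.0) for _ in range(IMAGE_DIM)]

probs = late_fusion_forward(h_t, h_v, hidden_layer, output_layer)
print(len(h_t + h_v), probs)
```

In this arrangement the two encoders never see each other's input; any cross-modal interaction has to be learned by the MLP operating on the concatenated vector, which is what distinguishes late fusion from architectures that mix modalities inside the encoders.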