Mona Abdelazim

Also published as: Mona Azim


2026

Detecting polarization in online discourse is important for understanding social fragmentation, yet it remains difficult for Arabic due to dialect variation, informal writing, and implicit framing. In this paper, we study Arabic polarization modeling in the SemEval-2026 Task 9 (POLAR) setting, focusing on polarization detection (ST1) and polarization type classification (ST2). We compare three approaches: encoder fine-tuning, zero-shot prompting, and retrieval-augmented in-context learning (RAG-ICL), across six Arabic encoders and different LLMs. For ST1, RAG-ICL with Gemma-3-27b-it achieves the best result (test macro F1 = 0.83), narrowly ahead of the best fine-tuned encoder (0.82) and substantially outperforming zero-shot prompting. For ST2, a pipeline that first applies the best ST1 encoder as a hard filter and then performs RAG-ICL achieves macro F1 = 0.62. Prompt-language effects are model- and task-dependent, with some settings doing better with English prompts and others with Arabic prompts. Chain-of-thought, self-refinement, and contrastive prompting do not outperform standard RAG-ICL.
Encoder-only transformer models remain widely used for discriminative NLP tasks, yet recent architectural advances have largely focused on English. In this work, we present AraModernBERT, an adaptation of the ModernBERT encoder architecture to Arabic, and study the impact of transtokenized embedding initialization and native long-context modeling up to 8,192 tokens. We show that transtokenization is essential for Arabic language modeling, yielding dramatic improvements in masked language modeling performance compared to non-transtokenized initialization. We further demonstrate that AraModernBERT supports stable and effective long-context modeling, achieving improved intrinsic language modeling performance at extended sequence lengths. Downstream evaluations on Arabic natural language understanding tasks, including inference, offensive language detection, question-question similarity, and named entity recognition, confirm strong transfer to discriminative and sequence labeling settings. Our results highlight practical considerations for adapting modern encoder architectures to Arabic and other languages written in Arabic-derived scripts.