We present the first publicly available, multidimensional corpus of Qatari Arabic that captures intra-dialectal variation across Urban and Bedouin speakers. While often grouped under the label of “Gulf Arabic”, Qatari Arabic exhibits rich phonological, lexical, and discourse-level differences shaped by gender, age, and sociocultural identity. Our dataset includes aligned speech and transcriptions from 255 speakers, stratified by gender and age, and collected through structured interviews on culturally salient topics such as education, heritage, and social norms. The corpus reveals systematic variation in pronunciation, vocabulary, and narrative style, offering insights for both sociolinguistic analysis and computational modeling. We also demonstrate its utility through preliminary experiments in the prediction of dialects and genders. This work provides the first large-scale, demographically balanced corpus of Qatari Arabic, laying a foundation for both sociolinguistic research and the development of dialect-aware NLP systems.
In this paper, we present our approach for FIGNEWS Subtask 1, which focuses on detecting bias in news media narratives about the Israel war on Gaza. We used a Large Language Model (LLM) and prompt engineering, using GPT-3.5 Turbo API, to create a model that automatically flags biased news media content with 99% accuracy. This approach provides Natural Language Processing (NLP) researchers with a robust and effective solution for automating bias detection in news media narratives using supervised learning algorithms. Additionally, this paper provides a detailed analysis of the labeled content, offering valuable insights into media bias in conflict reporting. Our work advances automated content analysis and enhances understanding of media bias.
This paper introduces a cross-domain and multi-dialectal stance corpus for Arabic that includes four regions in the Arab World and covers the main Arabic dialect groups. Our corpus consists of 4657 sentences manually annotated with each sentence’s stance towards a specific topic. For each region, we collected sentences related to two controversial topics. We annotated each sentence by at least two annotators to indicate if its stance favors the topic, is against it, or is neutral. Our corpus is well-balanced concerning dialect and stance. Approximately half of the sentences are in Modern Standard Arabic (MSA) for each region, and the other half is in the region’s respective dialect. We conducted several machine-learning experiments for stance detection using our new corpus. Our most successful model is the Multi-Layer Perceptron (MLP), using Unigram or TF-IDF extracted features, which yielded an F1-score of 0.66 and an accuracy score of 0.66. Compared with the most similar state-of-the-art dataset, our dataset outperformed in specific stance classes, particularly “neutral” and “against”.