Muhammad Sohaib Ayub


2026

Medical AI systems increasingly rely on large language models (LLMs), yet their deployment in linguistically diverse regions remains underexplored. We address this gap by introducing U-MIRAGE, the first medical question-answering benchmark for Urdu and Roman Urdu. Urdu is the 11th most widely spoken language worldwide, with over 246 million speakers. Our systematic evaluation of six state-of-the-art LLMs reveals three main findings. (1) Performance drops by 6% to 10% when moving from English to the Urdu variants, even though medical knowledge should, in principle, transfer across languages. (2) Chain-of-Thought (CoT) prompting improves small models by 8% to 20%, while, surprisingly, it degrades larger models' performance by up to 3%. (3) Quantized small models fail catastrophically in low-resource languages, achieving near-random accuracy regardless of prompting strategy. These findings challenge core assumptions about multilingual medical AI systems. Roman Urdu consistently outperforms standard Urdu script, suggesting that orthographic alignment with pre-training data matters more than linguistic proximity, and CoT prompting effectiveness depends critically on model architecture rather than task complexity alone. Our contributions are threefold: (1) U-MIRAGE, (2) systematic benchmarking of LLMs for Urdu and Roman Urdu medical reasoning, and (3) an empirical analysis of CoT prompting in low-resource contexts. Our code and datasets are publicly available.
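The direct-versus-CoT comparison in this evaluation can be sketched in a few lines. The prompt templates and the answer-extraction pattern below are illustrative assumptions, not the benchmark's exact wording; a real run would send these prompts to each evaluated model.

```python
# Hedged sketch: direct vs Chain-of-Thought (CoT) prompting for a
# multiple-choice medical QA item, plus accuracy scoring. Templates and
# the 'Answer: X' extraction convention are assumptions for illustration.
import re

def build_direct_prompt(question: str, options: dict) -> str:
    opts = "\n".join(f"{k}. {v}" for k, v in options.items())
    return f"{question}\n{opts}\nAnswer with the option letter only."

def build_cot_prompt(question: str, options: dict) -> str:
    opts = "\n".join(f"{k}. {v}" for k, v in options.items())
    return (f"{question}\n{opts}\n"
            "Let's think step by step, then give the final option letter "
            "on a new line as 'Answer: X'.")

def extract_answer(completion: str):
    # Pull the final option letter out of a (possibly long) CoT completion.
    m = re.search(r"Answer:\s*([A-D])", completion)
    return m.group(1) if m else None

def accuracy(predictions: list, gold: list) -> float:
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

Comparing `accuracy` over the two prompt styles, per model and per language variant, is what yields the 8–20% small-model gains and the small large-model regressions reported above.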
Text classification in low-resource right-to-left languages faces significant challenges due to the scarcity of annotated data and the morphological richness of languages such as Arabic, Urdu, Sindhi, and Pashto. Arabic and Urdu alone are spoken by over 380 million and 246 million people worldwide, respectively, and Pashto is an official language of Afghanistan, highlighting the importance of effective language technologies. While multilingual Pre-trained Language Models (PLMs) have shown promising results, they typically require extensive labeled datasets and computationally expensive fine-tuning to perform well. Such limitations make these PLMs impractical for the low-resource settings described above. Therefore, we employ a few-shot strategy (0, 4, or 8 shots) to achieve results comparable to those of standard fine-tuning. In this work, we propose MaskedVerbalizer, a novel technique designed for few-shot text classification. Our method introduces an automatic verbalizer construction approach that generates class-specific label words in 4-shot settings, eliminating the need for extensive manual intervention. Despite maintaining a simple model architecture, MaskedVerbalizer achieves effective performance on classification benchmarks. Experimental results demonstrate that our method effectively addresses the core challenges of low-resource text classification, providing a practical, computationally efficient solution. We achieved accuracies of 90.43% and 92.72% with mBERT and XLM-RoBERTa, respectively, representing improvements of 25–30% over soft and automatic verbalizers. The code for MaskedVerbalizer is publicly available at https://github.com/Furqann-hue/MV.
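The verbalizer idea underlying this approach can be sketched as follows: a masked LM scores a [MASK] slot in a cloze template, and per-class label words map those token probabilities to class scores. The toy probability table stands in for a real mBERT/XLM-RoBERTa forward pass, and the classes and label words are illustrative, not those produced by the paper's 4-shot construction procedure.

```python
# Hedged sketch of verbalizer-based classification. A real pipeline would
# obtain mask_probs from a masked LM's softmax at the [MASK] position;
# here a toy dictionary stands in for that distribution.

def classify(mask_probs: dict, verbalizer: dict) -> str:
    # Average the MLM probability mass assigned to each class's label words.
    scores = {
        cls: sum(mask_probs.get(w, 0.0) for w in words) / len(words)
        for cls, words in verbalizer.items()
    }
    return max(scores, key=scores.get)

# Illustrative classes and label words (assumptions, not the learned ones).
verbalizer = {
    "sports":   ["match", "team", "player"],
    "politics": ["election", "minister", "party"],
}
# Toy distribution over the vocabulary at the [MASK] position.
mask_probs = {"match": 0.30, "team": 0.10, "election": 0.05, "party": 0.02}
```

The automatic construction step in MaskedVerbalizer would replace the hand-written `verbalizer` dictionary with label words derived from the 4-shot examples.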

2024

Cybercrime is a serious and growing threat affecting millions of people worldwide. Detecting cybercrimes from text messages is challenging, as it requires understanding the linguistic and cultural nuances of different languages and regions. Roman Urdu is widely used in Pakistan and other South Asian countries; however, it lacks sufficient resources and tools for natural language processing and cybercrime detection. To address this problem, we make three main contributions in this paper. (1) We create and release CRU, a benchmark dataset for text-based cybercrime detection in Roman Urdu, covering a range of cybercrimes as defined by the Prevention of Electronic Crimes Act (PECA) of Pakistan. This dataset is annotated by experts following a standardized procedure based on Pakistan's legal framework. (2) We perform experiments on four pre-trained language models (PLMs) for cybercrime text classification in Roman Urdu. Our results show that xlm-roberta-base is the best model for this task, achieving the highest performance on all metrics. (3) We explore the utility of prompt engineering techniques, namely prefix and cloze prompts, for enhancing the performance of PLMs on low-resource languages such as Roman Urdu. We analyze the impact of different prompt shapes and k-shot settings on the performance of xlm-roberta-base and bert-base-multilingual-cased. We find that prefix prompts are more effective than cloze prompts for Roman Urdu classification tasks, as they provide more contextually relevant completions for the models. Our work provides useful insights and resources for future research on cybercrime detection and text classification in low-resource languages.
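The two prompt shapes compared above can be sketched as template constructors. A cloze prompt embeds a [MASK] slot for a masked LM to fill, while a prefix prompt prepends k-shot demonstrations and lets the model complete the label. Both template strings are illustrative assumptions, not the paper's exact templates.

```python
# Hedged sketch of cloze vs prefix prompt construction for k-shot
# classification. Template wording and label names are assumptions.

def cloze_prompt(text: str) -> str:
    # Masked-LM style: the model fills the [MASK] slot with a label word.
    return f"{text} This message is about [MASK]."

def prefix_prompt(text: str, demonstrations: list) -> str:
    # k-shot style: demonstrations (text, label) precede the query.
    shots = "\n".join(f"Message: {t}\nLabel: {y}" for t, y in demonstrations)
    return f"{shots}\nMessage: {text}\nLabel:"
```

Varying the length of `demonstrations` (0, 4, 8 examples) and swapping the two constructors reproduces the prompt-shape and k-shot comparison described in contribution (3).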
In this era of multimedia dominance, the surge of multimodal content on social media has transformed our methods of communication and information exchange. With the widespread use of multimedia content, the ability to effectively summarize this multimodal content is crucial for enhancing consumption, searchability, and retrieval. The scarcity of suitable training datasets has been a barrier to research in this area, especially for low-resource languages like Urdu. To address this gap, this paper introduces "UrduMASD", a video-based Urdu multimodal abstractive text summarization dataset. The dataset contains 15,374 instances, each comprising a video, its audio, title, transcript, and a corresponding text summary. To ensure the quality of the dataset, intrinsic evaluation metrics such as Abstractivity, Compression, Redundancy, and Semantic coherence have been employed, and our dataset surpasses existing datasets on numerous key quality metrics. Additionally, we present baseline results achieved using both text-based and state-of-the-art multimodal summarization models. Adding visual information improved ROUGE scores by 2.6%, highlighting the efficacy of multimodal inputs for summarization. To the best of our knowledge, this is the first dataset in Urdu that provides video-based multimodal data for abstractive text summarization, making it a valuable resource for advancing research in this field.
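The ROUGE comparison behind the reported 2.6% gain can be illustrated with a simplified ROUGE-1 F-score. This is a bare unigram-overlap version for illustration only; the official ROUGE toolkit additionally applies tokenization rules, stemming, and multi-reference handling.

```python
# Hedged sketch: simplified ROUGE-1 F-score (unigram overlap) for
# comparing a generated summary against a reference summary.
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Scoring the outputs of the text-only and multimodal baselines with a metric of this form, and averaging over the test set, is how the relative improvement from visual features would be measured.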