Sana Shams
2026
Instruction-Tuned Urdu LLMs: Efficient Adaptation of Llama Models and Evaluation Resources for Urdu
Munief Hassan Tahir | Sana Shams | Sarmad Hussain | Miriam Butt
Proceedings of the Fifteenth Language Resources and Evaluation Conference
This paper presents UrduLLaMA 1.1 and UrduLLaMA 1.1 Tiny, two instruction-tuned large language models (LLMs) designed to advance natural language processing for Urdu, a low-resource language with limited representation in multilingual corpora. The models are derived from the Llama-3.1-8B-Instruct and Llama-3.2-3B-Instruct architectures, respectively, by conducting continual pretraining on 800 million diverse Urdu tokens curated from public and proprietary sources, followed by Supervised Fine-Tuning (SFT) using LoRA on 432K Urdu instructions spanning diverse NLP tasks. Rigorous evaluation across 14 culturally specific domains using our novel Urdu LLM Evaluation Dataset demonstrates superior performance. UrduLLaMA 1.1 achieves an average accuracy of 65.3 (GPT-5 Nano evaluation), outperforming its Llama-3.1-8B-Instruct base (50.7) across all categories and surpassing Llama-3.3-70B-Instruct (62.7) in 8 out of 14 domains. UrduLLaMA 1.1 Tiny lifts Llama-3.2-3B-Instruct from 38.8 to 61.2. Human evaluation by native Urdu linguists confirms these gains (3.51/5 vs. 2.61/5 for the base model). Our results validate targeted adaptation strategies combining continual pretraining with instruction tuning as computationally efficient solutions for low-resource languages, enabling state-of-the-art Urdu LLMs on accessible hardware.
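A minimal sketch of the LoRA-based SFT step the abstract describes, using the Hugging Face transformers and peft libraries; the LoRA hyperparameters and target modules below are illustrative assumptions, not the paper's reported configuration.

```python
# LoRA fine-tuning sketch (illustrative; rank, alpha, and target modules are
# assumptions, not the configuration reported in the paper).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Attach low-rank adapters to the attention projections; only these small
# adapter matrices are trained, which keeps memory needs modest.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # a small fraction of the 8B base weights
```

Because only the adapter weights are updated, instruction tuning of this kind is what makes adapting an 8B model feasible on the "accessible hardware" the abstract refers to.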
Building and Evaluating a High Quality Parallel Corpus for English Urdu Low Resource Machine Translation
Munief Hassan Tahir | Hunain Azam | Sana Shams | Sarmad Hussain
Proceedings of the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026)
Low-resource languages like Urdu suffer from limited high-quality parallel data for machine translation. We introduce a curated English–Urdu corpus of 80,749 high-fidelity sentence pairs across 18 diverse domains, built via ethical collection, manual alignment, deduplication, and strict length-based filtering (AWCD ≤ 5). The corpus is converted into a bidirectional SFT dataset with bilingual (English/Urdu) instructions to enhance prompt-language robustness. Fine-tuning Llama-3.1-8B-Instruct (Llama-FT) and UrduLlama 1.1 (UrduLlama-FT) yields major gains over the baseline: sacreBLEU scores reach 24.65–25.24 (En→Ur) and 76.14–77.97 (Ur→En) for Llama-FT, with minimal sensitivity to prompt language. Blind human evaluation on 90 sentences per direction confirms substantial perceptual improvements. The results demonstrate the value of clean parallel data and bilingual instruction tuning, revealing the complementary benefits of general SFT versus Urdu-specific pretraining. This work provides a reproducible resource and pipeline to advance machine translation for Urdu and similar low-resource languages.
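A small sketch of the length-based filtering step mentioned above; reading AWCD as the absolute word-count difference between the two sides of a sentence pair is an assumption about the paper's metric, and the sample pairs are invented for illustration.

```python
# Length-based filtering sketch for parallel sentence pairs (illustrative;
# "AWCD" is interpreted here as the absolute word-count difference between
# the English and Urdu sides, which is an assumption, not the paper's
# documented definition).
def awcd(src: str, tgt: str) -> int:
    return abs(len(src.split()) - len(tgt.split()))

def filter_pairs(pairs, max_awcd: int = 5):
    """Keep only pairs whose word-count difference is within the threshold."""
    return [(en, ur) for en, ur in pairs if awcd(en, ur) <= max_awcd]

pairs = [
    ("The weather is pleasant today.", "آج موسم خوشگوار ہے۔"),          # AWCD = 1, kept
    ("Hello.", "السلام علیکم، آپ سب کا بہت بہت شکریہ کہ آپ یہاں تشریف لائے۔"),  # AWCD = 12, dropped
]
print(filter_pairs(pairs))
```

A filter like this cheaply discards misaligned pairs, since a large length mismatch usually signals that the two sides are not true translations of each other.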
Scripting History: A Diachronic Urdu Text and Image Corpus from the 18th to 19th Centuries
Sana Shams | Sahar Rauf | Asad Mustafa | Muhammad Zeeshan Javed | Qurat-ul-Ain Akram | Sarmad Hussain | Miriam Butt
Proceedings of the Fifteenth Language Resources and Evaluation Conference
This paper presents the Diachronic Urdu Text and Image Corpus, a one-million-word resource covering Urdu’s development across the 18th and 19th centuries. The corpus is compiled from 328 printed books published between 1800 and 1950, representing a diverse range of genres, authors, and publishers. A 140,000-word sub-corpus has been manually annotated with Urdu part-of-speech tags to facilitate linguistic and computational analysis. The dataset enables systematic investigation of historical changes in Urdu orthography, morphology, and syntax, providing new insights into the language’s history and standardization. To preserve the original printed form, each text is paired with its corresponding page image, creating the first multimodal diachronic corpus for Urdu. The paper outlines the corpus compilation pipeline, digitization workflow, text-image alignment, and annotation strategy designed to ensure accuracy, consistency, and authenticity. This multimodal Urdu diachronic corpus establishes a benchmark for research in computational linguistics, digital humanities, and South Asian language technology, supporting corpus-based exploration of Urdu’s linguistic history and cultural heritage.
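A brief sketch of how the 140,000-word POS-annotated sub-corpus might be consumed; the token-TAB-tag, blank-line-per-sentence layout and the file name are assumed for illustration, not the corpus's documented format.

```python
# Sketch: reading a POS-annotated file and counting tag frequencies
# (the token<TAB>tag layout with blank lines between sentences is an
# assumed format, not the corpus's documented one).
from collections import Counter

def read_pos_file(path: str):
    sentences, current = [], []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:  # a blank line ends the current sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, tag = line.split("\t")
            current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences

sents = read_pos_file("diachronic_urdu_pos.txt")  # hypothetical file name
tag_freq = Counter(tag for sent in sents for _, tag in sent)
print(tag_freq.most_common(10))  # e.g., to compare tag distributions by period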
2025
Benchmarking the Performance of Pre-trained LLMs across Urdu NLP Tasks
Munief Hassan Tahir | Sana Shams | Layba Fiaz | Farah Adeeba | Sarmad Hussain
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)
Large Language Models (LLMs) pre-trained on multilingual data have revolutionized natural language processing research by transitioning from language- and task-specific model pipelines to a single model adapted to a variety of tasks. However, the majority of existing multilingual NLP benchmarks for LLMs provide evaluation data in only a few languages, with little linguistic diversity. In addition, these benchmarks lack quality assessment against the respective state-of-the-art models. This study presents an in-depth examination of 7 prominent LLMs: GPT-3.5-turbo, Llama 2-7B-Chat, Llama 3.1-8B, Bloomz 3B, Bloomz 7B1, Ministral-8B, and Whisper (large, medium, and small variants) across 17 tasks using 22 datasets and 13.8 hours of speech in a zero-shot setting, comparing and analyzing their performance against state-of-the-art (SOTA) models. Our experiments show that SOTA models currently outperform encoder-decoder models in the majority of Urdu NLP tasks under zero-shot settings. However, comparing Llama 3.1-8B with its predecessor Llama 2-7B-Chat, we can deduce that with improved language coverage, LLMs can surpass these SOTA models. Our results emphasize that models with fewer parameters but richer language-specific data, like Llama 3.1-8B, often outperform larger models with lower language diversity, such as GPT-3.5, in several tasks.
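A minimal sketch of the zero-shot setting used in benchmarks of this kind, shown for one illustrative task (sentiment classification); the prompt wording and label set are assumptions, not the paper's actual templates.

```python
# Zero-shot prompting sketch (the prompt and labels are illustrative,
# not the templates used in the paper).
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

def zero_shot_sentiment(urdu_sentence: str) -> str:
    prompt = (
        "Classify the sentiment of the following Urdu sentence as "
        "positive or negative. Answer with one word.\n"
        f"Sentence: {urdu_sentence}\nSentiment:"
    )
    out = generator(prompt, max_new_tokens=5, do_sample=False)
    # generated_text includes the prompt, so keep only the continuation
    return out[0]["generated_text"][len(prompt):].strip()

print(zero_shot_sentiment("یہ فلم بہت شاندار تھی۔"))  # expected: positive
```

The key point of the zero-shot protocol is that the model sees only an instruction and the input, with no task-specific examples or fine-tuning, so the score reflects how much of the task the pre-trained model already covers.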
A Brief Overview of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL)
Kengatharaiyer Sarveswaran | Surendrabikram Thapa | Sana Shams | Ashwini Vaidya | Bal Krishna Bal
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)
In this paper, we provide a brief summary of the inaugural workshop on Challenges in Processing South Asian Languages (CHiPSAL) held as part of COLING 2025. The workshop included regular papers, invited keynotes, and shared task papers, fostering a collaborative platform for exploring challenges in processing South Asian languages. The shared task focused on Devanagari-script language understanding, encompassing subtasks on language identification, hate speech detection, and target classification. This workshop series aims to address linguistic and cultural nuances, resource constraints, and orthographic complexities in low-resource South Asian languages while advancing NLP research and promoting multilingual inclusivity.