Munief Hassan Tahir

2026

Instruction-Tuned Urdu LLMs: Efficient Adaptation of Llama Models and Evaluation Resources for Urdu
Munief Hassan Tahir | Sana Shams | Sarmad Hussain | Miriam Butt
Proceedings of the Fifteenth Language Resources and Evaluation Conference

This paper presents UrduLLaMA 1.1 and UrduLLaMA 1.1 Tiny, two instruction-tuned large language models (LLMs) designed to advance natural language processing for Urdu, a low-resource language with limited representation in multilingual corpora. The models are derived from Llama-3.1-8B-Instruct and Llama-3.2-3B-Instruct, respectively, through continual pretraining on 800 million diverse Urdu tokens curated from public and proprietary sources, followed by Supervised Fine-Tuning (SFT) using LoRA on 432K Urdu instructions spanning diverse NLP tasks. Rigorous evaluation across 14 culturally-specific domains using our novel Urdu LLM Evaluation Dataset shows that both models outperform their base models. UrduLLaMA 1.1 achieves 65.3 average accuracy (GPT-5 Nano evaluation), outperforming its Llama-3.1-8B-Instruct base (50.7) across all categories and surpassing Llama-3.3-70B-Instruct (62.7) in 8 out of 14 domains. UrduLLaMA 1.1 Tiny raises the average accuracy of its Llama-3.2-3B-Instruct base from 38.8 to 61.2. Human evaluation by native Urdu linguists confirms these gains (3.51/5 vs. 2.61/5 base). Our results validate targeted adaptation strategies combining continual pretraining with instruction tuning as computationally efficient solutions for low-resource languages, enabling state-of-the-art Urdu LLMs on accessible hardware.
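As a rough illustration of why LoRA keeps the SFT stage inexpensive, the sketch below counts trainable parameters for one linear layer under full fine-tuning and under a low-rank adapter. The 4096 x 4096 layer shape and the rank of 16 are illustrative assumptions, not values reported in the abstract, and the function names are hypothetical.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a LoRA adapter on the same layer.

    LoRA freezes the base weight and learns two small matrices,
    B (d_out x rank) and A (rank x d_in), so the count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in = d_out = 4096  # assumed layer shape, for illustration only
    rank = 16            # assumed LoRA rank, not taken from the paper
    full = full_params(d_in, d_out)
    lora = lora_params(d_in, d_out, rank)
    print(f"full fine-tuning: {full} trainable weights")
    print(f"LoRA (rank {rank}): {lora} trainable weights")
    print(f"fraction trained: {lora / full:.4%}")
```

Under these assumed values the adapter trains well under one percent of the layer's weights, which is the general reason LoRA suits limited hardware.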

Building and Evaluating a High Quality Parallel Corpus for English Urdu Low Resource Machine Translation
Munief Hassan Tahir | Hunain Azam | Sana Shams | Sarmad Hussain
Proceedings for the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026)

Low-resource languages like Urdu suffer from limited high-quality parallel data for machine translation. We introduce a curated English–Urdu corpus of 80,749 high-fidelity sentence pairs across 18 diverse domains, built via ethical collection, manual alignment, deduplication, and strict length-based filtering (AWCD ≤ 5). The corpus is converted into a bidirectional SFT dataset with bilingual (English/Urdu) instructions to enhance prompt-language robustness. Fine-tuning Llama-3.1-8B-Instruct (Llama-FT) and UrduLlama 1.1 (UrduLlama-FT) yields major gains over the baseline. sacreBLEU scores reach 24.65–25.24 (En→Ur) and 76.14–77.97 (Ur→En) for Llama-FT, with minimal sensitivity to prompt language. Blind human evaluation on 90 sentences per direction confirms substantial perceptual improvements. Results demonstrate the value of clean parallel data and bilingual instruction tuning, revealing complementary benefits of general SFT versus Urdu-specific pretraining. This work provides a reproducible resource and pipeline to advance machine translation for Urdu and similar low-resource languages.
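The deduplication, length-based filtering, and bidirectional SFT conversion described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the abstract does not expand AWCD, so the code assumes it is the per-pair absolute difference in whitespace-delimited word counts, and the instruction strings, record fields, and function names are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

MAX_AWCD = 5  # threshold quoted in the abstract (AWCD <= 5)


def word_count_difference(english: str, urdu: str) -> int:
    """Assumed reading of AWCD: absolute difference in word counts."""
    return abs(len(english.split()) - len(urdu.split()))


def filter_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop empty pairs, exact duplicates, and pairs over the length threshold."""
    seen = set()
    kept = []
    for english, urdu in pairs:
        pair = (english.strip(), urdu.strip())
        if not pair[0] or not pair[1] or pair in seen:
            continue
        seen.add(pair)
        if word_count_difference(*pair) <= MAX_AWCD:
            kept.append(pair)
    return kept


# Hypothetical instruction templates, one per direction and prompt language.
INSTRUCTIONS: Dict[str, Dict[str, str]] = {
    "en2ur": {
        "en": "Translate the following sentence into Urdu.",
        "ur": "درج ذیل جملے کا اردو میں ترجمہ کریں۔",
    },
    "ur2en": {
        "en": "Translate the following sentence into English.",
        "ur": "درج ذیل جملے کا انگریزی میں ترجمہ کریں۔",
    },
}


def to_sft_records(pairs: Iterable[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Emit both translation directions, each with English and Urdu prompts."""
    records = []
    for english, urdu in pairs:
        for prompt_lang in ("en", "ur"):
            records.append({
                "instruction": INSTRUCTIONS["en2ur"][prompt_lang],
                "input": english,
                "output": urdu,
            })
            records.append({
                "instruction": INSTRUCTIONS["ur2en"][prompt_lang],
                "input": urdu,
                "output": english,
            })
    return records
```

In this sketch each surviving sentence pair yields four training records (two directions times two prompt languages), which is one way to expose a model to both instruction languages for the same content.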

2025

Benchmarking the Performance of Pre-trained LLMs across Urdu NLP Tasks
Munief Hassan Tahir | Sana Shams | Layba Fiaz | Farah Adeeba | Sarmad Hussain
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)

Large Language Models (LLMs) pre-trained on multilingual data have revolutionized natural language processing research by shifting from language- and task-specific model pipelines to a single model adapted to a variety of tasks. However, the majority of existing multilingual NLP benchmarks for LLMs provide evaluation data in only a few languages with little linguistic diversity. In addition, these benchmarks lack quality assessment against the respective state-of-the-art models. This study presents an in-depth examination of 7 prominent LLMs: GPT-3.5-turbo, Llama 2-7B-Chat, Llama 3.1-8B, Bloomz 3B, Bloomz 7B1, Ministral-8B, and Whisper (large, medium, and small variants), across 17 tasks using 22 datasets and 13.8 hours of speech in a zero-shot setting, and compares their performance against state-of-the-art (SOTA) models. Our experiments show that SOTA models currently outperform encoder-decoder models in the majority of Urdu NLP tasks under zero-shot settings. However, comparing Llama 3.1-8B with its predecessor Llama 2-7B-Chat, we can deduce that with improved language coverage, LLMs can surpass these SOTA models. Our results emphasize that models with fewer parameters but richer language-specific data, like Llama 3.1-8B, often outperform larger models with lower language diversity, such as GPT-3.5, in several tasks.

Co-authors

Layba Fiaz 1
