Asad Mustafa
2026
Scripting History: A Diachronic Urdu Text and Image Corpus from the 18Th to 19Th Centuries
Sana Shams | Sahar Rauf | Asad Mustafa | Muhammad Zeeshan Javed | Qurat-ul-Ain Akram | Sarmad Hussain | Miriam Butt
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Sana Shams | Sahar Rauf | Asad Mustafa | Muhammad Zeeshan Javed | Qurat-ul-Ain Akram | Sarmad Hussain | Miriam Butt
Proceedings of the Fifteenth Language Resources and Evaluation Conference
This paper presents the Diachronic Urdu Text and Image Corpus, a one-million-word resource covering Urdu’s development across the 18th and 19th centuries. The corpus is compiled from 328 printed books published between 1800 and 1950, representing a diverse range of genres, authors, and publishers. A 140,000-word sub-corpus has been manually annotated with Urdu part-of-speech tags to facilitate linguistic and computational analysis. The dataset enables systematic investigation of historical changes in Urdu orthography, morphology, and syntax, providing new insights into the language’s history and standardization. To preserve the original printed form, each text is paired with its corresponding page image, creating the first multimodal diachronic corpus for Urdu. The paper outlines the corpus compilation pipeline, digitization workflow, text-image alignment, and annotation strategy designed to ensure accuracy, consistency, and authenticity. This multimodal Urdu diachronic corpus establishes a benchmark for research in computational linguistics, digital humanities, and South Asian language technology, supporting corpus-based exploration of Urdu’s linguistic history and cultural heritage.
2014
The CLE Urdu POS Tagset
Saba Urooj | Sarmad Hussain | Asad Mustafa | Rahila Parveen | Farah Adeeba | Tafseer Ahmed Khan | Miriam Butt | Annette Hautli
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Saba Urooj | Sarmad Hussain | Asad Mustafa | Rahila Parveen | Farah Adeeba | Tafseer Ahmed Khan | Miriam Butt | Annette Hautli
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The paper presents a design schema and details of a new Urdu POS tagset. This tagset is designed due to challenges encountered in working with existing tagsets for Urdu. It uses tags that judiciously incorporate information about special morpho-syntactic categories found in Urdu. With respect to the overall naming schema and the basic divisions, the tagset draws on the Penn Treebank and a Common Tagset for Indian Languages. The resulting CLE Urdu POS Tagset consists of 12 major categories with subdivisions, resulting in 32 tags. The tagset has been used to tag 100k words of the CLE Urdu Digest Corpus, giving a tagging accuracy of 96.8%.