Scripting History: A Diachronic Urdu Text and Image Corpus from the 18th to 19th Centuries

Sana Shams, Sahar Rauf, Asad Mustafa, Muhammad Zeeshan Javed, Qurat-ul-Ain Akram, Sarmad Hussain, Miriam Butt


Abstract
This paper presents the Diachronic Urdu Text and Image Corpus, a one-million-word resource covering Urdu’s development across the 18th and 19th centuries. The corpus is compiled from 328 printed books published between 1800 and 1950, representing a diverse range of genres, authors, and publishers. A 140,000-word sub-corpus has been manually annotated with Urdu part-of-speech tags to facilitate linguistic and computational analysis. The dataset enables systematic investigation of historical changes in Urdu orthography, morphology, and syntax, providing new insights into the language’s history and standardization. To preserve the original printed form, each text is paired with its corresponding page image, creating the first multimodal diachronic corpus for Urdu. The paper outlines the corpus compilation pipeline, digitization workflow, text-image alignment, and annotation strategy designed to ensure accuracy, consistency, and authenticity. This multimodal Urdu diachronic corpus establishes a benchmark for research in computational linguistics, digital humanities, and South Asian language technology, supporting corpus-based exploration of Urdu’s linguistic history and cultural heritage.
Anthology ID:
2026.lrec-main.127
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
1622–1632
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.127/
Cite (ACL):
Sana Shams, Sahar Rauf, Asad Mustafa, Muhammad Zeeshan Javed, Qurat-ul-Ain Akram, Sarmad Hussain, and Miriam Butt. 2026. Scripting History: A Diachronic Urdu Text and Image Corpus from the 18th to 19th Centuries. International Conference on Language Resources and Evaluation, main:1622–1632.
Cite (Informal):
Scripting History: A Diachronic Urdu Text and Image Corpus from the 18th to 19th Centuries (Shams et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.127.pdf