Corpora Generation for Urdu Grammatical Error Correction

Syed Ahad, Burhanuddin Aliasghar Ezzi, Muhammad Arsalan Hussain, Sandesh Kumar, Abdul Samad


Abstract
Grammatical Error Correction (GEC) for Urdu remains under-researched owing to a lack of annotated datasets. This paper addresses the challenge of generating a robust corpus for fine-tuning deep learning models for Urdu GEC. We propose a method for synthesizing a large dataset: we collect errors from the Urdu WikiEdits history, learn from them, and insert similar errors into grammatically correct sentences, yielding pairs of grammatically correct and incorrect sentences. We introduce UrduGEC-Synthetic, a dataset generated through this pipeline. We also introduce UrduGEC-Gold, a gold dataset built by extracting errors from students' exam copies. Finally, we fine-tune various models on UrduGEC-Synthetic and evaluate them against UrduGEC-Gold to demonstrate the quality of the synthetic data.
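The collect, learn, and insert steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `extract_error_patterns` and `inject_errors`, the whitespace tokenization, the restriction to one-to-one token substitutions, and the `rate` parameter are all assumptions made for illustration, and the toy English tokens merely stand in for Urdu text.

```python
import random
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def extract_error_patterns(revision_pairs):
    """Count (correct -> erroneous) token substitutions seen in edit history.

    Assumption: each pair is (erroneous_sentence, corrected_sentence) and
    whitespace tokenization is good enough for illustration.
    """
    patterns = defaultdict(Counter)
    for before, after in revision_pairs:
        src, tgt = before.split(), after.split()
        matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Keep only one-to-one token replacements for simplicity.
            if tag == "replace" and i2 - i1 == 1 and j2 - j1 == 1:
                patterns[tgt[j1]][src[i1]] += 1
    return patterns


def inject_errors(sentence, patterns, rng, rate=0.3):
    """Corrupt a clean sentence by swapping tokens for observed erroneous forms.

    Each token with a known erroneous form is replaced with probability
    `rate` (a hypothetical knob), sampling in proportion to observed counts.
    """
    corrupted = []
    for token in sentence.split():
        candidates = patterns.get(token)
        if candidates and rng.random() < rate:
            forms, weights = zip(*candidates.items())
            corrupted.append(rng.choices(forms, weights=weights, k=1)[0])
        else:
            corrupted.append(token)
    return " ".join(corrupted)


if __name__ == "__main__":
    # Toy placeholder data, not drawn from the paper's corpora.
    edits = [("he go home", "he goes home")]
    learned = extract_error_patterns(edits)
    clean = "she goes out"
    noisy = inject_errors(clean, learned, random.Random(0), rate=1.0)
    print((noisy, clean))  # an (incorrect, correct) training pair
```

Pairing each corrupted sentence with its clean source gives the kind of (incorrect, correct) training pair the abstract describes; a realistic pipeline would likely need Urdu-aware tokenization and error types beyond single-token substitution.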
Anthology ID:
2026.findings-acl.2156
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
43428–43444
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2156/
Cite (ACL):
Syed Ahad, Burhanuddin Aliasghar Ezzi, Muhammad Arsalan Hussain, Sandesh Kumar, and Abdul Samad. 2026. Corpora Generation for Urdu Grammatical Error Correction. In Findings of the Association for Computational Linguistics: ACL 2026, pages 43428–43444, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Corpora Generation for Urdu Grammatical Error Correction (Ahad et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2156.pdf
Checklist:
 2026.findings-acl.2156.checklist.pdf