Corpora Generation for Urdu Grammatical Error Correction
Syed Ahad, Burhanuddin Aliasghar Ezzi, Muhammad Arsalan Hussain, Sandesh Kumar, Abdul Samad
Abstract
Grammatical Error Correction (GEC) for Urdu remains an under-researched area due to the lack of annotated datasets. This paper addresses the challenge of generating a robust corpus for fine-tuning deep learning models aimed at Urdu GEC. We propose a method for synthesizing a large dataset: we collect errors from the Urdu WikiEdits history, learn from them, and insert similar errors into grammatically correct sentences, yielding pairs of grammatically correct and incorrect sentences. We introduce UrduGEC-Synthetic, a synthetically generated dataset produced through this pipeline. Furthermore, we introduce UrduGEC-Gold, a gold-standard dataset built by extracting errors from students' exam copies. Finally, we fine-tune various models on UrduGEC-Synthetic and evaluate them against UrduGEC-Gold to demonstrate the quality of the synthetic data generation.
- Anthology ID:
- 2026.findings-acl.2156
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 43428–43444
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2156/
- Cite (ACL):
- Syed Ahad, Burhanuddin Aliasghar Ezzi, Muhammad Arsalan Hussain, Sandesh Kumar, and Abdul Samad. 2026. Corpora Generation for Urdu Grammatical Error Correction. In Findings of the Association for Computational Linguistics: ACL 2026, pages 43428–43444, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Corpora Generation for Urdu Grammatical Error Correction (Ahad et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2156.pdf
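
The abstract describes collecting errors from edit history, learning from them, and inserting similar errors into correct sentences. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes errors can be approximated as one-to-one token substitutions, and the function names `learn_error_patterns` and `inject_errors`, the whitespace tokenization, and the romanized toy sentences are all hypothetical choices made for illustration.

```python
import random
from collections import Counter
from difflib import SequenceMatcher


def learn_error_patterns(edit_pairs):
    """Count (correct_token, erroneous_token) substitutions seen in edits.

    edit_pairs: iterable of (erroneous_sentence, corrected_sentence) strings,
    e.g. the before/after text of a revision (assumed input format).
    """
    patterns = Counter()
    for wrong, right in edit_pairs:
        wrong_toks, right_toks = wrong.split(), right.split()
        matcher = SequenceMatcher(a=right_toks, b=wrong_toks, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Simplifying assumption: keep only single-token replacements.
            if tag == "replace" and i2 - i1 == 1 and j2 - j1 == 1:
                patterns[(right_toks[i1], wrong_toks[j1])] += 1
    return patterns


def inject_errors(sentence, patterns, rng, max_errors=1):
    """Return a corrupted copy of a correct sentence using learned patterns."""
    # Index the learned substitutions by their correct token.
    by_correct = {}
    for (good, bad), count in patterns.items():
        by_correct.setdefault(good, []).append((bad, count))

    tokens = sentence.split()
    candidates = [i for i, tok in enumerate(tokens) if tok in by_correct]
    rng.shuffle(candidates)
    for i in candidates[:max_errors]:
        options, weights = zip(*by_correct[tokens[i]])
        # Sample an erroneous form in proportion to its observed frequency.
        tokens[i] = rng.choices(options, weights=weights, k=1)[0]
    return " ".join(tokens)


# Toy romanized example (illustrative only, not data from the paper).
pairs = [("larke ka kitab", "larke ki kitab")]
patterns = learn_error_patterns(pairs)
correct = "ali ki kitab"
synthetic_pair = (inject_errors(correct, patterns, random.Random(0)), correct)
```

Each call to `inject_errors` yields the incorrect half of a training pair, with the untouched input serving as the correct half. A sentence containing no token that matches a learned pattern is returned unchanged, so a real pipeline would likely filter such pairs out.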