Burhanuddin Aliasghar Ezzi

2026

Corpora Generation for Urdu Grammatical Error Correction
Syed Ahad | Burhanuddin Aliasghar Ezzi | Muhammad Arsalan Hussain | Sandesh Kumar | Abdul Samad
Findings of the Association for Computational Linguistics: ACL 2026

Grammatical Error Correction (GEC) for Urdu remains under-researched because annotated datasets are scarce. This paper addresses the challenge of generating a robust corpus for fine-tuning deep learning models for Urdu GEC. We propose a method for synthesizing a large dataset: we collect errors from the Urdu WikiEdits history, learn from them, and insert similar errors into grammatically correct sentences, producing pairs of grammatically correct and incorrect sentences. We introduce UrduGEC-Synthetic, a dataset generated through this pipeline, and UrduGEC-Gold, a gold dataset built by extracting errors from students' exam copies. Finally, we fine-tune various models on UrduGEC-Synthetic and evaluate them against UrduGEC-Gold to assess the quality of the synthetic data.
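
The error-harvesting and error-insertion idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not say how errors are learned, so the sketch assumes the simplest possible case, single-token substitutions mined from (before, after) revision pairs and re-applied to clean sentences. The function names `extract_substitutions` and `inject_errors`, the `rate` parameter, and the romanized placeholder tokens are all hypothetical.

```python
import difflib
import random
from collections import defaultdict


def extract_substitutions(revision_pairs):
    """Map each corrected token to the erroneous tokens it replaced.

    revision_pairs: iterable of (before, after) sentence strings, where
    `before` is assumed to contain the error and `after` the correction.
    Only one-token-for-one-token replacements are kept (an assumption).
    """
    subs = defaultdict(list)
    for before, after in revision_pairs:
        b, a = before.split(), after.split()
        matcher = difflib.SequenceMatcher(a=b, b=a, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace" and i2 - i1 == 1 and j2 - j1 == 1:
                subs[a[j1]].append(b[i1])
    return dict(subs)


def inject_errors(sentence, subs, rng, rate=0.5):
    """Return a noisy copy of a clean sentence.

    Each token that has a known erroneous variant is swapped for one
    with probability `rate`; all other tokens are left unchanged.
    """
    noisy = [
        rng.choice(subs[tok]) if tok in subs and rng.random() < rate else tok
        for tok in sentence.split()
    ]
    return " ".join(noisy)


# Hypothetical usage with romanized placeholder tokens.
subs = extract_substitutions([("hum ja raha hain", "hum ja rahe hain")])
clean = "wo khel rahe hain"
pair = (inject_errors(clean, subs, random.Random(0), rate=1.0), clean)
```

Here `pair` holds an (incorrect, correct) sentence pair of the kind the abstract describes. A realistic system would also need to handle insertions, deletions, multi-token edits, and revisions that are not grammatical fixes at all.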

