IndiGEC: Multilingual Grammar Error Correction for Low-Resource Indian Languages

Ujjwal Sharma, Pushpak Bhattacharyya


Abstract
Grammatical Error Correction (GEC) for low-resource Indic languages faces significant challenges due to the scarcity of annotated data. In this work, we introduce the Mask-Translate&Fill (MTF) framework, a novel approach for generating high-quality synthetic data for GEC using only monolingual corpora. MTF leverages a machine translation system and a pretrained masked language model to introduce synthetic errors intended to mimic those made by second-language learners. Our experimental results on English, Hindi, Bengali, Marathi, and Tamil demonstrate that MTF consistently outperforms other monolingual synthetic data generation methods and achieves performance comparable to the Translation Language Modeling (TLM)-based approach, which uses a bilingual corpus, in both independent and multilingual settings. Under multilingual training, MTF yields significant improvements across Indic languages, with particularly notable gains in Bengali and Tamil, achieving +1.6 and +3.14 GLEU over the TLM-based method, respectively. To support further research, we also introduce the IndiGEC Corpus, a high-quality, human-written, manually validated GEC dataset for the four Indic languages (Hindi, Bengali, Marathi, and Tamil), comprising over 8,000 sentence pairs with separate development and test splits.
Anthology ID:
2025.emnlp-main.1139
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
22393–22407
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1139/
Cite (ACL):
Ujjwal Sharma and Pushpak Bhattacharyya. 2025. IndiGEC: Multilingual Grammar Error Correction for Low-Resource Indian Languages. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 22393–22407, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
IndiGEC: Multilingual Grammar Error Correction for Low-Resource Indian Languages (Sharma & Bhattacharyya, EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1139.pdf
Checklist:
 2025.emnlp-main.1139.checklist.pdf