@inproceedings{behera-saluja-2025-hilearners,
title = "{H}i{L}earners: Non-Native Spoken {H}indi Error Correction",
author = "Behera, Sourava Kumar and
Saluja, Rohit",
editor = "Inui, Kentaro and
Sakti, Sakriani and
Wang, Haofen and
Wong, Derek F. and
Bhattacharyya, Pushpak and
Banerjee, Biplab and
Ekbal, Asif and
Chakraborty, Tanmoy and
Singh, Dhirendra Pratap",
booktitle = "Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics",
month = dec,
year = "2025",
address = "Mumbai, India",
publisher = "The Asian Federation of Natural Language Processing and The Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.findings-ijcnlp.78/",
pages = "1276--1288",
ISBN = "979-8-89176-303-6",
abstract = "While majority of current resources rely on formal text corrections, our work shifts the focus to non-native spoken Hindi error correction, which presents unique challenges due to its rich morphology, complex syntax, and distinct error patterns. To address the scarcity of authentic learner data, we introduce HiLearners, a dataset gathered from 2,500 real non-native Hindi speakers across three linguistic backgrounds (English, Bengali, Dravidian), capturing authentic error patterns including transfer errors, overgeneralization patterns, and contextual agreement issues. Furthermore, to overcome data resource limitations, we develop a methodical synthetic data augmentation technique, utilizing Large Language Models (LLMs) with a pattern analysis and controlled generation approach similar to Retrieval-Augmented Generation (RAG), yielding 5,500 carefully verified synthetic examples. Through extensive experiments on individual, mixed, and progressive curriculum-based configurations using multilingual models, we demonstrate that LLM-based synthetic data combined with three-phase curriculum learning significantly boosts performance, achieving a 76.92 GLEU score and surpassing human-only baselines. This work bridges the gap between native-centric error correction research and non-native Hindi learner needs, establishing a realistic assessment standard for advancing low-resource language processing."
}