Sunil Jaat


2025

Team Horizon at BHASHA Task 1: Multilingual IndicGEC with Transformer-based Grammatical Error Correction Models
Manav Dhamecha | Sunil Jaat | Gaurav Damor | Pruthwik Mishra
Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)

This paper presents Team Horizon’s approach to the BHASHA Shared Task 1: Indic Grammatical Error Correction (IndicGEC). We explore transformer-based multilingual models — mT5-small and IndicBART — to correct grammatical and semantic errors across five Indian languages: Bangla, Hindi, Tamil, Telugu, and Malayalam. Due to limited annotated data, we developed a synthetic data augmentation pipeline that introduces realistic linguistic errors under ten categories, simulating natural mistakes found in Indic scripts. Our fine-tuned models achieved competitive performance with GLEU scores of 86.03 (Tamil), 72.00 (Telugu), 82.69 (Bangla), 80.44 (Hindi), and 84.36 (Malayalam). We analyze the impact of dataset scaling, multilingual fine-tuning, and training epochs, showing that linguistically grounded augmentation can significantly improve grammatical correction accuracy in low-resource Indic languages.
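The abstract does not enumerate the ten error categories, so the sketch below is purely illustrative: a hypothetical Python error-injection step that pairs a clean sentence with a synthetically corrupted copy, using two stand-in perturbations (an adjacent-character swap and a Devanagari vowel-sign deletion).

import random

# Hypothetical error injection: the paper's ten categories are not listed in
# the abstract, so these two perturbations are illustrative stand-ins only.
DEVANAGARI_MATRAS = {chr(c) for c in range(0x093E, 0x094D)}  # dependent vowel signs

def inject_error(sentence: str, rng: random.Random) -> str:
    """Return a noisy copy of `sentence` to pair with the clean original."""
    chars = list(sentence)
    if rng.random() < 0.5 and len(chars) > 1:
        i = rng.randrange(len(chars) - 1)            # swap two adjacent characters
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    else:
        matra_positions = [i for i, c in enumerate(chars) if c in DEVANAGARI_MATRAS]
        if matra_positions:                          # delete one vowel sign (matra)
            del chars[rng.choice(matra_positions)]
    return "".join(chars)

rng = random.Random(0)
clean = "मैं कल बाज़ार गया था"
print(inject_error(clean, rng), "->", clean)         # one (noisy, clean) training pair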

Team Horizon at BHASHA Task 2: Fine-tuning Multilingual Transformers for Indic Word Grouping
Manav Dhamecha | Gaurav Damor | Sunil Jaat | Pruthwik Mishra
Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)

We present Team Horizon’s approach to BHASHA Task 2: Indic Word Grouping. We frame word grouping as a token classification task and fine-tune multilingual Transformer encoders for it. We evaluate MuRIL, XLM-RoBERTa, and IndicBERT v2, reporting Exact Match accuracy on the test data. Our best model, MuRIL, achieves 58.1818% exact match on the test set.
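A minimal sketch of this token-classification framing with Hugging Face Transformers, assuming a hypothetical BIO-style label set (the abstract does not specify the paper's label scheme, and the classification head below is randomly initialized rather than the authors' fine-tuned model):

from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-GROUP", "I-GROUP"]  # hypothetical word-group tagging scheme
tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "google/muril-base-cased", num_labels=len(labels)
)

enc = tokenizer("राम ने सीता को किताब दी", return_tensors="pt")
logits = model(**enc).logits               # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0].tolist()
print([labels[i] for i in pred_ids])       # per-subword labels (untrained head)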