2025
Multi-Feature Graph Convolution Network for Hindi OCR Verification
Shikhar Dubey | Krish Mittal | Sourava Kumar Behera | Manikandan Ravikiran | Nitin Kumar | Saurabh Shigwan | Rohit Saluja
Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)
This paper presents a novel Graph Convolutional Network (GCN) based framework for verifying OCR predictions on real Hindi document images, specifically addressing the challenges of complex conjuncts and character segmentation. Our approach first segments Hindi characters in real book images at different levels of granularity, while also synthetically generating word images from OCR predictions. Both real and synthetic images are processed through ResNet-50 to extract feature representations, which are then segmented using multiple patching strategies (uniform, akshara, random, and letter patches). The bounding boxes created using segmentation masks are scaled proportionally to the feature space while extracting features for the GCN. We construct a line graph where each node represents a real-synthetic character pair (in feature space). Each node of the line graph captures semantic and geometric features including i) cross-entropy between original and synthetic features, ii) Hu moments difference for shape properties, and iii) pixel count difference for size variation. The GCN with three convolutional layers (and ELU activation) processes these graph-structured features to verify the correctness of OCR predictions. Experimental evaluation on 1000 images from diverse Hindi books demonstrates the effectiveness of our graph-based verification approach in detecting OCR errors, particularly for challenging conjunct characters where traditional methods struggle.
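The three per-node features listed in the abstract (cross-entropy between feature vectors, Hu-moment difference, and pixel-count difference) can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the softmax normalization, the restriction to the first two Hu invariants, and all function names are assumptions.

```python
import numpy as np

def hu_moments(img):
    """First two Hu invariants from normalized central moments (pure NumPy)."""
    ys, xs = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    cx, cy = (xs * img).sum() / m00, (ys * img).sum() / m00
    def mu(p, q):  # central moment
        return (((xs - cx) ** p) * ((ys - cy) ** q) * img).sum()
    def eta(p, q):  # scale-normalized central moment
        return mu(p, q) / m00 ** (1 + (p + q) / 2)
    hu1 = eta(2, 0) + eta(0, 2)
    hu2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return np.array([hu1, hu2])

def node_features(real_img, synth_img, real_feat, synth_feat):
    """Feature vector for one real-synthetic character-pair node."""
    # i) cross-entropy between softmax-normalized feature vectors
    p = np.exp(real_feat - real_feat.max()); p /= p.sum()
    q = np.exp(synth_feat - synth_feat.max()); q /= q.sum()
    xent = -(p * np.log(q + 1e-12)).sum()
    # ii) Hu-moment difference (shape properties)
    hu_diff = np.abs(hu_moments(real_img) - hu_moments(synth_img)).sum()
    # iii) foreground pixel-count difference (size variation)
    px_diff = abs(int((real_img > 0).sum()) - int((synth_img > 0).sum()))
    return np.array([xent, hu_diff, px_diff], dtype=float)
```

An identical real/synthetic pair yields zero shape and size differences, so only genuine mismatches contribute signal to the GCN.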
INDRA: Iterative Difficulty Refinement Attention for MCQ Difficulty Estimation for Indic Languages
Manikandan Ravikiran | Rohit Saluja | Arnav Bhavsar
Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)
Estimating the difficulty of multiple-choice questions (MCQs) is central to adaptive testing and learner modeling. We introduce INDRA (Iterative Difficulty Refinement Attention), a novel attention mechanism that unifies psychometric priors with neural refinement for Indic MCQ difficulty estimation. INDRA incorporates three key innovations: (i) IRT-informed initialization, which assigns token-level discrimination and difficulty scores to embed psychometric interpretability; (ii) entropy-driven iterative refinement, which progressively sharpens attention to mimic the human process of distractor elimination; and (iii) Indic Aware Graph Coupling, which propagates plausibility across morphologically and semantically related tokens, a critical feature for Indic languages. Experiments on the TEEMIL-H and TEEMIL-K datasets show that INDRA achieves consistent improvements, with absolute gains of up to +1.02 F1 and +1.68 F1 over the state of the art, while ablation studies demonstrate that psychometric priors, entropy refinement, and graph coupling contribute complementary gains to accuracy and robustness.
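The "entropy-driven iterative refinement" idea can be sketched as repeatedly lowering a softmax temperature in proportion to the current attention entropy, so diffuse distributions are sharpened hardest. This is a minimal sketch of the general technique; the cooling rule, step count, and function names are assumptions, not INDRA's actual formulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

def refine_attention(logits, steps=3):
    """Entropy-driven sharpening: each iteration divides the softmax
    temperature by (1 + current entropy), so high-entropy (diffuse)
    attention is concentrated most aggressively, mimicking stepwise
    distractor elimination."""
    temp = 1.0
    p = softmax(logits / temp)
    history = [entropy(p)]
    for _ in range(steps):
        temp /= 1.0 + entropy(p)  # more sharpening when attention is diffuse
        p = softmax(logits / temp)
        history.append(entropy(p))
    return p, history
```

Since the temperature only ever shrinks, the attention entropy is non-increasing across iterations while the argmax token is preserved.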
AnciDev: A Dataset for High-Accuracy Handwritten Text Recognition of Ancient Devanagari Manuscripts
Vriti Sharma | Rajat Verma | Rohit Saluja
Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)
The digital preservation and accessibility of historical documents require accurate and scalable Handwritten Text Recognition (HTR). However, progress in this field is significantly hampered for low-resource scripts, such as ancient forms of the scripts used in historical manuscripts, due to the scarcity of high-quality, transcribed training data. We address this critical gap by introducing the AnciDev Dataset, a novel, publicly available resource comprising 3,000 transcribed text lines sourced from 500 pages of different ancient Devanagari manuscripts. To validate the utility of this new resource, we systematically evaluate and fine-tune several HTR models on the AnciDev Dataset. Our experiments demonstrate a significant performance uplift across all fine-tuned models, with the best-performing architecture achieving a substantial reduction in Character Error Rate (CER), confirming the dataset’s efficacy in addressing the unique complexities of ancient handwriting. This work not only provides a crucial, well-curated dataset to the research community but also sets a new, reproducible state-of-the-art for the HTR of historical Devanagari, advancing the effort to digitally preserve India’s documentary heritage.
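The Character Error Rate (CER) used to evaluate the fine-tuned HTR models is the standard Levenshtein edit distance between the predicted and reference transcriptions, normalized by reference length. A minimal sketch (function name and single-row DP layout are illustrative choices):

```python
def cer(reference, hypothesis):
    """Character Error Rate: Levenshtein distance (insertions, deletions,
    substitutions) from hypothesis to reference, divided by reference length.
    Uses a single rolling DP row for O(len(hypothesis)) memory."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # distances for the empty reference prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (reference[i - 1] != hypothesis[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)
```

Lower is better; a CER of 0.0 means the prediction matches the ground-truth line exactly, character for character.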
TEEMIL : Towards Educational MCQ Difficulty Estimation in Indic Languages
Manikandan Ravikiran | Siddharth Vohra | Rajat Verma | Rohit Saluja | Arnav Bhavsar
Proceedings of the 31st International Conference on Computational Linguistics
Difficulty estimation of multiple-choice questions (MCQs) is crucial for creating effective educational assessments, yet remains underexplored in Indic languages like Hindi and Kannada due to the lack of comprehensive datasets. This paper addresses this gap by introducing two datasets, TEEMIL-H and TEEMIL-K, containing 4689 and 4215 MCQs, respectively, with manually annotated difficulty labels. We benchmark these datasets using state-of-the-art multilingual models and conduct ablation studies to analyze the effect of context, the impact of options, and the presence of the None of the Above (NOTA) option on difficulty estimation. Our findings establish baselines for difficulty estimation in Hindi and Kannada, offering valuable insights into improving model performance and guiding future research in MCQ difficulty estimation.
HiLearners: Non-Native Spoken Hindi Error Correction
Sourava Kumar Behera | Rohit Saluja
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
While the majority of current resources rely on formal text corrections, our work shifts the focus to non-native spoken Hindi error correction, which presents unique challenges due to Hindi's rich morphology, complex syntax, and distinct error patterns. To address the scarcity of authentic learner data, we introduce HiLearners, a dataset gathered from 2,500 real non-native Hindi speakers across three linguistic backgrounds (English, Bengali, Dravidian), capturing authentic error patterns including transfer errors, overgeneralization patterns, and contextual agreement issues. Furthermore, to overcome data resource limitations, we develop a methodical synthetic data augmentation technique, utilizing Large Language Models (LLMs) with a pattern analysis and controlled generation approach similar to Retrieval-Augmented Generation (RAG), yielding 5,500 carefully verified synthetic examples. Through extensive experiments on individual, mixed, and progressive curriculum-based configurations using multilingual models, we demonstrate that LLM-based synthetic data combined with three-phase curriculum learning significantly boosts performance, achieving a 76.92 GLEU score and surpassing human-only baselines. This work bridges the gap between native-centric error correction research and non-native Hindi learner needs, establishing a realistic assessment standard for advancing low-resource language processing.
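One plausible reading of a "three-phase curriculum" over human and synthetic data is a scheduler that shifts the mixing ratio toward LLM-generated examples as training progresses. The phase ratios, batch sizes, and function names below are hypothetical, not the paper's configuration:

```python
import random

def curriculum_batches(human, synthetic,
                       phases=((1.0, 0.0), (0.5, 0.5), (0.2, 0.8)),
                       batch_size=4, batches_per_phase=2, seed=0):
    """Hypothetical three-phase curriculum: each phase draws mini-batches
    with a fixed (human, synthetic) mixing ratio, moving from human-only
    data toward mostly synthetic data across phases."""
    rng = random.Random(seed)
    for h_frac, _s_frac in phases:
        n_h = min(round(batch_size * h_frac), len(human))
        for _ in range(batches_per_phase):
            yield (rng.sample(human, n_h)
                   + rng.sample(synthetic, min(batch_size - n_h, len(synthetic))))
```

The first phase trains on authentic learner errors only; later phases mix in verified synthetic examples, mirroring the progressive configurations the abstract describes.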
Assessing ASR Robustness for Burmese: Impacts of Missing Speech Segments and Interruptions
Ankit Maurya | Manikandan Ravikiran | Rohit Saluja
Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP-2025)
HiSlang-4.9k: A Benchmark Dataset for Hindi Slang Detection and Identification
Tanmay Tiwari | Vibhu Gupta | Manikandan Ravikiran | Rohit Saluja
Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP-2025)