Satyadev Ahlawat

2026

Revisiting Evaluation of Question Answering Systems in Low-Resource Indic Languages: Bridging Human and Metric Alignment
Anuj Kumar | Satyadev Ahlawat | Yamuna Prasad | Virendra Singh
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Evaluating Question Answering (QA) systems in low-resource Indic languages remains challenging due to the scarcity of annotated data, high linguistic diversity, and the absence of reliable evaluation metrics. Many Indian languages are severely underrepresented, making it difficult to accurately assess the performance of Large Language Models (LLMs) on QA tasks. Commonly used metrics like BLEU, ROUGE-L, and BERTScore, while successful in machine translation and resource-rich scenarios, tend to perform poorly in low-resource QA settings. These metrics often exhibit issues such as compressed scoring ranges, excessive zero scores, and weak alignment with human judgments. To overcome these limitations, this work introduces the LRM²QAS (Language Robust Multi-aspect Metrics for Question Answering Systems). This composite evaluation framework integrates semantic similarity, factual completeness, numerical accuracy, and contextual relevance. The proposed metric is evaluated across eight Indic-language QA tasks using multiple LLMs, as well as on open-domain benchmarks NaturalQuestions (NQ) and TriviaQA (TQ). Across all settings, LRM²QAS demonstrates stronger agreement with human evaluation, as measured by Pearson, Spearman, and Kendall correlation coefficients. Experimental findings highlight that LRM²QAS provides more precise distinctions between model outputs and aligns more closely with human judgment, offering a reliable framework for evaluating multilingual QA in low-resource Indic languages.

2025

pdf bib abs

LRMGS: A Language-Robust Metric for Evaluating Question Answering in Very Low-Resource Indic Languages
Anuj Kumar | Satyadev Ahlawat | Yamuna Prasad | Virendra Singh
The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Reliable evaluation of Question Answering (QA) systems in low-resource Indic languages presents a significant challenge due to limited annotated datasets, linguistic diversity, and suitable evaluation metrics. Languages such as Sindhi, Manipuri, Dogri, Konkani, and Maithili are particularly underrepresented, creating difficulty in assessing Large Language Models (LLMs) on QA tasks. Existing metrics, including BLEU, ROUGE-L, and BERTScore, are effective in machine translation and high-resource settings; however, they often fail in low-resource QA due to score compression, zero-inflation, and poor scale alignment. To overcome this, LRMGS (Language-Robust Metric for Generative QA) is introduced to capture semantic and lexical agreement while preserving the score scale across languages. LRMGS is evaluated across 8 Indic languages and multiple LLMs, demonstrating consistently higher concordance with reference-based chrF++ scores, measured using the Concordance Correlation Coefficient (CCC). Experimental results indicate that LRMGS provides more accurate discrimination of system performance in very low-resource languages compared to existing metrics. This work establishes a robust and interpretable framework for evaluating QA systems in low-resource Indic languages, supporting more reliable multilingual model assessment.

pdf bib abs

KGFakeNet: A Knowledge Graph-Enhanced Model for Fake News Detection
Anuj Kumar | Pardeep Kumar | Abhishek Yadav | Satyadev Ahlawat | Yamuna Prasad
Proceedings of the Workshop on Generative AI and Knowledge Graphs (GenAIK)

The proliferation of fake news on social media has intensified the spread of misinformation, promoting societal biases, hate, and violence. While recent advancements in Generative AI (GenAI), particularly large language models (LLMs), have shown promise, these models often need more structured representation for accurate verification, as they rely on pre-trained data patterns without access to real-time or validated information. This study presents a framework that utilizes Open Information Extractor 6 (OpenIE6) to extract triplet relationships (subject-predicate-object) from statements and justifications to compute the cosine similarity between the Knowledge Graphs (KGs) of the statements and their supporting justification to precisely measure the relevance and alignment between them. This similarity feature is integrated with an attention mechanism over GenAI-generated embeddings to enhance the model’s ability to capture semantic features accurately. In addition, a Multi-Layer Perceptron (MLP) classifier is employed to integrate all features, resulting in a 4% improvement in accuracy and a 5% increase in F1-score over state-of-the-art LLM-based approaches.

pdf bib abs

MarathiEmoExplain: A Dataset for Sentiment, Emotion, and Explanation in Low-Resource Marathi
Anuj Kumar | Mohammed Faisal Sayed | Satyadev Ahlawat | Yamuna Prasad
Findings of the Association for Computational Linguistics: EMNLP 2025

Marathi, the third most widely spoken language in India with over 83 million native speakers, remains significantly underrepresented in Natural Language Processing (NLP) research. While sentiment analysis has achieved substantial progress in high-resource languages such as English, Chinese, and Hindi, available Marathi datasets are limited to coarse sentiment labels and lack fine-grained emotional categorization or interpretability through explanations. To address this gap, we present a new annotated dataset of 10,762 Marathi sentences, each labeled with sentiment (positive, negative, or neutral), emotion (joy, anger, surprise, disgust, sadness, fear, or neutral), and a corresponding natural language justification. Justifications are written in English and generated using GPT-4 under a human-in-the-loop framework to ensure label fidelity and contextual alignment. Extensive experiments with both classical and transformer-based models demonstrate the effectiveness of the dataset for interpretable affective computing in a low-resource language setting, offering a benchmark for future research in multilingual and explainable NLP.

Co-authors

Abhishek Yadav 1

Venues

WS1

Fix author