Vasudevan Nedumpozhimana


2026

Large Language Models (LLMs) have enabled scalable synthetic data generation, yet their effective adaptation to low-resource languages remains underexplored. We introduce an LLM-based generate-and-annotate paradigm to create synthetic datasets for low-resource NLP classification tasks. The framework employs a smaller model for text generation and a stronger model for automatic annotation. Using Farsi Natural Language Inference (NLI) as a case study, we construct a large-scale synthetic dataset of 100,000 labeled instances. We provide a systematic empirical analysis of annotation quality, label-distribution effects, and training regimes. We compare GPT-4o-mini, Aya-23-35B, and DeBERTa as annotators and examine how annotation variability propagates to downstream performance. Our results show that a warm-up phase with synthetic data consistently outperforms data mixing and reversed ordering. Notably, open-source annotation (Aya-23-35B) achieves downstream performance comparable to the proprietary model (GPT-4o-mini), with significant cost implications for deploying such pipelines in low-resource settings. The dataset and code are publicly available at https://huggingface.co/datasets/Solmazp/text2entail.
This paper systematically evaluates LLM reliability on the complex semantic task of Natural Language Inference (NLI) in Farsi, assessing six prominent models across eight prompt variations through a multi-dimensional framework that measures accuracy, prompt sensitivity, and intra-class consistency. Our results demonstrate that prompt design, particularly the order of premise and hypothesis, significantly impacts prediction stability. Proprietary models (Claude-Opus-4, GPT-4o) exhibit superior stability and accuracy compared to open-weight alternatives. Across all models, the "Neutral" class emerges as the most challenging and least stable category. Crucially, we redefine model instability as a diagnostic tool for benchmark quality, demonstrating that observed disagreement often reflects valid challenges to ambiguous or erroneous gold-standard labels.

2023

In the context of an epidemiological study involving multilingual social media, this paper reports on the ability of machine translation systems to preserve content relevant to a document classification task designed to determine whether a social media text is related to COVID-19. The results indicate that machine translation does provide a feasible basis for scaling epidemiological social media surveillance to multiple languages. Moreover, a qualitative error analysis revealed that the majority of classification errors are not caused by MT errors.
The goal of this paper is to learn more about how idiomatic information is structurally encoded in embeddings, using a structural probing method. We repurpose an existing English verbal multi-word expression (MWE) dataset to suit the probing framework and perform a comparative probing study of static (GloVe) and contextual (BERT) embeddings. Our experiments indicate that both encode some idiomatic information to varying degrees, but yield conflicting evidence as to whether idiomaticity is encoded in the vector norm, leaving this an open question. We also identify some limitations of the dataset used and highlight important directions for future work in improving its suitability for a probing analysis.
Identification of mentions of medical concepts in social media text can provide useful information for caseload prediction of diseases like COVID-19 and measles. We propose a simple model for the automatic identification of medical concept mentions in social media text. We validate the effectiveness of the proposed model on Twitter, Reddit, and News/Media datasets.

2021

Sentence embeddings encode information relating to the usage of idioms in a sentence. This paper reports a set of experiments that combine a probing methodology with input masking to analyse where in a sentence this idiomatic information is taken from, and what form it takes. Our results indicate that BERT's idiomatic key is primarily found within an idiomatic expression, but also draws on information from the surrounding context. We also find that BERT can distinguish between the disruption caused by missing words and the incongruity caused by idiomatic usage.