Solmaz Panahi
2026
SynthLLM: An LLM-based Scalable Synthetic Data Generation Pipeline for Low-Resource Languages
Solmaz Panahi | Vasudevan Nedumpozhimana | John Kelleher
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Large Language Models (LLMs) have enabled scalable synthetic data generation, yet their effective adaptation to low-resource languages remains underexplored. We introduce an LLM-based generate-and-annotate paradigm for creating synthetic datasets for low-resource NLP classification tasks. The framework employs a smaller model for text generation and a stronger model for automatic annotation. Using Farsi Natural Language Inference (NLI) as a case study, we construct a large-scale synthetic dataset of 100,000 labeled instances. We provide a systematic empirical analysis of annotation quality, label-distribution effects, and training regimes. We compare GPT-4o-mini, Aya-23-35B, and DeBERTa as annotators and examine how annotation variability propagates to downstream performance. Our results show that a warm-up phase with synthetic data consistently outperforms data mixing and reversed ordering. Notably, open-source annotation (Aya-23-35B) achieves downstream performance comparable to that of the proprietary model (GPT-4o-mini), with significant cost implications for deploying pipelines in low-resource settings. The dataset and code are publicly available at https://huggingface.co/datasets/Solmazp/text2entail.
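The three training regimes named in the abstract (synthetic warm-up, data mixing, reversed ordering) can be illustrated with a minimal sketch. It is not the authors' code. The function name `build_schedule`, the regime strings, and the reading of "warm-up" as "train on synthetic data first, then on gold data" are assumptions made for illustration only.

```python
import random
from typing import List, Tuple

# (premise, hypothesis, label); a hypothetical record layout for NLI examples.
Example = Tuple[str, str, str]


def build_schedule(
    synthetic: List[Example],
    gold: List[Example],
    regime: str,
    seed: int = 0,
) -> List[List[Example]]:
    """Return training phases in the order they would be fed to a trainer.

    Assumed interpretation of the regimes:
      "warmup"   : phase 1 = synthetic data, phase 2 = gold data
      "reversed" : phase 1 = gold data, phase 2 = synthetic data
      "mixed"    : one phase containing a shuffled union of both sets
    """
    if regime == "warmup":
        return [list(synthetic), list(gold)]
    if regime == "reversed":
        return [list(gold), list(synthetic)]
    if regime == "mixed":
        merged = list(synthetic) + list(gold)
        random.Random(seed).shuffle(merged)  # seeded for reproducibility
        return [merged]
    raise ValueError(f"unknown regime: {regime!r}")


if __name__ == "__main__":
    syn = [("p1", "h1", "entailment"), ("p2", "h2", "neutral")]
    gld = [("p3", "h3", "contradiction")]
    for name in ("warmup", "mixed", "reversed"):
        phases = build_schedule(syn, gld, name)
        print(name, [len(phase) for phase in phases])
```

Each returned phase would be passed to the same fine-tuning loop in turn, so the regimes differ only in data order, not in the total number of examples seen.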
When LLMs Annotate: Reliability Challenges in Low-Resource NLI
Solmaz Panahi | John Kelleher | Vasudevan Nedumpozhimana
Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)
This paper systematically evaluates LLM reliability on Natural Language Inference (NLI) in Farsi, a complex semantic task. We assess six prominent models across eight prompt variations using a multi-dimensional framework that measures accuracy, prompt sensitivity, and intra-class consistency. Our results demonstrate that prompt design, particularly the order of premise and hypothesis, significantly impacts prediction stability. Proprietary models (Claude-Opus-4, GPT-4o) exhibit superior stability and accuracy compared to open-weight alternatives. Across all models, the 'Neutral' class emerges as the most challenging and least stable category. Crucially, we redefine model instability as a diagnostic tool for benchmark quality, demonstrating that observed disagreement often reflects valid challenges to ambiguous or erroneous gold-standard labels.
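One simple way to quantify how stable a model's predictions are across prompt variations is sketched below. The abstract does not define the paper's metrics, so this is an assumed stand-in: stability per example is the share of prompt variants that agree with the most frequent prediction, averaged per gold class. The function names `majority_agreement` and `per_class_stability` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def majority_agreement(preds: List[str]) -> float:
    """Fraction of prompt variants that agree with the most frequent label."""
    counts = Counter(preds)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(preds)


def per_class_stability(
    predictions: List[List[str]],  # one inner list per example, one label per prompt variant
    gold: List[str],               # gold label for each example
) -> Dict[str, float]:
    """Average majority agreement, grouped by gold label."""
    by_class: Dict[str, List[float]] = {}
    for preds, label in zip(predictions, gold):
        by_class.setdefault(label, []).append(majority_agreement(preds))
    return {label: sum(vals) / len(vals) for label, vals in by_class.items()}


if __name__ == "__main__":
    # Toy data: four prompt variants per example (the paper uses eight).
    predictions = [
        ["entailment", "entailment", "entailment", "entailment"],
        ["neutral", "contradiction", "neutral", "entailment"],
    ]
    gold = ["entailment", "neutral"]
    print(per_class_stability(predictions, gold))
```

A score of 1.0 means every prompt variant produced the same label for every example of that class; lower scores indicate that predictions shift with prompt wording or premise/hypothesis order.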