Solmaz Panahi
2026
When LLMs Annotate: Reliability Challenges in Low-Resource NLI
Solmaz Panahi | John Kelleher | Vasudevan Nedumpozhimana
Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)
This paper systematically evaluates LLM reliability on the complex semantic task of Natural Language Inference (NLI) in Farsi, assessing six prominent models across eight prompt variations through a multi-dimensional framework that measures accuracy, prompt sensitivity, and intra-class consistency. Our results demonstrate that prompt design, particularly the order of premise and hypothesis, significantly impacts prediction stability. Proprietary models (Claude-Opus-4, GPT-4o) exhibit superior stability and accuracy compared to open-weight alternatives. Across all models, the 'Neutral' class emerges as the most challenging and least stable category. Crucially, we redefine model instability as a diagnostic tool for benchmark quality, demonstrating that observed disagreement often reflects valid challenges to ambiguous or erroneous gold-standard labels.