All Prompts Are Created Equal? Evaluating Robustness of LLM Judges Against Non-Adversarial Prompt Variations

Savita Bhat; Vasudeva Varma

Abstract

LLM-based evaluation systems (LLM judges) have emerged as a scalable alternative to expensive human evaluation. Although LLM judges demonstrate 70-80% agreement with human evaluators, their robustness under semantically equivalent prompt variations remains underexplored. Through a systematic evaluation of 8 models across 4 NLG tasks using 10 semantically equivalent paraphrases per prompt (~115,000 evaluations), we identify a critical accuracy-robustness gap: attribute verifiability affects robustness more than model choice, with factually verifiable attributes achieving 0.71 accuracy versus 0.19 for subjective attributes. Our investigation yields three key insights: 1) task structure characteristics influence robustness and, in turn, accuracy; 2) attribute verifiability is the strongest predictor: factually verifiable attributes achieve 0.71 accuracy versus 0.19 for subjective attributes; 3) there is no single winning model: the smallest model (Llama-3.1-8B) exhibits the second-best performance, while the strongest model from the same family (Llama-4) lags significantly behind, demonstrating that general capability improvements do not necessarily translate into evaluation robustness. Based on these findings, we propose a diagnostic framework grounded in attribute verifiability that enables principled decisions about evaluation automation. Our work establishes new standards for assessing LLM judge reliability beyond simple accuracy metrics.
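
To make the accuracy-robustness distinction concrete, the following is a minimal sketch of how the two quantities could be measured over paraphrased judge prompts. The abstract does not specify the paper's robustness metric, so the majority-agreement `consistency` score, the function names, and the toy labels below are all illustrative assumptions, not the authors' method.

```python
from collections import Counter

def accuracy(verdicts, gold):
    """Fraction of paraphrase-level verdicts that match the gold label."""
    return sum(v == gold for v in verdicts) / len(verdicts)

def consistency(verdicts):
    """Assumed robustness proxy: share of paraphrases that agree
    with the most common verdict for the same item."""
    return Counter(verdicts).most_common(1)[0][1] / len(verdicts)

def summarize(items):
    """items: list of (gold_label, [one verdict per paraphrase])."""
    n = len(items)
    return {
        "accuracy": sum(accuracy(v, g) for g, v in items) / n,
        "consistency": sum(consistency(v) for _, v in items) / n,
    }

# Toy data (hypothetical): 10 paraphrases per item, as in the setup described above.
stable_but_wrong = [("A", ["A"] * 10), ("B", ["A"] * 10)]
unstable = [("A", ["A"] * 6 + ["B"] * 4)]

print(summarize(stable_but_wrong))  # {'accuracy': 0.5, 'consistency': 1.0}
print(summarize(unstable))          # {'accuracy': 0.6, 'consistency': 0.6}
```

In the first toy case the judge never changes its verdict across paraphrases yet is right only half the time, which is why the two scores are worth reporting separately.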

Anthology ID: 2026.findings-acl.1929
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 38730–38745
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.findings-acl.1929/
Cite (ACL): Savita Bhat and Vasudeva Varma. 2026. All Prompts Are Created Equal? Evaluating Robustness of LLM Judges Against Non-Adversarial Prompt Variations. In Findings of the Association for Computational Linguistics: ACL 2026, pages 38730–38745, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): All Prompts Are Created Equal? Evaluating Robustness of LLM Judges Against Non-Adversarial Prompt Variations (Bhat & Varma, Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.findings-acl.1929.pdf
Checklist: 2026.findings-acl.1929.checklist.pdf
