Beyond Static Benchmarks: A Validity, Reliability, and Sociotechnical Framework for Evaluating LLMs in Deployment Contexts

Ben Jenkins


Abstract
Static leaderboards summarize large language model (LLM) performance but offer weak evidence under shifting usage, noisy inputs, and plural stakeholder values. We present VRS-Eval, operationalizing deployment validity (benchmark vs. deployment score alignment), operational reliability (stability under a declared perturbation family), and sociotechnical alignment (metric vs. elicited rubric weights as a thin audit summary). With a reproducible simulator under explicit PB vs. PD shift and multi-turn interaction, we stress-test evaluation protocols in a controlled environment: under our main setting, benchmark-side scores (on PB) exceed estimated deployment-side utility scores (evaluated on trajectories from PD) by roughly 21–26% in relative terms across three metrics, with tight 95% percentile intervals (K=200). Failure mixtures emphasize overfitting, shift fragility, and rubric misalignment, consistent with first- vs. third-party reporting asymmetries (Reuel et al., 2025). A staged pipeline narrows the validity gap and raises reliability for the same generative story. Sensitivity sweeps over |Ω| and rubric-label rate preserve the rank ordering of harnesses, suggesting the qualitative conclusions are robust to plausible design-choice variation within the simulator. We discuss harness and accountability implications.
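As an illustration of the kind of quantity the abstract reports, the following is a minimal sketch of a relative benchmark-vs-deployment gap with a 95% percentile interval. It is not the paper's implementation. The function names, the choice of the deployment mean as the denominator, and the reading of K=200 as a number of bootstrap resamples are all assumptions made for this example.

```python
import random
import statistics


def relative_gap(bench_scores, deploy_scores):
    """Relative amount by which the mean benchmark score exceeds the
    mean deployment score. Dividing by the deployment mean is an
    assumption; the abstract does not state the denominator."""
    b = statistics.fmean(bench_scores)
    d = statistics.fmean(deploy_scores)
    return (b - d) / d


def percentile_interval(bench_scores, deploy_scores, k=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the relative gap.

    k is the number of resamples (assumed here to correspond to K=200).
    Each resample draws with replacement from both score lists.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(k):
        b = rng.choices(bench_scores, k=len(bench_scores))
        d = rng.choices(deploy_scores, k=len(deploy_scores))
        gaps.append(relative_gap(b, d))
    gaps.sort()
    # Simple order-statistic percentiles over the sorted resampled gaps.
    lo = gaps[int((alpha / 2) * (k - 1))]
    hi = gaps[int((1 - alpha / 2) * (k - 1))]
    return lo, hi


if __name__ == "__main__":
    # Hypothetical toy scores, not data from the paper.
    rng = random.Random(1)
    bench = [0.80 + rng.uniform(-0.05, 0.05) for _ in range(100)]
    deploy = [0.65 + rng.uniform(-0.05, 0.05) for _ in range(100)]
    print("relative gap:", relative_gap(bench, deploy))
    print("95% percentile interval:", percentile_interval(bench, deploy))
```

Under this convention, a gap of 0.2 means the benchmark-side mean is 20% higher than the deployment-side mean; a different denominator would give a different number for the same scores.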
Anthology ID:
2026.evaleval-1.30
Volume:
Proceedings of the Workshop on Evaluating Evaluations (EvalEval)
Month:
July
Year:
2026
Address:
San Diego, CA
Editors:
Mubashara Akhtar, Jan Batzner, Leshem Choshen, Avijit Ghosh, Usman Gohar, Jennifer Mickel, Ichhya Pant, Zeerak Talat, Michelle Lin
Venues:
EvalEval | WS
Publisher:
Association for Computational Linguistics
Pages:
201–210
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.evaleval-1.30/
Cite (ACL):
Ben Jenkins. 2026. Beyond Static Benchmarks: A Validity, Reliability, and Sociotechnical Framework for Evaluating LLMs in Deployment Contexts. In Proceedings of the Workshop on Evaluating Evaluations (EvalEval), pages 201–210, San Diego, CA. Association for Computational Linguistics.
Cite (Informal):
Beyond Static Benchmarks: A Validity, Reliability, and Sociotechnical Framework for Evaluating LLMs in Deployment Contexts (Jenkins, EvalEval 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.evaleval-1.30.pdf