Ben Jenkins
2026
Beyond Static Benchmarks: A Validity, Reliability, and Sociotechnical Framework for Evaluating LLMs in Deployment Contexts
Ben Jenkins
Proceedings of the Workshop on Evaluating Evaluations (EvalEval)
Ben Jenkins
Proceedings of the Workshop on Evaluating Evaluations (EvalEval)
Static leaderboards summarize large language model (LLM) performance but offer weak evidence under shifting usage, noisy inputs, and plural stakeholder values. We present VRS-Eval, operationalizing deployment validity (benchmark vs. deployment score alignment), operational reliability (stability under a declared perturbation family), and sociotechnical alignment (metric vs. elicited rubric weights as a thin audit summary). With a reproducible simulator under explicit PB vs. PD shift and multi-turn interaction, we stress-test evaluation protocols in a controlled environment: under our main setting, benchmark-side scores (on PB) exceed estimated deploymentside utility scores (evaluated on trajectories from PD) by roughly 21–26% in relative terms across three metrics, with tight 95% percentile intervals (K=200). Failure mixtures emphasize overfitting, shift fragility, and rubric misalignment, consistent with firstvs. third-party reporting asymmetries (Reuel et al., 2025). A staged pipeline narrows the validity gap and raises reliability for the same generative story. Sensitivity sweeps over |Ω| and rubric-label rate preserve the rank ordering of harnesses, suggesting the qualitative conclusions are robust to plausible design-choice variation within the simulator. We discuss harness and accountability implications.
Thinking in Pictures: A Diagnostic Study of Visual vs. Textual Chain-of-Thought Reasoning in Vision-Language Models
Ben Jenkins
Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)
Ben Jenkins
Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)
Chain-of-thought (CoT) reasoning has become a standard technique for eliciting complex reasoning in large language models, and recent work has extended it to vision-language models (VLMs). However, virtually all multimodal CoT methods generate intermediate reasoning steps in natural language, even for inherently visual problems such as spatial reasoning, geometric manipulation, and object tracking. We ask a fundamental question: when should a VLM reason in words, and when should it reason in pictures? We present VisCoT-Diag, a diagnostic benchmark of 1,200 instances across five visual reasoning categories, and compare four CoT paradigms across four VLMs. Our results reveal a striking modality gap: textual CoT degrades performance by up to 17.5% on spatial transformation and 13.2% on multi-object tracking, while visual CoT yields gains of up to 23.1%. We identify three failure modes (spatial state collapse, transformation hallucination, tracking loss) and show that adaptive modality routing achieves 73.1% accuracy versus 68.9% for V-CoT-everywhere. We recommend practitioners use visual CoT for spatial tasks and textual CoT for compositional counting.