Ben Jenkins


2026

Static leaderboards summarize large language model (LLM) performance but offer weak evidence under shifting usage, noisy inputs, and plural stakeholder values. We present VRS-Eval, which operationalizes three properties: deployment validity (alignment between benchmark and deployment scores), operational reliability (stability under a declared perturbation family), and sociotechnical alignment (agreement between metric weights and elicited rubric weights, reported as a thin audit summary). With a reproducible simulator under explicit P_B vs. P_D shift and multi-turn interaction, we stress-test evaluation protocols in a controlled environment: under our main setting, benchmark-side scores (on P_B) exceed estimated deployment-side utility scores (evaluated on trajectories from P_D) by roughly 21–26% in relative terms across three metrics, with tight 95% percentile intervals (K=200). Failure mixtures emphasize overfitting, shift fragility, and rubric misalignment, consistent with first- vs. third-party reporting asymmetries (Reuel et al., 2025). A staged pipeline narrows the validity gap and raises reliability for the same generative story. Sensitivity sweeps over |Ω| and rubric-label rate preserve the rank ordering of harnesses, suggesting the qualitative conclusions are robust to plausible design-choice variation within the simulator. We discuss harness and accountability implications.
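As an illustration of the reported quantity, the sketch below computes a relative benchmark-to-deployment gap with a percentile interval. It rests on assumptions not stated in the abstract: the gap is taken relative to the deployment-side mean, K is read as the number of bootstrap resamples, and the function names and input lists are hypothetical.

```python
import random


def relative_gap(bench_scores, deploy_scores):
    """Relative excess of the benchmark mean over the deployment mean.

    Assumption: the gap is normalized by the deployment-side mean.
    """
    mean_bench = sum(bench_scores) / len(bench_scores)
    mean_deploy = sum(deploy_scores) / len(deploy_scores)
    return (mean_bench - mean_deploy) / mean_deploy


def percentile_interval(bench_scores, deploy_scores, k=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the relative gap.

    Assumption: k plays the role of K=200, read here as the number of
    bootstrap resamples; both score lists are resampled independently.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(k):
        bench_sample = rng.choices(bench_scores, k=len(bench_scores))
        deploy_sample = rng.choices(deploy_scores, k=len(deploy_scores))
        gaps.append(relative_gap(bench_sample, deploy_sample))
    gaps.sort()
    # Nearest-rank style lookup of the alpha/2 and 1 - alpha/2 quantiles.
    lower = gaps[int((alpha / 2) * (k - 1))]
    upper = gaps[int((1 - alpha / 2) * (k - 1))]
    return lower, upper
```

For example, a benchmark mean of 0.75 against a deployment mean of 0.60 gives a relative gap of 0.25 under this normalization; normalizing by the benchmark mean instead would give 0.20, so the choice of denominator should be stated alongside any reported figure.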
Chain-of-thought (CoT) reasoning has become a standard technique for eliciting complex reasoning in large language models, and recent work has extended it to vision-language models (VLMs). However, virtually all multimodal CoT methods generate intermediate reasoning steps in natural language, even for inherently visual problems such as spatial reasoning, geometric manipulation, and object tracking. We ask a fundamental question: when should a VLM reason in words, and when should it reason in pictures? We present VisCoT-Diag, a diagnostic benchmark of 1,200 instances across five visual reasoning categories, and compare four CoT paradigms across four VLMs. Our results reveal a striking modality gap: textual CoT degrades performance by up to 17.5% on spatial transformation and 13.2% on multi-object tracking, while visual CoT yields gains of up to 23.1%. We identify three failure modes (spatial state collapse, transformation hallucination, tracking loss) and show that adaptive modality routing achieves 73.1% accuracy versus 68.9% for V-CoT-everywhere. We recommend practitioners use visual CoT for spatial tasks and textual CoT for compositional counting.
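The routing recommendation can be sketched as a simple lookup from task category to CoT modality. This is a minimal illustration, not the routing method evaluated in the paper: the category keys, the modality labels, and the fallback choice are all hypothetical, and only the three categories named in the abstract are mapped.

```python
VISUAL_COT = "visual_cot"
TEXTUAL_COT = "textual_cot"

# Hypothetical category keys; the mapping follows the abstract's stated
# recommendation (visual CoT for spatial tasks, textual CoT for counting).
ROUTES = {
    "spatial_transformation": VISUAL_COT,
    "multi_object_tracking": VISUAL_COT,
    "compositional_counting": TEXTUAL_COT,
}


def route_modality(category, default=VISUAL_COT):
    """Return the CoT modality for a task category.

    Assumption: unmapped categories fall back to visual CoT, mirroring
    the V-CoT-everywhere baseline; this default is illustrative only.
    """
    return ROUTES.get(category, default)
```

A static table like this only captures the per-category recommendation; a router that decides per instance would need a classifier or confidence signal, which the abstract does not describe.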