Workshop on Evaluating Evaluations (2026)


up

pdf (full)
bib (full)
Proceedings of the Workshop on Evaluating Evaluations (EvalEval)

Current machine learning models are evaluated through behavioral snapshots, with benchmark accuracies, win rates and outcome-based metrics. Model explanations and evaluations, however, are fundamentally intertwined: understanding why a model produces a behavior can be as important as measuring what it produces. If we trusted interpretability, we argue that it can serve not merely as diagnostics but as a richer and more principled form of model evaluation beyond surface-level performance metrics. We explore three ways interpretability can function evaluatively: (1) fixing problems by identifying the root causes of unwanted behavior, (2) detecting subtly faulty mechanisms that invalidate model outputs, and (3) predicting potential issues before they arise by fully understanding the model’s weaknesses. To fulfill its evaluative potential, we argue that interpretability methods must generate claims that are falsifiable, reproducible, and predictive—that is, interpretability must meet scientific standards.
Large language models (LLMs) are increasingly used as collaborative assistants, yet dominant NLP evaluation practices remain centered on aggregate metrics such as accuracy and fluency. These approaches often overlook behaviors that are critical in human-facing settings (e.g., consistency across multiple turns and iterative refinement). In this paper, we examine limitations of current NLP evaluation practices and introduce TCR, a structured framework for evaluating human–AI interaction using educational LLM assistants as an illustrative example. TCR emphasizes dimensions such as transparency, consistency, and refinement. We further present structured evaluation prompts and illustrative interaction examples demonstrating how structured evaluation can complement aggregate metrics and LLM-as-a-judge approaches. Our work highlights the need for more human-centered evaluation practices for interactive LLM systems.
AI ethics guidelines for humanitarian settings have grown in number and scope. Whether they produce their intended outcomes depends on which deployers are expected to follow them. These guidelines respond to documented risks: surveillance, data misuse, and discriminatory outcomes affecting refugee populations. For high-risk applications such as biometric identification and asylum adjudication, the concerns they address are genuine. Many differentiate risk tiers in principle, yet the compliance expectations they establish (staff capacity, technical infrastructure, formal evaluation) reflect the organizational contexts in which they were developed. Many nonprofits providing frontline services to refugees operate with limited administrative capacity. When compliance requirements exceed what these organizations can meet, formal AI adoption stalls, while informal adoption proceeds without oversight or recourse. Current guidelines also tend to treat non-adoption as a neutral default, without accounting for the service gaps that follow when AI-assisted language access is unavailable. Drawing on collaboration with refugee-serving practitioners, we show that this gap between governance design and organizational reality has consequences for the people these guidelines are meant to protect. Evaluating AI guidelines, we argue, requires the same realist logic that evaluation research has long applied to social programs: not "does this guideline exist?" but "for which deployers, under what conditions, and does it produce its intended protective outcomes?"
This paper studies whether large language models can extract useful sentiment signals from firm-specific financial news when evaluation accounts for realistic market frictions. Many financial NLP studies report strong offline prediction results, but these do not always show whether model outputs remain useful once trading constraints are imposed. I address this gap by evaluating sentiment models through classification performance, return predictability, and implementable portfolio performance. The analysis links Refinitiv News Analytics to CRSP and begins with 3,129,924 U.S. news items published between January 1, 2010 and January 30, 2026. Filtering retains single-firm stories, removes redundant coverage using a five-day cosine-similarity novelty screen, and restricts the sample to tradable stocks with positive bid and ask quotes, minimum share and dollar volume thresholds, quoted spreads below 20%, and available Amihud illiquidity ratios and Kyle’s lambda estimates. The final sample contains 973,481 tradable news items linked to 3,452 firms. I compare six sentiment approaches: LLaMA–3, OPT, RoBERTa, BERT, FinBERT, and the Loughran–McDonald dictionary. LLaMA–3 achieves the strongest classification performance with 78.2% accuracy and produces the largest predictive coefficients in panel regressions. Daily rebalanced long–short portfolios with a 5 bps trading cost show that the LLaMA–3 strategy earns a cumulative return of approximately 180% from June 2024 to January 2026, followed by OPT with 155% and RoBERTa with 120%, while the dictionarybased strategy loses 9%. The results show that evaluation becomes more informative when financial NLP models are assessed beyond offline accuracy and under realistic deployment constraints. High-capacity language models retain economically meaningful predictive content under market frictions, whereas simpler lexicon-based methods do not.
Standard benchmarks for large language models (LLMs) assume that task feedback is truthful, but real-world reasoning often requires processing unreliable or adversarial information. We introduce WordleArenas, a benchmark platform that evaluates LLM reasoning robustness across a deception gradient. Building on Wordle and its deceptive variant Fibble (Chusap et al., 2025), we generalize to Fibblek (k = 0, . . . , 5 lies per row), creating a controlled evaluation of LLM robustness to misinformation. Across six arenas — standard Wordle (0 lies per row) through Fibble5 (5 lies per row) — we evaluate 41 models from 10 providers across 3,749 games. We find that (1) even one lie per row causes catastrophic performance drops (average win rate falls from 41.1% to 18.7%), (2) a sharp deception cliff emerges at 2–3 lies where nearly all models collapse to ≤3% win rate, and (3) model robustness to deception is poorly predicted by standard benchmark rankings. A surprising Fibble5 recovery emerges: some models recover partial performance when all feedback lies (average 9.5%), outperforming Fibble3 (0.3%) and Fibble4 (0.4%), because knowing that every tile lies restores deterministic — though partial — information. Our results demonstrate that truthful-feedback evaluations systematically overestimate LLM reasoning capabilities and that deception-aware benchmarks are essential for assessing real-world robustness. All code and data are publicly available.
Recent work identifies a stated–revealed (SvR) preference gap in language models (LMs): a mismatch between the values models endorse and the choices they make in context. Existing evaluations rely heavily on binary forcedchoice prompting, which entangles genuine preferences with artifacts of the elicitation protocol. We systematically study how elicitation protocols affect SvR correlation across 24 LMs. Allowing neutrality and abstention during stated preference elicitation allows us to exclude weak signals, substantially improving Spearman’s rank correlation (ρ) between volunteered stated preferences and forced-choice revealed preferences. However, further allowing abstention in revealed preferences drives ρ to near-zero or negative values due to high neutrality rates. Finally, we find that system prompt steering using stated preferences during revealed preference elicitation does not reliably improve SvR correlation on AIRiskDilemmas. Together, our results show that SvR correlation is highly protocol-dependent and that preference elicitation requires methods that account for indeterminate preferences.
Automated LLM vulnerability scanners are increasingly used to assess security risks by measuring different attack type success rates (ASR). Yet the validity of these measurements hinges on an often-overlooked component: the evaluator who determines whether an attack has succeeded. In this study, we demonstrate that commonly used open-source scanners exhibit measurement instability that depends on the evaluator component. Consequently, changing the evaluator while keeping the attacks and model outputs constant can significantly alter the reported ASR. To tackle this problem, we present a two-phase, reliability-aware evaluation framework. In the first phase, we quantify evaluator disagreement to identify attack categories where ASR reliability cannot be assumed. In the second phase, we propose a verification-based evaluation method where evaluators are validated by an independent verifier, enabling reliability assessment without relying on extensive human annotation. Applied to the widely used Garak scanner, we observe that 22 of 25 attack categories exhibit evaluator instability, reflected in high disagreement among evaluators. Our approach raises evaluator accuracy from 72% to 89% while enabling selective deployment to control cost and computational overhead. We further quantify evaluator uncertainty in ASR estimates, showing that reported vulnerability scores can vary by up to ±33% depending on the evaluator. Our results indicate that the outputs of vulnerability scanners are highly sensitive to the choice of evaluators. Our framework offers a practical approach to quantify unreliable evaluations and enhance the reliability of measurements in automated LLM security assessments.
This paper presents the first systematic comparison investigating whether Large Reasoning Models (LRMs) are superior judges to non-reasoning LLMs. Our empirical analysis yields four key findings: 1) LRMs outperform non-reasoning LLMs in terms of judgment accuracy, particularly on reasoning-intensive tasks; 2) LRMs demonstrate superior evaluation instruction-following capabilities; 3) LRMs exhibit enhanced robustness against adversarial attacks targeting judgment tasks; 4) However, LRMs still exhibit strong evaluation biases. To mitigate this bias vulnerability, we propose PlanJudge, a lightweight evaluation strategy that prompts the model to generate an explicit evaluation plan before executing the judgment. Despite its simplicity, our experiments demonstrate that PlanJudge significantly mitigates biases in LLM-as-a-Judge while preserving overall judgment accuracy1.
Large language models (LLMs) are often evaluated on benchmarks that rely on surfacelevel instructions, obscuring what defines highquality performance. We argue that tasks can be more precisely characterized through principles: human-readable rules that specify what matters for a good response to the task. Our study proposes a framework to automatically extract and generate task-level principles for data generation and evaluation. Using this approach, we build a benchmark of over 20K principle-aligned instances, enabling controllable data creation and fine-grained, interpretable assessment of LLMs. Experiments show that principles both improve output quality and scale evaluation beyond manual curation, offering a new recipe for principled assessment of LLM capabilities.1
Mathematical benchmarks consisting of a range of mathematics problems are widely used to evaluate the reasoning abilities of large language models, yet little is known about how their structural properties influence model behaviour. In this work, we investigate two structural length variables, prompt length and solution length, and analyse how they relate to model performance on a newly constructed adversarial dataset of expert-authored mathematics problems. Across five evaluated models, we find that both prompt length and solution length are positively associated with model failure. These associations are statistically significant but modest, and we interpret them as descriptive rather than causal. We also include a secondary, exploratory analysis of cross-model disagreement. Because disagreement measures based on variance are mechanically constrained by mean failure, we treat this part of the analysis cautiously. Overall, our main finding is that structural length is linked to empirical difficulty in this benchmark, suggesting that length should be considered as a potential confounder when interpreting mathematical model evaluations.
Benchmarks for assessing large language model (LLM) capabilities have been criticized for a lack of construct validity. Here, we focus on an often overlooked dimension of a benchmark’s validity: namely, the functional mapping between a benchmark’s numerical score and the underlying quantity the benchmark purports to measure. What licenses the assumption that equivalent intervals on a scale correspond to equivalent differences in the underlying capability? We argue that this question is not merely theoretical: the form of this mapping (e.g., linear vs. logarithmic vs. exponential) could and should influence decisions about deployment and regulatory policy. Drawing on work from the history and philosophy of science, we discuss an analogous problem in the early history of thermometry termed the problem of nomic measurement, as well as the epistemic practices that enabled scientists to overcome these challenges. We then ask whether a similar process of epistemic iteration can overcome this problem in benchmarking. Despite clear differences between temperature and “capabilities” as constructs, we argue that some modest success could be achievable in the domain of benchmarking—but that this depends crucially on the clear articulation of a researcher’s goals and theoretical commitments.
This paper introduces a curated dataset for diagnosing implicit gender bias through feminine tropes in narratives generated by large language models. Drawing from a crowd-sourced database of tropes from television media, we create prompts that elicit narratives from LLMs based on historically gendered tropes. We find that LLMs tend to revert to feminine characters in these narratives, even when prompted without explicit gender references, and also when prompted with non-binary (“they/them”) gender references for the main character. In some cases, even when prompted with masculine pronouns (“he/him”), LLMs still use feminine pronouns to describe the main character. The paper describes our dataset creation process and the evaluation of four open-weight models. We discuss implications for future research in mitigating implicit gender bias and its associated representational harms in LLMs, as well as the complex relationship between language models and societal values.
Effective AI risk assessment relies on the quality of evaluations. Currently, there are large quality differences, such as in construct validity and annotation, between existing benchmarks. In this work, we propose a quality scorecard for benchmarks designed to make this diversity easier to navigate. The scorecard employs two main components: dimensions, which provide granular scores of an evaluation under that dimension, and classifications, which correspond to concrete use-cases ranging from research to post-deployment. By establishing a common language and objective methods, this framework aims to aid in transparency and raise the baseline quality of benchmarks used across the ecosystem.
Tremendous efforts have been put into evaluating the inclusivity and effectiveness of AI systems across cultures. However, the cultural capabilities considered in much of the literature remain vaguely defined, are referred to using interchangeable terminology, and are typically limited to recalling accurate information about various demographics, regions, and nationalities. To address this construct ambiguity, we draw from Intercultural Communication scholarship and propose a three-level taxonomy of AI-relevant cultural capabilities: Cultural Awareness answers “Does the model know?”, Cultural Sensitivity answers “How does it frame its knowledge?”, and Cultural Competence answers “Can it adapt as the interaction evolves?”. Beyond conceptual clarification, we position this taxonomy as a practical tool for improving the validity and interpretability of AI evaluation in real-world, multicultural settings. Without such construct clarity, evaluation results risk overstating model capabilities and may lead to inappropriate deployment decisions in culturally sensitive contexts.
Evaluating large language models (LLMs) requires selecting benchmarks that fit the intended use case. However, the rapid growth of benchmarks has made discovery and comparison difficult, because practitioners must assemble information across papers, repositories, and dataset cards with heterogeneous metadata, inconsistent terminology, and uneven documentation. Prior work improves individual benchmark documentation and quality assessment, but does not provide a uniform way to compare benchmarks during discovery. We survey practitioners, analyze multi-source benchmark metadata, and identify the fields needed for effective benchmark discovery. We introduce BenchNavigator, a prototype that organizes heterogeneous metadata into a coherent, provenance-preserving interface aligned with practitioner priorities. Our results show that benchmark metadata can be presented in a comparable form without imposing new reporting burdens on benchmark producers. We frame this contribution as discovery infrastructure, not as a method for scoring benchmark quality or replacing contextual evaluation.
Static leaderboards summarize large language model (LLM) performance but offer weak evidence under shifting usage, noisy inputs, and plural stakeholder values. We present VRS-Eval, operationalizing deployment validity (benchmark vs. deployment score alignment), operational reliability (stability under a declared perturbation family), and sociotechnical alignment (metric vs. elicited rubric weights as a thin audit summary). With a reproducible simulator under explicit PB vs. PD shift and multi-turn interaction, we stress-test evaluation protocols in a controlled environment: under our main setting, benchmark-side scores (on PB) exceed estimated deploymentside utility scores (evaluated on trajectories from PD) by roughly 21–26% in relative terms across three metrics, with tight 95% percentile intervals (K=200). Failure mixtures emphasize overfitting, shift fragility, and rubric misalignment, consistent with firstvs. third-party reporting asymmetries (Reuel et al., 2025). A staged pipeline narrows the validity gap and raises reliability for the same generative story. Sensitivity sweeps over |Ω| and rubric-label rate preserve the rank ordering of harnesses, suggesting the qualitative conclusions are robust to plausible design-choice variation within the simulator. We discuss harness and accountability implications.
Rigorous evaluation of domain-specific language models requires benchmarks that are comprehensive, contamination-resistant, and maintainable. Static, manually curated datasets do not satisfy these properties. We present a graph-based evaluation harness that transforms structured clinical guidelines into a queryable knowledge graph and dynamically instantiates evaluation queries via graph traversal. The framework provides three guarantees: (1) complete coverage of guideline relationships; (2) surface-form contamination resistance through combinatorial variation; and (3) validity inherited from expert-authored graph structure. Applied to the WHO IMCI guidelines, the harness generates clinically grounded multiple-choice questions spanning symptom recognition, treatment, severity classification, and follow-up care. Evaluation across five language models reveals systematic capability gaps. Models perform well on symptom recognition but show lower accuracy on treatment protocols and clinical management decisions. The framework supports continuous regeneration of evaluation data as guidelines evolve and generalizes to domains with structured decision logic. This provides a scalable foundation for evaluation infrastructure. Data and Code Availability The WHO IMCI handbook is publicly available (WHO, 2014). Our graph construction, question generation code, and generated question dataset are available at https://github.com/jessicalundin/ graph_testing_harness.
RAG evaluations often rely on citations or retrieved evidence traces for correctness checks, provenance claims, and audits, implicitly assuming that evidence remains reproducible under routine retrieval settings. We test this assumption in a controlled diagnostic study where queries, embeddings, and decoding are fixed while retrieval depth, chunk size, and overlap vary. We call the resulting change in attributed evidence retrieval jitter and measure evidence identity at two levels: document (doc_id) and exact cited span (doc_id, span_hash). Across BEIR ArguAna and SciFact, we observe a consistent Stability Gap: document overlap remains moderate while span overlap often collapses, including many cases of total span turnover despite non-empty retrieval. We interpret span-level instability as a diagnostic of exact evidence-trace reproducibility, not semantic equivalence. These findings motivate reporting stability diagnostics alongside citation-based evaluation metrics for more reproducible evaluation practice.
AI systems are embedded in economic production, public discourse, governance, and personal decision-making, yet there is little empirical infrastructure for tracking whether this integration erodes humans’ ability to meaningfully shape outcomes that affect their lives. We argue that measuring AI-induced disempowerment is both urgent and tractable, and lay out a research agenda for doing so. We first operationalize disempowerment through Sen’s model of agency and a three-layer model of exposure, erosion, and lock-in, applied across economic, political, and cultural domains at individual, institutional, and civilizational scales. We survey existing measurement efforts and show that current work clusters almost entirely at exposure, leaving erosion and lock-in largely unaddressed. We then propose six concrete metrics (centaur evaluations, disempowerment perception surveys, AI content saturation and cultural convergence monitoring, monitoring capital flow to and from human labor, human task frontier tracking, and institutional ethnography) and identify which actors are best positioned to implement each. We close by discussing limitations and open challenges, including construct validity across levels of analysis, causal attribution, the distinction between disempowerment and adaptation, and the political economy of measurement.
Recent work on evaluating the moral competence of large language models (LLMs) has focused primarily on what we call the moral value problem, i.e., whether model outputs align with human moral values. In contrast, the moral norm problem, i.e., whether models can identify and correctly apply context-sensitive moral norms, remains underexplored. We posit that this imbalance stems from the field’s reliance on descriptive ethics frameworks, such as Moral Foundations Theory and Kohlberg’s stages of moral development, which emphasize value representation over normative application. We review existing benchmarks and evaluation methods, and show that they cluster heavily around the value problem, while discussion regarding normative ethics remains underrepresented. We identify three crucial gaps: (i) the absence of high-quality groundtruth data for moral norms and their applications, (ii) insufficient evaluation of intermediate reasoning processes, and (iii) limited attention to the identification of morally relevant features in context. Subsequently, we propose a research agenda that includes the development of standardized formal representations for normative theories, the construction of expert-annotated datasets capturing norm application, and evaluation protocols that explicitly distinguish between values-level and normslevel competence. Our goal is to encourage a more systematic study of normative reasoning in LLMs.
The evaluation of explainable AI (XAI) methods is affected by a lack of standardization. Metrics are inconsistently defined, incompletely reported, and rarely validated against common baselines. In this paper, we identify transparency of evaluation reporting as a central, under-addressed problem. We propose the XAI Evaluation Card, a documentation template analogous to model cards, designed to accompany any study that introduces an XAI evaluation metric. The card covers explicit declaration of target properties, grounding levels, metric assumptions, validation evidence, gaming risks, and known failure cases. We argue that adopting this template as a community norm would reduce evaluation fragmentation, support meta-analysis, and improve accountability in XAI research.