Workshop on Natural Language Generation, Evaluation, and Metrics (2026)
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Simon Mille | Sebastian Gehrmann | Patrícia Schmidtová | Ondřej Dušek | Marzieh Fadaee | Kyle Lo | Enrico Santus | Gabriel Stanovsky
CoSy: Conversational Synthesis for Grounded Question Answering
Patrick Huber | Arash Einolghozati | Rylan Conway | Kanika Narang | Matt Smith | Waqar Nayyar | Adithya Sagar | Ahmed A Aly | Akshat Shrivastava
High-quality, large-scale conversational datasets are scarce, making it difficult to train on-device language models (~1B parameters) as effective assistants. We introduce CoSy (Conversational Synthesis), a novel framework for generating diverse, steerable, multi-turn conversations at scale. CoSy combines three key mechanisms: (1) conversational graphs that ensure natural dialogue flow, (2) turn-based prompt augmentations for diversity, and (3) explicit linguistic phenomena for coherence. We evaluate CoSy on conversational grounded reasoning tasks (i.e., answering questions based on contextual information), a core on-device use case. Our on-device-sized models trained on CoSy-synthesized data achieve competitive performance with human-annotated baselines and outperform instruction-tuned models of up to 70B parameters in zero-shot settings.
VAIDYA: Validated Agents for Intelligent Diagnosis and Yielded Analysis
Kalash Shah | Gautam Bhutani | Rohitaswa Sarbhangia | J Snehan
Recent advances in large language models (LLMs) have demonstrated impressive medical reasoning capabilities. However, current evaluation methods are mostly limited to static case vignettes and multiple-choice questions, which fail to reflect the complexity, uncertainty, and iterative nature of real-world clinical decision-making. To bridge this gap, we propose **DiagBench**, a novel benchmark where models interact dynamically with an LLM-based Patient Simulator, querying relevant clinical details to formulate accurate diagnoses. To complement this, we introduce **MedConvBench**, a diagnostic conversation benchmark designed to assess the relevance and quality of model-generated clinical reasoning. To further address the interpretability and alignment challenges of AI-assisted diagnosis, we develop a modular and medically grounded framework called **VAIDYA** that mirrors a physician’s stepwise diagnostic reasoning. This structured approach improves transparency and yields substantial performance gains over base LLMs. Our work takes a critical step toward aligning AI systems with real-world clinical practices by combining dynamic interaction, interpretability, and clinical validation.
Self-Anchoring Calibration Drift in Large Language Models: How Multi-Turn Conversations Reshape Model Confidence
Harshavardhan
Self-Anchoring Calibration Drift (SACD) is a tendency for large language models (LLMs) to show systematic changes in expressed confidence when building iteratively on their own prior outputs across multi-turn conversations. Through a controlled three-condition study comparing Claude Sonnet 4.6, Gemini 3.1 Pro, and GPT-5.2 across factual, technical, and open-ended domains, we find that SACD is real but multiform: models exhibit distinct self-anchoring signatures ranging from active confidence suppression to calibration improvement suppression, with effects concentrated in open-ended domains. These findings challenge the adequacy of single-turn calibration evaluation for characterizing LLM reliability in realistic multi-turn deployment contexts. Code and data are available at https://github.com/hvardhan878/calibration-drift
Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models
Zefang Liu | Nam H Nguyen | Yinzhu Quan | Shi-Xiong Zhang
Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents a systematic empirical study of temporal tokenization for modeling event sequences with LLMs, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data’s statistical properties, highlighting temporal tokenization as a critical yet often overlooked design dimension in LLM-based event modeling.
“Be My Cheese?”: Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs
Madison Van Doren | Casey Ford | Jennifer Barajas | Riley VanMeter | Cory Holland
We present a large-scale human evaluation benchmark for assessing cultural localisation in machine translation produced by state-of-the-art multilingual large language models (LLMs). Existing MT benchmarks emphasise token-level and grammatical accuracy, but often overlook pragmatic and culturally grounded competencies required for real-world localisation. Building on a pilot study of 87 translations across 20 languages, we evaluate 7 multilingual LLMs across 15 target languages with 5 native-speaker raters per language. Raters scored both full-text translations and segment-level instances of culturally nuanced language (idioms, puns, holidays, and culturally embedded concepts) on an ordinal 0–3 quality scale; segment ratings additionally included an NA option for untranslated segments. Across full-text evaluations, mean overall quality is modest (1.68/3): GPT-5 (2.10/3), Claude Sonnet 3.7 (1.97/3), and Mistral Medium 3.1 (1.84/3) form the strongest tier with fewer catastrophic failures. Segment-level results show sharp category effects: holidays (2.20/3) and cultural concepts (2.19/3) translate substantially better than idioms (1.65/3) and puns (1.45/3), and idioms are most likely to be left untranslated. These findings demonstrate a persistent gap between grammatical adequacy and cultural resonance. To our knowledge, this is the first multilingual, human-annotated benchmark focused explicitly on cultural nuance in translation and localisation, highlighting the need for culturally informed training data, improved cross-lingual pragmatics, and evaluation paradigms that better reflect real-world communicative competence.
Component Transfer Can Exceed Full Model Performance: Investigating Post-Trained Mixture-of-Experts
Rabin Tiwari
Post-training methods such as supervised fine-tuning and preference optimization are widely used to align large language models, yet how their benefits distribute across architectural components and transfer across tasks and prompts remains unclear. In this work, we analyze component-level transfer in a Mixture-of-Experts language model by selectively replacing routers, attention modules, and expert networks between two post-trained Mixture-of-Experts models trained with different post-training recipes and dataset mixtures. Starting from an SFT+DPO checkpoint, we systematically replace its components (routers, attention, experts) with those from a Tulu3 checkpoint and evaluate the impact of each replacement and their combinations on mathematical and scientific reasoning and a general-purpose classification task under zero-shot, few-shot, and Chain-of-Thought prompting. We find strong component-specific specialization: expert networks account for most gains on mathematical and scientific reasoning, while attention mechanisms consistently outweigh expert transfer on general tasks, and router transfer alone provides minimal benefit or harms performance. Prompting strategy further modulates these effects, with expert transfer degrading zero-shot science performance but improving few-shot reasoning. Strategically combining components from different model versions can in some cases match or exceed the performance of the best available model, motivating principled techniques for composing post-trained models into task- and prompt-specific systems without additional training.
Reassessing Extractive QA Datasets at Scale: LLM-as-a-Judge and In-Depth Analyses
Xanh Ho | Jiahao Huang | Florian Boudin | Akiko Aizawa
Extractive QA tasks are commonly evaluated using Exact Match (EM) and F1-score, but these metrics often fail to reflect true model performance. Recent studies have proposed using large language models (LLMs) as judges (LLM-as-a-judge), yet they often lack comprehensive evaluation across datasets and overlook key factors such as sensitivity to answer types, prompt variations, and self-preference bias. In this work, we conduct a systematic study of LLM-as-a-judge across four extractive QA datasets and various prompt variations, assessing multiple LLM families in both answering and judging roles. Our results show that LLM-as-a-judge judgments correlate much more strongly with human evaluations than EM (0.22) and F1 (0.40), achieving correlations up to 0.85 with open-source models. Further analysis reveals that LLM-as-a-judge performs particularly well on number-related answers but faces challenges with more complex types, such as job titles. Contrary to findings in other NLP tasks, we observe no self-preference bias, even when the same model serves as both QA model and judge. Finally, we find that prompt phrasing has minimal impact, and zero-shot, context-free judging often yields the best evaluation performance.
IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding
Sankalp Jajee | Ashutosh Kumar | Nikunj Kotecha | Vinija Jain | Aman Chadha | Sreyoshi Bhaduri
Indic languages, spoken by over 1.5 billion people, pose unique challenges for NLP due to their cultural richness, linguistic diversity, and structural complexity. We present IndicMMLU-Pro, a comprehensive benchmark extending the MMLU-Pro framework to nine major Indic languages: Hindi, Bengali, Gujarati, Marathi, Kannada, Punjabi, Tamil, Telugu, and Urdu. Covering a wide range of tasks in comprehension, reasoning, and generation, IndicMMLU-Pro offers a standardized evaluation framework to advance AI model development in Indic contexts. This paper details the benchmark’s design, taxonomy, and data curation, and establishes baseline results using state-of-the-art multilingual models. As an open resource, IndicMMLU-Pro aims to accelerate progress in Indic language technologies and support inclusive research in multilingual NLP.
Identifying Where Large Language Models Struggle in Answering Complex Questions
Xanh Ho | Florian Boudin | Saku Sugawara | Khoa Duong | Akiko Aizawa
We design experiments to identify where Large Language Models (LLMs) struggle when answering complex questions. Our focus is on two key stages, mirroring the human QA process: 1) question decomposition, where the model breaks down a complex question into sub-questions, and 2) sub-problem solving, where it addresses each sub-question to obtain the final response. We preprocess and expand three multi-hop datasets to create experimental datasets featuring explicit and implicit multi-hop questions, crowdsourced and templated questions, and varying numbers of hops. Our results show that larger models (Llama 3.1 70B and o1) excel at decomposing explicit multi-hop questions but struggle with implicit ones, while smaller models (e.g., Llama 3.1 8B) have difficulty with both. In the sub-problem solving stage, all models perform well on simple questions with context. Furthermore, we found no correlation between accuracy in the question decomposition stage and final QA performance (direct response), highlighting a key difference between human and LLM reasoning.
More Yap Less Meaning: Uncovering Self-Improvement Behavior in SLMs
Marina Igitkhanian | Erik Arakelyan
Recently, language models have made rapid progress across various domains and applications. However, their capability for self-improvement, i.e., whether they are adept at recognising and correcting flaws in their own reasoning, remains dubious. In this study, we address this question by constructing a sufficiency test to rigorously examine the self-correction capabilities of small language models (SLMs). We propose a minimal three-step self-correction pipeline that collects initial SLM answers, prompts the same model to generate hints for its incorrect responses given the ground truth, and feeds the model the same question with its own feedback to refine the initial answer. We evaluate a variety of instruction-tuned and reasoning SLMs in this experimental setup on arithmetic and logical reasoning benchmarks. Our findings show that SLMs with injected hint sentences yield only a 4.4% gain over initial question-answering accuracy. Even though the correct answer was provided alongside the model’s incorrect reasoning, the evaluated SLMs fail to understand what was missing in their reasoning and show minimal semantic difference between hints that lead to corrections and ones that do not. Furthermore, our experiments show that longer hints are positively correlated with incorrect final answers, suggesting that longer deliberation on problems can hinder the reasoning process, meaning that SLMs do not necessarily scale in performance with a larger compute budget.
Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents
Anh Ta | Junjie Zhu | Shahin Shayandeh
Tool-calling agents are evaluated on tool selection, parameter accuracy, and scope recognition, yet LLM trajectory assessments remain inherently *post-hoc*. Disconnected from the active execution loop, such assessments identify errors that are usually addressed through prompt-tuning or retraining, and fundamentally cannot course-correct the agent in real time. To close this gap, we move evaluation into the execution loop at *inference time*: a specialized reviewer agent evaluates provisional tool calls *prior to* execution, shifting the paradigm from post-hoc recovery to proactive evaluation and error mitigation. In practice, this architecture establishes a clear separation of concerns between the primary execution agent and a secondary review agent. As with any multi-agent system, the reviewer can introduce new errors while correcting others, yet no prior work to our knowledge has systematically measured this tradeoff. To quantify this tradeoff, we introduce *Helpfulness-Harmfulness metrics*: helpfulness measures the percentage of base agent errors that feedback corrects; harmfulness measures the percentage of correct responses that feedback degrades. These metrics directly inform reviewer design by revealing whether a given model or prompt provides net positive value. We evaluate our approach on BFCL (single-turn) and 𝜏2-Bench (multi-turn stateful scenarios), achieving +5.5% on irrelevance detection and +7.1% on multi-turn tasks. Our metrics reveal that reviewer model choice is critical: the reasoning model o3-mini achieves a 3:1 benefit-to-risk ratio versus 2.1:1 for GPT-4o. Automated prompt optimization via GEPA provides an additional +1.5–2.8%. Together, these results demonstrate a core advantage of separating execution and review: the reviewer can be systematically improved through model selection and prompt optimization, without retraining the base agent.
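The two metrics defined in the abstract above can be illustrated with a minimal Python sketch. It assumes per-example boolean correctness flags for the base agent and for the agent after reviewer feedback; the function name `helpfulness_harmfulness` and this input format are hypothetical choices for illustration, not the authors' implementation.

```python
def helpfulness_harmfulness(base_correct, reviewed_correct):
    """Return (helpfulness %, harmfulness %) for paired correctness flags.

    base_correct[i]     -- True if the base agent alone was correct on example i
    reviewed_correct[i] -- True if the agent was correct after reviewer feedback
    (Both inputs are an assumed format chosen for this sketch.)
    """
    if len(base_correct) != len(reviewed_correct):
        raise ValueError("inputs must have the same length")

    pairs = list(zip(base_correct, reviewed_correct))
    base_errors = sum(1 for b, _ in pairs if not b)
    base_hits = len(pairs) - base_errors

    # Helpfulness: share of base-agent errors that feedback corrects.
    fixed = sum(1 for b, r in pairs if not b and r)
    # Harmfulness: share of correct base responses that feedback degrades.
    broken = sum(1 for b, r in pairs if b and not r)

    helpfulness = 100.0 * fixed / base_errors if base_errors else 0.0
    harmfulness = 100.0 * broken / base_hits if base_hits else 0.0
    return helpfulness, harmfulness


# Toy data: 4 correct and 2 incorrect base responses.
base = [True, True, True, True, False, False]
reviewed = [True, True, True, False, True, False]
print(helpfulness_harmfulness(base, reviewed))  # (50.0, 25.0)
```

In the toy data, feedback corrects one of two base errors and degrades one of four correct responses. How the reported benefit-to-risk ratio is derived from these quantities is not stated in the abstract, so the sketch does not compute it.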
RE-AD: Real-Time Requirement Adherence for Data Labeling
Siddarth Malreddy | Ishan Nigam | Akshay Arora | Nikhil Mittal | Subrat Sahu
Human-annotated data remains fundamental to training frontier Large Language Models (LLMs). However, crowd-sourced annotations often suffer from quality issues stemming from annotator misunderstanding or lack of engagement. To address this, we introduce a real-time requirement adherence (RE-AD) framework that leverages LLMs to proactively validate labeling quality. Our methodology involves decomposing Standard Operating Procedures (SOPs) into atomic rules via self-reflection, categorizing them by complexity, and applying tiered validation strategies. Evaluated on a synthetic benchmark, the system achieved an F1 score of 0.749. Furthermore, production deployment resulted in annotators accepting and fixing 82% of the errors flagged by the framework. We include ablation studies to demonstrate the impact of our core design decisions.
General-purpose language models are trained to produce varied natural language outputs, but for some tasks, like annotation or classification, we need more specific output formats. LLM systems increasingly support structured output, which enforces formats by sampling tokens according to a grammar — but also unpredictably reduces downstream performance. Are there systematic differences between grammars that appear semantically (and often visually) similar to humans? To answer this, we test four popular model families with five varying output formats on four common NLP benchmarks. We find all models perform most accurately when guided to use formats respecting convention, such as letters for multiple choice and real numbers for numerical prediction. Performance also improves by 5%-10% when guiding models to return tokens incorporating leading whitespace, with smaller models benefiting the most. We find leading whitespace helps models avoid structural deficiencies in subword token representations. We finally present best practices for researchers using language models as zero-shot classifiers with structured output.
An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability
Yusuke Yamauchi | Taro Yano | Masafumi Oyamada
As large language models (LLMs) continue to advance, reliable evaluation methods are essential—particularly for open-ended, instruction-following tasks. LLM-as-a-Judge enables automatic evaluation using LLMs as evaluators, but its reliability remains uncertain. In this work, we analyze key factors affecting its trustworthiness, focusing on alignment with human judgments and evaluation consistency. Using BIGGENBench and EvalBiasBench, we study the effects of evaluation design, decoding strategies, and Chain-of-Thought (CoT) reasoning in evaluation. Our results show that evaluation criteria are critical for reliability, non-deterministic sampling improves alignment with human preferences over deterministic evaluation, and CoT reasoning offers minimal gains when clear evaluation criteria are present.
In many human-annotated NLP tasks involving ambiguity or subjective judgment, annotator disagreement reflects epistemic uncertainty rather than noise. Soft labeling (SL), which represents annotations as probability distributions rather than majority-vote (MV) labels, preserves this uncertainty and can improve downstream performance. We extend this perspective to LLM-based annotation by formalizing LLM soft labeling as introducing controlled variation in model-generated annotations to approximate the latent variability underlying human annotations. We distinguish two sources of variation: model-induced (e.g., stochastic decoding and model ensembles) and human-approximated (e.g., persona prompting and human-calibrated in-context annotation). Using the Gab Hate and GoEmotions datasets, we show that SL training consistently outperforms MV training under stronger LLM-based annotation strategies. Model ensembles produce the most informative soft-label distributions, achieving the best human–LLM agreement and downstream classification performance. These findings suggest that scalable LLM-based annotation pipelines can model epistemic uncertainty through diverse model-level variation without explicitly simulating human attributes.
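The contrast between majority-vote and soft labels in the abstract above can be sketched in a few lines of Python. This is only an illustration: the helper names and the toy annotations are hypothetical, and the annotations could equally come from human annotators or from repeated LLM runs.

```python
from collections import Counter


def majority_vote(annotations):
    """Collapse several annotations of one item into its most frequent label."""
    return Counter(annotations).most_common(1)[0][0]


def soft_label(annotations, classes):
    """Represent the annotations as a probability distribution over classes."""
    counts = Counter(annotations)
    total = len(annotations)
    return [counts[c] / total for c in classes]


# Hypothetical annotations for a single item (e.g. from four ensemble members).
votes = ["hate", "not_hate", "hate", "hate"]
classes = ["hate", "not_hate"]

print(majority_vote(votes))        # hate
print(soft_label(votes, classes))  # [0.75, 0.25]
```

The majority-vote label discards the one dissenting annotation, while the soft label keeps it as probability mass that a classifier can be trained against.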
Mind the Gap... or Not? How Translation Errors and Evaluation Details Skew Multilingual Results
Jan-Thorsten Peter | David Vilar | Tobias Domhan | Dan Malkin | Markus Freitag
Most current large language models (LLMs) support a wide variety of languages in addition to English, including high-resource languages (e.g. German, Chinese, French), as well as low-resource ones (e.g. Swahili, Telugu). In addition they have shown impressive capabilities in different domains, like coding, science and math. In this paper, taking math as an example domain, we study the performance of different LLMs across languages. Experimental results show that there exists a non-negligible and consistent gap in the performance of the models across languages. Interestingly, and somewhat against expectations, the gap exists for both high- and low-resource languages. These results should impact further research into cross-lingual capability generalization for next generation LLMs. Or they would, if it weren’t for the fact that they are false. By analyzing one of the standard multilingual math benchmarks (MGSM), we determine that several translation errors are present in the data. Furthermore, the lack of standardized answer extraction from LLM outputs further influences the final results. We propose a method for semi-automatic quality assurance to address the first issue at scale, and give recommendations to address the second one. Combining these two approaches we show that the aforementioned language gap mostly disappears, leading to completely different conclusions from our research. We additionally release the corrected dataset to the community.
MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
Jaeyun Lee | Junyoung Koh | Zeynel Tok | Hunar Batra | Ronald Clark
Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchmark for constraint-level judge evaluation in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels in yes, partial, no, and controlled response-side perturbations. The evaluation protocol further includes evaluation prompt variants to test judge stability. We evaluate proprietary and open-source LLM judges using both correctness and inconsistency metrics, distinguishing intrinsic inconsistency under stochastic decoding from procedural inconsistency under prompt and response perturbations. Our results show that judge reliability has multiple dimensions: strong overall performance does not guarantee equally reliable detection across label categories, especially for rarer partial and no cases. Judges with higher correctness do not always have lower inconsistency. Evaluation with reasoning improves correctness but does not uniformly improve stability. These findings motivate evaluating LLM judges at the constraint level to study these failure modes.
MedAct: Removing the Human Bottleneck in Benchmarking Clinical LLM Safety
Arjun Krishna | Brian Pridgen | Max Silverstein
Most medical benchmarks for large language models test factual recall through multiple-choice questions, but on-the-ground physicians do not have the luxury of four options to choose from. NOHARM (Wu et al., 2025) demonstrated this limitation using 100 real eConsult cases annotated by 29 board-certified physicians, showing that action-level evaluation reveals omission and commission failure modes invisible to multiple-choice tests. However, NOHARM’s cases are closed and their creation required substantial expert physician time, creating a human bottleneck that limits the scalability and openness of this evaluation approach. We present MedAct, an open replication of NOHARM’s evaluation methodology using synthetically generated cases. Our contribution is a multi-stage generation pipeline that uses language models grounded in clinical practice guidelines to produce 100 cases across ten specialties, each containing roughly 50 plausible next-step actions labeled as Appropriate or Inappropriate using NOHARM’s scoring framework. The pipeline includes structural quality controls: 83 of 100 cases pass all five automated checks, and answer-leaking language appears in only 0.06% of actions. In a pilot evaluation of nine contemporary LLMs using this synthetic benchmark, we observe patterns consistent with NOHARM’s findings on human-curated cases, including that omissions dominate error volume while commissions dominate severe errors. We release all cases, rubrics, generation tooling, and scoring code openly, removing the human-bottleneck barrier to action-level clinical LLM evaluation.
Response Content Units: Evaluating Completeness and Proactiveness in Medical Open-Response Question Answering
Yongsin Park | Wen-wai Yim | Emma McKibbin | Asma Ben Abacha | Fei Xia
Remote clinical care has significantly increased the workload for healthcare professionals managing digital inquiries. While automated systems aim to alleviate this burden, consumer health questions present unique challenges due to their linguistic complexity and the need for proactive clinical guidance, which traditional question-answering models often overlook. We introduce the medical Response Content Units (RCU) schema, a framework that facilitates automatic analysis to identify question-answer completeness and critical answer subparts, which can then be used as tools for supporting clinician response or for automatic metric evaluation. Our analysis using this schema reveals a 16.4% gap in response completeness in professional replies and demonstrates that essential medical directives are provided 2.4 to 12.1 times as frequently as direct answers. We provide baseline results and publicly release our annotations and source code to offer an evaluation framework that is more closely aligned with real-world clinical requirements.
NanoFlux: Adversarial Dual-LLM Evaluation and Distillation for Multi-Domain Reasoning
Raviteja Anantha | Soheil Hor | Teodor Nicola Antoniu | Layne C Price
We present NanoFlux, a novel adversarial framework for generating targeted training data to improve LLM reasoning, where adversarially-generated datasets of ≤ 200 examples outperform conventional fine-tuning approaches. The framework employs a competitive dynamic between models alternating as Attacker and Defender, supervised by a tool-augmented Judge, synthesizing multi-step questions with explanatory annotations. Fine-tuning a 4B-parameter model on NanoFlux-generated data yields performance gains across diverse domains compared to full-benchmark fine-tuning: +5.9% on mathematical reasoning, +3.6% on scientific reasoning, and +16.6% on medical reasoning, while reducing computational requirements by 3-14×. Ablation studies reveal a non-monotonic relationship between dataset characteristics and model performance, uncovering domain-specific optimal points for question complexity and reasoning quality. NanoFlux automates training data generation through embedding-based novelty filtering, tool-augmented evaluation, and multi-hop reasoning, pointing to the value of small, targeted training datasets.
Evaluating the Reliability of LLMs in Faithfully Updating Text: An Empirical Study
Ayan Datta | Paheli Bhattacharya | Rishabh Gupta
We provide a comprehensive review of the FRUIT (Faithfully Reflecting Updated Information in Text) task, which formalizes the challenge of accurately updating textual information with large language models (LLMs). Our work begins with an in-depth analysis of the FRUIT dataset, revealing key structural insights. We also investigate the unsupervised capabilities of LLMs—such as zero-shot learning, chain-of-thought reasoning, self-reflection, and evidence ordering. Experimental results demonstrate that unsupervised approaches perform competitively with supervised methods in faithful text updating. Qualitative analysis shows that updates utilizing table-structured evidence outperform those based on unstructured text. We also discuss important limitations, including the need for new datasets and the risks of information leakage in this domain. These findings have significant implications for applications requiring precise document updates, such as software engineering, technical documentation, and legal document maintenance.
Not All Tokens Are Equal: Per-Dimension Top-K Pooling for Adversarially Robust BERT Classification
Manoranjan Dash | Shivam Anand Aralikatti | Shanay Sheth | Pranav Shinde
Contextual text classification with BERT typically relies on the [CLS] token representation for downstream prediction. While effective under standard conditions, [CLS]-based pooling is brittle under adversarial perturbation, as its single-vector representation is indiscriminately influenced by injected adversarial tokens. We propose Per-Dimension Top-K Average Pooling, a pooling strategy that, for each hidden dimension, selectively aggregates only the top-K token activations rather than the full sequence — effectively controlling which tokens contribute to the final representation. This token-level selectivity acts as a natural filter against adversarial injection: tokens that do not rank among the top-K for a given dimension are suppressed from aggregation. We evaluate our approach against CLS, Global Average Pooling (GAP), Global Max Pooling (GMP), and Hybrid variants across three text classification domains: spam detection (Enron and LingSpam), automated essay scoring (ASAP), and hate speech classification. On the Enron spam dataset under adversarial attack, our best Hybrid (K=3) variant reduces the Attack Success Rate from 70.65% to 37.07% while maintaining clean accuracy above 99%, compared to CLS which degrades to 63.64% adversarial accuracy. Representation-level analyses further corroborate these findings: Top-K pooling variants exhibit substantially lower cosine similarity shift under attack, and adversarially injected tokens enter the top-K selection in far fewer dimensions compared to CLS. These results suggest that per-dimension token selectivity offers a principled and lightweight mechanism for adversarial robustness in BERT-based classifiers without any modification to the underlying model architecture.
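The pooling rule described in the abstract above (for each hidden dimension, average only the top-K token activations) can be sketched in plain Python. This is a minimal sketch under stated assumptions: the function name `topk_average_pool` is hypothetical, "top-K" is taken to mean the K largest raw activation values, and the Hybrid variants mentioned in the abstract are not covered because the abstract does not specify them.

```python
def topk_average_pool(token_states, k):
    """Pool a sequence of token vectors into one vector.

    token_states -- list of token vectors, shape (seq_len, hidden_dim)
    k            -- number of largest activations to average per dimension
    Assumption: ranking uses raw activation values, not magnitudes.
    """
    if not token_states:
        raise ValueError("token_states must contain at least one token")
    k = min(k, len(token_states))  # guard against sequences shorter than k
    hidden_dim = len(token_states[0])

    pooled = []
    for d in range(hidden_dim):
        # Rank all tokens by their activation in dimension d, largest first.
        column = sorted((tok[d] for tok in token_states), reverse=True)
        # Only the top-k tokens contribute to this dimension.
        pooled.append(sum(column[:k]) / k)
    return pooled


# Three tokens with two hidden dimensions each.
states = [[1.0, 0.0], [3.0, -2.0], [2.0, 4.0]]
print(topk_average_pool(states, k=2))  # [2.5, 2.0]
print(topk_average_pool(states, k=1))  # [3.0, 4.0]
```

With `k=1` the rule reduces to global max pooling, and with `k` equal to the sequence length it reduces to global average pooling, so K interpolates between the two baselines named in the abstract.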
Near-Miss: Latent Policy Failure Detection in Agentic Workflows
Ella Rabinovich | David Boaz | Naama Zwerdling | Ateret Anaby Tavor
Agentic systems for business process automation often require compliance with policies governing conditional updates to the system state. Evaluation of policy adherence in LLM-based agentic workflows is typically performed by comparing the final system state against a predefined ground truth. While this approach detects explicit policy violations, it may overlook a more subtle class of issues in which agents bypass required policy checks, yet reach a correct outcome due to favorable circumstances. We refer to such cases as near-misses or latent failures. In this work, we introduce a novel metric for detecting latent policy failures in agent conversation traces. Building on the ToolGuard framework, which converts natural-language policies into executable guard code, our method analyzes agent trajectories to determine whether the agent’s tool-calling decisions were sufficiently informed. We evaluate our approach on the τ²-verified Airlines benchmark across several contemporary open and proprietary LLMs acting as agents. Our results show that latent failures occur in 8–17% of trajectories involving mutating tool calls, even when the final outcome matches the expected ground-truth state. These findings reveal a blind spot in current evaluation methodologies and highlight the need for metrics that assess not only final outcomes but also the decision process leading to them.
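One way to picture a latent failure is a mutating tool call issued before the read calls that a policy would need. The sketch below is a simplified assumption, not the paper's metric and not the ToolGuard API: the tool names, the `required_checks` mapping, and the function `latent_failures` are all hypothetical.

```python
def latent_failures(trajectory, required_checks):
    """Flag mutating calls made before their required read calls (illustrative).

    trajectory:      list of tool names in the order the agent called them.
    required_checks: dict mapping a mutating tool to the set of tools that
                     must have been called earlier (hypothetical policy).
    """
    seen = set()
    flagged = []
    for step, tool in enumerate(trajectory):
        missing = required_checks.get(tool, set()) - seen
        if missing:
            flagged.append((step, tool, sorted(missing)))
        seen.add(tool)
    return flagged

# Hypothetical policy: a reservation may only be cancelled after its details were read.
policy = {"cancel_reservation": {"get_reservation_details"}}

uninformed = ["get_user_details", "cancel_reservation"]
informed = ["get_user_details", "get_reservation_details", "cancel_reservation"]

near_miss = latent_failures(uninformed, policy)
clean = latent_failures(informed, policy)
```

The first trajectory may still end in the correct final state, which is why a state-comparison evaluation alone would not flag it.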
Evaluating Counterfactual Strategic Reasoning in Large Language Models
Dimitrios Georgousis | Maria Lymperaiou | Angeliki Dimitriou | Giorgos Filandrianos | Giorgos Stamou
We evaluate whether LLMs adapt their strategic behavior when familiar games are counterfactually modified. We introduce a repeated-game evaluation framework covering Prisoner’s Dilemma and Rock–Paper–Scissors under default, label-perturbed, payoff-perturbed, and joint counterfactual variants. This design separates surface robustness to renamed actions from deeper sensitivity to changed incentives. Across multiple frontier LLMs, we find that label perturbations usually cause moderate degradation, whereas payoff perturbations expose stronger failures: LLMs often preserve canonical strategies even when the equilibrium structure changes. In RPS, several LLMs remain close to uniform play despite a payoff-counterfactual equilibrium requiring a biased mixed strategy. Behavioral and efficiency metrics further show that stronger or reasoning-enabled LLMs are not uniformly more strategic: some deliberate more without adapting faster. Overall, counterfactual repeated games provide a compact diagnostic for distinguishing robust incentive-sensitive behavior from brittle template-based strategic execution.
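To see why a payoff change can demand a biased mixed strategy in Rock–Paper–Scissors, the following sketch solves for the equilibrium of a symmetric zero-sum game. The specific perturbation (a Rock win pays 2) is a hypothetical example, not taken from the paper, and the solver assumes a skew-symmetric payoff matrix with a full-support equilibrium.

```python
import numpy as np

def interior_equilibrium(payoff):
    """Mixed strategy p with p @ payoff = 0 and sum(p) = 1.

    Assumes a skew-symmetric payoff matrix (symmetric zero-sum game, value 0)
    whose equilibrium puts positive probability on every action.
    """
    payoff = np.asarray(payoff, dtype=float)
    n = payoff.shape[0]
    # Indifference conditions (one per opponent action) plus the normalisation row.
    system = np.vstack([payoff.T, np.ones((1, n))])
    target = np.concatenate([np.zeros(n), [1.0]])
    p, *_ = np.linalg.lstsq(system, target, rcond=None)
    return p

# Row player's payoff; rows and columns are ordered Rock, Paper, Scissors.
standard = [[0, -1, 1],
            [1, 0, -1],
            [-1, 1, 0]]
# Hypothetical counterfactual: Rock beating Scissors is worth 2 instead of 1.
perturbed = [[0, -1, 2],
             [1, 0, -1],
             [-2, 1, 0]]

p_standard = interior_equilibrium(standard)
p_perturbed = interior_equilibrium(perturbed)
```

Under this perturbation the equilibrium shifts from uniform play to (Rock 0.25, Paper 0.5, Scissors 0.25), so an agent that keeps playing uniformly is no longer at equilibrium.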
Speculative Refinement: A Hybrid Autoregressive Diffusion Decoding Strategy and Its Behavior Across Benchmarks
Aditi Gupta | Neel Mishra | Kushagra Trivedi | Pawan Kumar
How should we evaluate generation systems that combine autoregressive (AR) and diffusion decoding? We study this question through *Speculative Refinement* (SpecRef), a training-free hybrid method that warm-starts a masked diffusion language model from an AR draft using entropy-guided selective masking. Evaluating SpecRef across six benchmarks (HumanEval, MBPP, GSM8K, BBH, ARC-Challenge, HellaSwag) with three distinct evaluation protocols (execution-based pass@1, exact-match, log-likelihood scoring), we surface several findings relevant beyond our specific system: (1) code benchmarks conflate structural discovery with logical correctness: providing a syntactic scaffold lifts accuracy from near zero to over 20% without changing the model, indicating that much of the baseline failure is structural; (2) a *refinement tension* phenomenon where multi-stage correction degrades already-correct tokens, exposing benchmark saturation ceilings invisible to single-model evaluation; (3) log-likelihood and generative evaluation produce different model rankings for the same model pair, suggesting they measure different capabilities; (4) standard Python post-processing silently breaks code evaluation for non-AR generators. These observations apply to any multi-stage or non-autoregressive generation pipeline and point toward more diagnostic evaluation practices.
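A rough sketch of what entropy-guided selective masking could look like is given below. It assumes that the AR draft exposes a per-position probability distribution and that the highest-entropy positions are the ones re-masked for the diffusion model; the masking ratio, the `[MASK]` token, and the function names are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def token_entropy(probs):
    """Shannon entropy (in nats) of each row of a (seq_len, vocab) array."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return -(probs * np.log(probs)).sum(axis=-1)

def entropy_guided_mask(draft_tokens, draft_probs, mask_ratio=0.5, mask_token="[MASK]"):
    """Replace the most uncertain draft tokens with a mask token (illustrative)."""
    entropy = token_entropy(draft_probs)
    n_mask = int(round(mask_ratio * len(draft_tokens)))
    # Positions sorted from highest to lowest entropy; keep the first n_mask.
    to_mask = set(np.argsort(-entropy)[:n_mask].tolist())
    return [mask_token if i in to_mask else tok for i, tok in enumerate(draft_tokens)]

# Toy AR draft over a 3-word vocabulary (hypothetical distributions).
draft = ["def", "add", "(", "a"]
probs = [
    [0.98, 0.01, 0.01],   # confident
    [0.40, 0.30, 0.30],   # uncertain
    [0.97, 0.02, 0.01],   # confident
    [0.50, 0.25, 0.25],   # uncertain
]
warm_start = entropy_guided_mask(draft, probs, mask_ratio=0.5)
```

Here the two confident positions are kept as the warm start and the two uncertain ones are left for the diffusion model to fill in.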
SAUCE: Summary Analysis Using Conversation Entailment
Man-Ling Sung | Hemanth Kandula | Jeff Ma | William Hartmann | Matthew Snover
With the growing need for evaluating Large Language Models (LLMs) and their applications to speech, challenges persist in summarizing and evaluating conversations that lack a clear end goal. We introduce SAUCE – a reference-free, fact-based evaluation pipeline for cross-lingual conversational speech summarization. It measures the accuracy and the fact coverage of a summary through the entailment between conversation and text. We compare SAUCE against several popular summarization metrics and demonstrate its effectiveness at capturing information loss due to transcription and translation errors and at identifying broken summaries. Crucially, unlike black-box LLM evaluators or dense embedding metrics, SAUCE is inherently explainable: it maps summary scores to discrete, verifiable facts, allowing users to pinpoint exact hallucinations or omissions. We illustrate how this interpretability helps developers systematically profile LLM behaviors and gives end-users an actionable tool to verify summary accuracy in noisy, real-world conditions. Preliminary investigations show that SAUCE aligns strongly with human judgment.
Evaluating ASR Quality at Scale on TV Entertainment Platforms
Adeep Hande | Kishorekumar Sundararajan | Yidnekachew Endale | Akshatha Bapu KrishnaSwamy | Sachin Dabral | Dawn Reed | Michael Pereira
Fine-Tuning vs. RAG for Multi-Hop Question Answering with Novel Knowledge
Zhuoyi Yang | Yurun Song | Kyler G. Harris | Iftekhar Ahmed | Ian Harris
Multi-hop question answering is widely used to evaluate the reasoning capabilities of large language models (LLMs), as it requires integrating multiple pieces of supporting knowledge to arrive at a correct answer. While prior work has compared fine-tuning and retrieval-augmented generation (RAG) for factual recall and single-hop question answering, it remains unclear how these approaches perform in multi-hop settings that require compositional reasoning over temporally novel knowledge. In particular, prior comparisons often do not control for model scale, evaluation format, or knowledge freshness, making it difficult to isolate the effect of knowledge injection mechanisms. In this paper, we systematically compare parametric and non-parametric knowledge injection methods for open-domain multi-hop question answering. We evaluate unsupervised fine-tuning (continual pretraining), supervised fine-tuning, and retrieval-augmented generation across three 7B-parameter open-source LLMs. Experiments are conducted on two benchmarks: Question Answering Science Challenge (QASC), a standard multi-hop science question answering dataset, and a newly constructed dataset of over 10,000 multi-hop questions derived from Wikipedia events in 2024, which is designed to test knowledge beyond the models’ pretraining cutoff. Our results show that unsupervised fine-tuning provides only limited gains over base models, suggesting that continual pretraining alone is insufficient for improving multi-hop reasoning accuracy. In contrast, RAG yields substantial and consistent improvements, particularly when answering questions that rely on temporally novel information. Supervised fine-tuning achieves the highest overall accuracy across models and datasets. These findings highlight fundamental differences in how knowledge injection mechanisms support multi-hop question answering and underscore the importance of retrieval-based methods when external or compositional knowledge is required.
MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models
Weixin Liu | Congning Ni | Shelagh A. Mulvaney | Susannah L. Rose | Murat Kantarcioglu | Bradley A. Malin | Zhijun Yin
Large language models (LLMs) are increasingly used in the mental health domain, yet it remains unclear how well they capture related biomedical knowledge and how reliably they apply it to clinically salient structured judgments. Here, we present a knowledge-graph (KG)-grounded benchmark for assessing LLMs on mental-health entity recognition, relation judgment, and two-hop reasoning. The benchmark is derived from PrimeKG and comprises nine task families with KG-supported answers and controlled negative options. Experiments across 15 closed- and open-source LLMs reveal a persistent recognition-to-judgment gap: leading models achieve near-ceiling performance on entity typing and on the small relation-typing subset, yet they still struggle with relation prediction and two-hop reasoning. Additionally, short KG-derived snippets benefit some models but degrade performance for others. Moreover, output-format reliability can substantially influence measured performance under constrained multiple-choice settings, highlighting the critical role of response validity in benchmark-based evaluation. MHGraphBench should therefore be interpreted as evaluating agreement with a curated mental-health slice of PrimeKG under a constrained multiple-choice interface, rather than as a direct assessment of real-world clinical safety.
A Progressive Evaluation Framework for Multicultural Analysis of Story Visualization
Janak Kapuriya | Ali Hatami | Paul Buitelaar
Recent advancements in text-to-image generative models have improved narrative consistency in story visualization. However, current story visualization models often overlook cultural dimensions, resulting in visuals that lack cultural fidelity. In this study, we present a progressive evaluation framework for story visualization. We validate this framework on current text-to-image models across three languages (English, Hindi, and Chinese) on two datasets (VIST and FlintstonesSV). The proposed framework introduces three levels of cultural analysis as evaluation rubrics: 1) Basic Cultural Criteria, 2) Cultural Dimension Guidance, and 3) Cultural Examples Grounding. We evaluate story visualization using a novel MLLM-as-Jury approach across all three rubrics and a small-scale human evaluation on the third rubric only. We implement the MLLM-as-Jury approach by aggregating scores from three different families of MLLM-as-Judge models. In our experiments, real-world stories generally receive higher cultural appropriateness scores than animated ones, with English tending to score higher than Hindi and Chinese across the evaluated models. Some examples also exhibited culturally inconsistent or stereotypical elements noted by annotators. The proposed progressive evaluation framework thus provides early insights into cultural misalignments in story visualization. Code for this work is available at https://github.com/janak11111/Cultural_Eval_For_StoryViz
Is GraphRAG Needed? From Basic RAG to Graph-/Agentic Solutions with Context Optimization
Long Chen | Ryan Razkenari | Yuxuan Zhou | Yuan Tian | Rahul Ghosh | Venkatesh Pappakrishnan | Disha Ahuja | Vidya Sagar Ravipati
As advanced RAG variants like GraphRAG and Agentic RAG emerge, one leading question is when and how to use them. Here, we introduce a framework for evaluating and comparing RAG scenarios on semi-structured knowledge bases, including regular RAG, GraphRAG, Modular RAG and Agentic RAG. We provide implementations of 9 standardized RAG scenarios and conduct experiments for a comprehensive comparison. These scenarios are designed for real use cases with data and domain restrictions, spanning from simple document-based retrieval to advanced features such as hybrid text-graph retrieval, integration with computed or pre-defined domain knowledge graphs, agentic multi-step planning, and agent-graph integration. In addition, we present a novel context engineering method for GraphRAG and Agentic RAG that addresses context/memory overflow issues and efficiently manages text and graph retrievals with new representations and agentic loop design, leading to a 19%–53% reduction in token usage. Moreover, further analysis identifies a retrieval-generation gap where expanded retrieval does not proportionally improve generation quality, suggesting retrieval-oriented metrics overstate advanced retrieval benefits. This work provides data-driven insights on when and how to use these RAG variants for building production-ready intelligent RAG systems.
Cross-Domain Semantic Fidelity Evaluation for Meaning-to-Text Generation
Davan Harrison | Marilyn Walker
Slot Error Rate (SER) is the standard metric for evaluating semantic accuracy in meaning-to-text generation, but computing it has historically required domain-specific scripts that do not generalize across datasets. We present a cross-domain SER evaluation framework that replaces hand-crafted rules with a learned slot extraction model. We adapt Llama-3.2-3B-Instruct with LoRA, updating only 0.34% of its parameters, and show that this small adapted model outperforms prompted frontier LLMs by a wide margin on structured extraction across 23 dialogue domains. We further apply overgenerate-and-rank to the extraction task itself, generating multiple candidate meaning representations and selecting the best one with a trained ranker, which improves SER-Accuracy from 75% to 88%. We combine the extraction model with a Natural Language Inference (NLI) verification baseline through learned per-example routing, achieving 90.0% accuracy on held-out evaluation pairs without any domain-specific rule engineering. We compare our framework against published rule-based SER tools and show that our learned approach matches or outperforms hand-crafted scripts on all six comparable domains.
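For readers unfamiliar with the metric, the sketch below computes SER under one common formulation: missing, incorrect, and added slots divided by the number of slots in the input meaning representation. The exact error categories used by the authors are not stated in the abstract, and the slot names and values here are hypothetical; in the paper's setting the `extracted` dictionary would come from the learned slot extraction model.

```python
def slot_error_rate(reference, extracted):
    """SER = (missing + incorrect + added) / number of reference slots.

    reference: slot -> value dict of the input meaning representation.
    extracted: slot -> value dict recovered from the generated text.
    """
    missing = [s for s in reference if s not in extracted]
    incorrect = [s for s in reference if s in extracted and extracted[s] != reference[s]]
    added = [s for s in extracted if s not in reference]
    errors = len(missing) + len(incorrect) + len(added)
    return errors / max(len(reference), 1)

# Hypothetical restaurant-domain example.
reference = {"name": "Aromi", "food": "Italian", "area": "riverside", "priceRange": "cheap"}
extracted = {"name": "Aromi", "food": "French", "priceRange": "cheap"}  # wrong food, no area

ser = slot_error_rate(reference, extracted)
```

In this example one slot is wrong and one is missing out of four, giving an SER of 0.5.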
E-star 12B: Reliable Rubric-Following and Domain-Adaptive SLM Evaluator for Korean Industrial Settings
Yonghoon Kwon | Heondeuk Lee | Barom Kang
Automatic evaluation in industrial settings requires models to interpret and apply natural language rubrics reliably under language and domain shift. This challenge is compounded when reference answers are unavailable and proprietary models cannot be deployed due to data-governance constraints. We present E-Star-12B, a 12B-parameter evaluator for Korean industrial environments that jointly addresses rubric following and domain adaptation. Our approach combines a structured evaluation format—feedback, highlight, and decision—with a 6K high-confidence training set via multi-stage consensus-based filtering. We introduce two benchmarks: Ko Feedback Bench for rubric-following evaluation under Korean language transfer, and RAG Quality Bench for domain-specific evaluation in financial and legal settings. E-Star-12B achieves the strongest rubric alignment among small language models on Ko Feedback Bench, improving Pearson correlation by +0.173 over its base model. On RAG Quality Bench, the domain-adapted variant approaches frontier-model performance with more stable adaptation than general instruct models. Strong rubric-following capability serves as a reliable scaffold for subsequent domain adaptation.
Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations
Sachin Kumar
Linear probes trained on internal activations of Large Language Models (LLMs) are increasingly proposed as evaluation metrics for deceptive generation: automated monitors that score whether a model’s output was produced deceptively, without requiring ground-truth labels or human annotation. Yet these metrics report AUROC scores exceeding 0.96 on clean benchmarks while demonstrating profound fragility under distributional shift. This paper presents a systematic pressure-test of such probe-based evaluation metrics across the Gemma 3 model family (1B–27B parameters), diagnosing why they fail rather than merely documenting that they fail. We investigate four competing hypotheses about how deception is encoded: as (1) a single linear direction, (2) a multi-dimensional subspace, (3) a convex conic hull, or (4) a proxy for computational entropy. Our experimental design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and systematic distractor evaluations across 8 stylistic shifts. Across all four model scales, we find that: (a) probe-based metrics achieve near-perfect AUROC (≥0.998) on clean data but collapse under stylistic shifts when trained without stylistic augmentation, whereas style-augmented probes recover near-perfect detection (mean AUROC 0.979–0.983) even on unseen styles; (b) the single-direction hypothesis is decisively rejected (k=1 captures only 0.61–0.80 AUROC of the signal), with cross-domain transfer failure confirmed as geometric rather than layer-mismatch-driven; (c) the entropy-proxy hypothesis is rejected (maximum |𝜌|=0.454, maximum 𝛥AUROC after residualization=0.004); and (d) deception does not form a statistically significant linear subspace even within individual domains (per-domain k*=0), yet multi-dimensional probes (k≥5) consistently recover the signal through distributed sub-threshold features.
These findings demonstrate that probe fragility under standard training reflects distributional narrowness rather than a fundamental architectural limitation: style-augmented probes recover near-perfect detection (mean AUROC 0.979–0.983 on unseen styles) at both the 4B and 27B scales, establishing that the inverse scaling pattern observed under standard training is a training-distribution artifact rather than a genuine scale-dependent phenomenon.
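As background for the single-direction (k=1) setting, the sketch below fits a one-direction probe and scores it with AUROC. Everything in it is an assumption made for illustration: the probe is a difference-of-class-means direction (the abstract does not say how the probes are trained), and the "activations" are synthetic Gaussians with a planted signal, not activations from any model.

```python
import numpy as np

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def fit_mean_difference_probe(acts, labels):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    labels = np.asarray(labels, dtype=bool)
    direction = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    return direction / np.linalg.norm(direction)

# Synthetic stand-in for activations: 16 dimensions, signal planted on axis 0.
rng = np.random.default_rng(0)
n, d = 400, 16
labels = np.arange(n) % 2 == 1
acts = rng.normal(size=(n, d))
acts[labels, 0] += 3.0

half = n // 2
direction = fit_mean_difference_probe(acts[:half], labels[:half])   # fit on first half
held_out_auroc = auroc(acts[half:] @ direction, labels[half:])      # score second half
```

A cross-domain or stylistic-shift test in this toy setup would amount to scoring the same `direction` on data drawn from a different distribution than the one it was fitted on.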
Sycophancy Negatively Affects LLM-as-a-Judge in Conflict Evaluation
Naghmeh Farzi | Laura Dietz | Samuel Carton
LLM-as-Judge systems are increasingly used to generate labels and evaluate conversational data, yet their susceptibility to narrative framing remains underexplored. We study whether replacing one speaker’s username with the first-person identifier ‘Me’ systematically biases model judgments independent of the underlying evidence. Using the Conversations Gone Awry corpus, we evaluate four LLMs across three judgment tasks (attack detection, attacker identification, and blame attribution), three perspective conditions, and two evidence visibility settings. Our results show that narrative perspective induces strong, task-dependent distortions, particularly in more subjective judgment tasks. We find that models systematically favor the narrator when a speaker is presented as ‘Me’, reducing blame and responsibility attribution toward that speaker even when the underlying evidence is unchanged. These findings raise concerns about using LLMs to judge or moderate first-person conversational data.
Concord: An Agreement-Aware Multi-Adjudication Pipeline for LLM Evaluation
Tyler Bliss | Mahit Verma | Aila Iyer-Singh | Subrata Biswas | Sheikh Asif Imran | Bashima Islam
Evaluating multimodal generations is challenging: human evaluation is costly, and single-model LLM-as-a-judge pipelines can be brittle and provide limited uncertainty signals. We introduce Concord, an ensemble-based evaluation pipeline that aggregates discrete judgments from multiple LLM judges and uses inter-judge agreement as a practical uncertainty signal for disagreement-driven triage. We evaluate Concord on AVSSD and SCORE-AVS, a ground-truth-supervised audio-visual benchmark with discrete labels (True/False or 0–5). Concord improves agreement with human judgments over single-judge and naive aggregation baselines, and prioritizing low-agreement instances focuses human review on the most ambiguous cases. We use locally hosted open-source judges and also report binary-label results for the larger online models GPT4.o mini turbo and Gemini 3.1 Flash Lite.
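The idea of using inter-judge agreement as a triage signal can be sketched as follows. Plain majority voting is used only as a stand-in, since the abstract does not specify Concord's aggregation rule; the function names, the 0.8 threshold, and the example judgments are hypothetical.

```python
from collections import Counter

def aggregate_judgments(labels):
    """Return (majority label, fraction of judges that chose it)."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

def triage(judgments, threshold=0.8):
    """Item ids whose agreement falls below the threshold, least agreed first."""
    scored = [(aggregate_judgments(labels)[1], item) for item, labels in judgments.items()]
    return [item for agreement, item in sorted(scored) if agreement < threshold]

# Hypothetical binary judgments from four judges on three items.
judgments = {
    "clip-1": [True, True, True, True],
    "clip-2": [True, False, True, False],
    "clip-3": [False, False, True, False],
}
review_queue = triage(judgments)
```

Items with unanimous judges are accepted automatically, while split decisions are routed to human review in order of disagreement.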
The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods
Sanket Badhe | Priyanka Tiwari | Deep Shah
Large Language Models are increasingly used as zero-shot classifiers in complex reasoning tasks. However, standard constrained decoding suffers from a phenomenon we define as Renormalization Bias. When a model is restricted to a small set of target labels, the standard softmax operation discards the probability mass assigned to semantic synonyms in the original distribution. This loss of information, which we call the Silent Vote, results in artificial overconfidence and poor calibration. We propose Semantic Softmax, an inference-time layer that recovers this lost information by aggregating the scores of the semantic neighborhood surrounding each target label. We evaluate this approach on Qwen-3 and Phi-4-mini models using GoEmotions and Civil Comments datasets. Our results demonstrate consistent improvements across all evaluation metrics: Semantic Softmax substantially reduces Expected Calibration Error (ECE) and Brier Score, while simultaneously enhancing discriminative performance in terms of AUROC and Macro-F1. By accounting for linguistic nuances, our method provides a more calibrated and accurate alternative for zero-shot classification.
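The contrast between renormalizing over target labels only and pooling each label's semantic neighborhood can be illustrated with a toy example. Log-sum-exp pooling is an assumption here (the abstract only says the scores are aggregated), and the logits, labels, and neighborhoods are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a dict of label -> score."""
    top = max(scores.values())
    exps = {k: math.exp(v - top) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def constrained_softmax(logits, labels):
    """Standard constrained decoding: renormalize over the target labels only."""
    return softmax({label: logits[label] for label in labels})

def semantic_softmax(logits, neighborhoods):
    """Pool each label's neighborhood with log-sum-exp, then renormalize."""
    pooled = {}
    for label, words in neighborhoods.items():
        top = max(logits[w] for w in words)
        pooled[label] = top + math.log(sum(math.exp(logits[w] - top) for w in words))
    return softmax(pooled)

# Hypothetical next-token logits: "joy" and "anger" tie, but joy's synonyms carry more mass.
logits = {"joy": 2.0, "happy": 2.0, "delighted": 1.0, "anger": 2.0, "furious": -1.0}
neighborhoods = {"joy": ["joy", "happy", "delighted"], "anger": ["anger", "furious"]}

p_constrained = constrained_softmax(logits, ["joy", "anger"])
p_semantic = semantic_softmax(logits, neighborhoods)
```

Constrained decoding sees a 50/50 tie because the mass on "happy" and "delighted" is discarded, whereas the pooled version assigns roughly 0.69 to "joy".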
Are LLM Benchmarks Already Contaminated? A Systematic Review of Contamination Detection Methods
Erfan Nourbakhsh | Mohammad Sadegh Sirjani | Amir Mousavi | Khoa Nguyen | John Quarles | Mimi Xie | Rocky Slavin
Large Language Models (LLMs) are trained on web-scale corpora, increasing the risk that benchmark test data appears in training sets and inflates reported performance. We present a systematic literature review of 55 studies on LLM benchmark contamination through late 2025. Our contributions are: (1) a four-tier contamination taxonomy (Exact, Syntactic, Semantic, Task-Level; T1–T4); (2) a comparative analysis of five detection families (string-matching, likelihood-based, membership inference, LLM-prompted detection, and benchmark auditing), including access assumptions and failure modes; (3) a synthesis of contamination evidence on MMLU, GSM8K, HumanEval, and HellaSwag by measurement construct; (4) a comparative evaluation of mitigation strategies across lifecycle points, access assumptions, and evidence maturity; and (5) a Contamination Transparency Card (CTC) framework for future releases. Across studies, no detection method is consistently reliable across contamination tiers, model-access settings, and training stages. We identify instruction tuning as a persistent blind spot, note that RL/post-training contamination auditing is only beginning to mature, and report inflation estimates spanning roughly 6%–40% under benchmark- and setting-dependent assumptions.
Language models (LMs) are known to be prone to response biases, which present as option preference biases in fixed-response questions. It is therefore imperative to develop low-cost and effective response bias correction methods to improve LM performance and enable more accurate evaluations of model abilities. Here, we propose a simple response bias correction strategy, RBCorr, and test it on 12 open-weight language models using yes-no, entailment, and multiple choice questions. We show that response bias is prevalent in LMs pre-correction and that RBCorr effectively eliminates bias and boosts model performance. We also explore the generalizability of bias behavior across models, datasets, and prompt formats, showing that LogProbs-based correction is highly dependent on all three of these aspects. Overall, RBCorr is an easy-to-use method that can boost the performance of smaller LMs and ensure that LM performance on closed-response benchmarks aligns more closely with their true capabilities.
Recent studies have highlighted that Large Language Models (LLMs) often exhibit limited coherence, that is, the ability to produce consistent responses to semantically equivalent questions. While most prior research has focused exclusively on English, limited investigation has been conducted on other languages. In this work, we study the coherence of LLMs on Question Answering tasks across six languages: English, Italian, German, Chinese, Japanese, and Vietnamese. We evaluate models of varying sizes, ranging from 3.8B to 235B parameters, to examine how coherence scales with model capacity and how it relates to language. Our findings reveal that (i) coherence is not uniquely related to model size and accuracy and (ii) for some models, coherence varies significantly between languages.
Token Cost Inequality: Measuring Tokenization Disparities Across Scripts in Roman Urdu and Urdu
Waleed Jamil | Saima Rafi | Yanchao Yu
Tokenization is central to modern language models, yet its effects on cross-script efficiency, input cost, and truncation behavior remain underexplored. We study this issue through aligned comparisons of Urdu and Roman Urdu, asking whether semantically equivalent content incurs systematically different tokenization costs across scripts. We introduce Token Cost Inequality (TCI), a metric for quantifying relative tokenization efficiency under semantic alignment, and propose a multi-axis framework spanning token cost, fragmentation, and fixed-budget retention. Across three tokenizer families (cl100k, mT5, and ByT5), we find that tokenization disparities are strongly tokenizer-dependent, with substantial differences in token cost and segmentation behavior across scripts. We further identify an efficiency-retention paradox: token cost alone does not fully explain truncation behavior. Under fixed token budgets, Roman Urdu preserves more character-level content than native Urdu, reflecting differences in character-per-token density and fragmentation. Lightweight normalization yields minimal gains, suggesting that the observed disparities arise primarily from tokenizer design rather than superficial orthographic variation. These findings provide controlled evidence that fixed token budgets can produce unequal surface-coverage conditions across scripts, with implications for input-side cost estimation, benchmark design, and multilingual evaluation under constrained token budgets.
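The two quantities discussed here, relative token cost and fixed-budget retention, can be illustrated with a byte-level tokenizer, which is close in spirit to ByT5's byte vocabulary (special tokens are ignored). The ratio below is only a plausible reading of a cost-inequality measure, since the abstract does not give the TCI formula, and the word pair is a hand-picked example rather than data from the paper.

```python
def byte_tokens(text):
    """Byte-level tokenization: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))

def cost_ratio(text_a, text_b):
    """Tokens needed for text_a relative to its aligned counterpart text_b."""
    return len(byte_tokens(text_a)) / len(byte_tokens(text_b))

def chars_retained(text, budget):
    """Number of complete characters that survive truncation to `budget` tokens."""
    kept = bytes(byte_tokens(text)[:budget])
    return len(kept.decode("utf-8", errors="ignore"))

# "shukriya" (thank you) in Urdu script, written with escapes, and in Roman Urdu.
urdu = "\u0634\u06a9\u0631\u06cc\u06c1"
roman_urdu = "shukriya"

ratio = cost_ratio(urdu, roman_urdu)
urdu_kept = chars_retained(urdu, budget=6)
roman_kept = chars_retained(roman_urdu, budget=6)
```

With this tokenizer the Urdu form costs 10 tokens against 8 for Roman Urdu (ratio 1.25), and a 6-token budget keeps 3 Urdu characters but 6 Roman Urdu characters. Subword tokenizers such as cl100k or mT5 would give different numbers.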
Semantic vs. Structural Signals: Log-Probability and LLM-as-a-Judge for Reference-Free Code Evaluation
Dmitriy Fedrushkov | Yulong He | Ivan Smirnov | Artem Aliev | Sergey Kovalchuk
Reference-free evaluation of LLM-generated code is essential when execution-based testing is unavailable or costly. We compare two paradigms: explicit LLM-as-a-Judge scoring, which assigns a quality score to a solution, and log-probability scoring, which uses log P𝜃(code ∣ task) as an instruction-free signal. Across HumanEval-X, we find that the two approaches capture qualitatively different aspects of code correctness. Explicit judges, particularly larger models, perform strongly on generated code, reflecting their ability to reason about task-solution alignment, but fail to distinguish correct solutions from minimally mutated ones. Log-probability exhibits the opposite pattern: weaker performance on generated code, but consistent pairwise separation of canonical from mutated solutions. These results reveal a discrimination-ranking dissociation and show that the two paradigms provide complementary, non-interchangeable signals: explicit judges capture semantic correctness, while log-probability captures local structural consistency.
Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges
Srimonti Dutta | Akshata Kishore Moharir
LLM-as-judge evaluation is widely used in benchmarking pipelines, where model outputs are compared and ranked using automated evaluators. These pipelines typically assume that judgments are stable properties of fixed inputs. We show that this assumption does not hold under interaction. We study post-decision manipulability: the extent to which an evaluation outcome can be altered through subsequent conversation with the judge after an initial decision has been made. Across controlled experiments on MT-Bench and AlpacaEval, we find that LLM judges are highly stable under repeated and neutral reevaluation, yet become substantially reversible under targeted post-decision challenge. An anti-baseline challenge protocol shows that stable judgments can be overturned through motivated interaction, while a counterbalanced target-validation protocol separates this reversibility from net target-directed steering. These reversals have practical consequences: they can degrade agreement with human preferences, shift benchmark rankings, and produce harmful evaluation changes despite high self-reported confidence. Authority framing is especially destabilizing, and revised judgments are often accompanied by low-overlap justifications, suggesting post hoc rationalization rather than reliable error correction. We introduce the Evaluation Robustness Score (ERS) to quantify interactional robustness by combining reversal susceptibility with counterbalanced directional effects. Our findings identify post-decision interaction as a distinct failure mode for LLM-as-judge evaluation and motivate evaluation protocols that measure not only static agreement, but robustness under challenge.
Permutation-Consensus Listwise Judging for Robust Factuality Evaluation
Tianyi Huang | Nathan Huang | Justin Tang | Wenqian Chen | Elsa Fan
Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing substantially in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, the final seven-permutation aggregate (K=7) improves top-1 selection accuracy from 86.00% to 91.33% with GPT-5.4 and from 86.33% to 89.67% with Claude Sonnet 4.6. These results suggest that candidate order can be a meaningful source of factuality-judging error and that marginalizing over this nuisance variation can improve the reliability of LLM evaluation.
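The effect of marginalizing over candidate order can be shown with a toy judge that has a built-in position bias. This is a sketch of the general idea only: `toy_judge`, its first-slot bonus, the quality values, and the use of cyclic rotations instead of seven sampled permutations are all assumptions, and mean-score aggregation stands in for the paper's combination of scores, ranks, and uncertainty signals.

```python
from statistics import mean

def toy_judge(ordering, quality, first_slot_bonus=0.3):
    """Hypothetical stand-in for an LLM judge: true quality plus a first-position bonus."""
    return {cand: quality[cand] + (first_slot_bonus if pos == 0 else 0.0)
            for pos, cand in enumerate(ordering)}

def rotations(items):
    """All cyclic rotations, so every candidate occupies every position once."""
    items = list(items)
    return [items[i:] + items[:i] for i in range(len(items))]

def consensus_pick(candidates, judge):
    """Average each candidate's score over all orderings and pick the best."""
    scores = {cand: [] for cand in candidates}
    for ordering in rotations(candidates):
        for cand, score in judge(ordering).items():
            scores[cand].append(score)
    means = {cand: mean(vals) for cand, vals in scores.items()}
    return max(means, key=means.get), means

quality = {"A": 0.70, "B": 0.80, "C": 0.60, "D": 0.50}   # hypothetical factuality
candidates = ["A", "B", "C", "D"]

def judge(ordering):
    return toy_judge(ordering, quality)

single_scores = judge(candidates)
single_pick = max(single_scores, key=single_scores.get)
consensus, mean_scores = consensus_pick(candidates, judge)
```

A single pass in the order A, B, C, D favors A because of the position bonus, while averaging over rotations spreads the bonus evenly and recovers B.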
MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts
Jiayi He | Yangmin Huang | Qianyun Du | Xiangying Zhou | Zhiyang He | Jiaxue Hu | Xiaodong Tao | Lixian Lai
Deploying Large Language Models (LLMs) in medical applications requires rigorous fact-checking to ensure patient safety and regulatory compliance. We introduce **MedFact**, a challenging Chinese medical fact-checking benchmark with 2,116 expert-annotated instances from diverse real-world texts, spanning 13 specialties, 8 error types, 4 writing styles, and 5 difficulty levels. Construction uses a hybrid AI-human framework where iterative expert feedback refines AI-driven, multi-criteria filtering to ensure high quality and difficulty. We evaluate 20 leading LLMs on veracity classification and error localization, and results show that models can often determine whether text contains errors but struggle to localize them precisely, with top performers falling short of human performance. Our analysis reveals an "over-criticism" phenomenon, where models misidentify correct information as erroneous, a tendency that is aggravated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. MedFact highlights the challenges of deploying medical LLMs and provides resources to develop factually reliable medical AI systems.
Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate
Ali Keramati | Justin Cheok | Jacob Horne | Mark Warschauer
Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding, can predict reasoning quality as assessed by LLM-as-judge evaluation. Using a debate-based essay scoring framework, we compare confidence proxies against rubric-based judge scores across two ASAP essay sets. We find that early-token confidence, particularly within the first few generated tokens, is consistently the strongest predictor of reasoning quality, outperforming full-sequence statistics. Analysis of log-probability trajectories shows that the opening phase of generation is the most heterogeneous and therefore most informative. We also observe a systematic asymmetry between agent roles, with stronger alignment between confidence and quality for supportive reasoning than for adversarial critique. These results suggest that early decoding dynamics provide a lightweight and effective signal for estimating reasoning reliability in multi-agent LLM systems.
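As a rough illustration of the two kinds of confidence proxy being compared, the sketch below computes an early-token statistic and a full-sequence statistic from a list of token log-probabilities. The window size `k=5` and the use of a simple mean are assumptions made for the example; the abstract does not specify either.

```python
from statistics import mean
from typing import Sequence


def early_token_confidence(logprobs: Sequence[float], k: int = 5) -> float:
    """Mean log-probability of the first k generated tokens (k is an assumed window)."""
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    return mean(logprobs[:k])


def full_sequence_confidence(logprobs: Sequence[float]) -> float:
    """Mean log-probability over the whole generated sequence."""
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    return mean(logprobs)
```

Either value could then be correlated with rubric-based judge scores to see which proxy tracks quality more closely.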
Complex-IF and Beyond: Expert Rubrics for RLVR
Sushant Mehta | Liudas Panavas | Eleanor Fleming | Paul Mains | Edwin Chen
As LLM capabilities advance rapidly, the evaluation methods used to assess them increasingly lag behind. Traditional benchmarks rely on programmatic verification of narrow, surface-level constraints, yet real-world instruction following and agentic tasks demand assessment of nuanced, context-dependent behaviors that resist simple scripted checks. We present a systematic analysis of expert-curated rubric-based evaluation as an alternative paradigm, drawing on empirical evidence from two domains: complex instruction following and enterprise agentic tasks. We first articulate five design principles for constructing high-quality rubrics, including Maximum Viable Atomicity, intent-aware criterion design, and iterative LLM-judge calibration. To validate these principles, we introduce COMPLEX-IF, a new expert-curated instruction-following dataset in which each prompt is paired with 10–40 atomic rubric criteria. We demonstrate that these expert rubrics are not only better evaluation instruments but also highly effective training signals: training on approximately 1,000 COMPLEX-IF examples yields +15.5 pp improvement for a 4B-parameter model and +12.2 pp for a 235B-parameter model on instruction following, while single-epoch RL training on a rubric-graded enterprise environment produces gains that transfer to out-of-distribution benchmarks the model was never trained on (+4.5 pp BFCL, +7.4 pp τ²-Bench, +6.8 pp Toolathlon). Our findings establish that expert-authored rubrics improve both the measurement and the development of frontier LLM capabilities, serving as effective evaluation and RL training signals.
C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning
Avni Mittal | Rauno Arike
Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, yet it remains unclear whether they can reliably assess process faithfulness rather than merely answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that explicitly decomposes faithfulness into two complementary dimensions: causality (whether each step logically follows from prior context) and coverage (whether essential intermediate inferences are present). Using controlled perturbations, we construct examples with known causal error positions by replacing a single step with a logically inconsistent variant, and with controlled coverage deletions at varying rates, enabling direct measurement against reference labels. We evaluate three frontier LLM judges across three tasks: binary causal detection, causal step localization, and coverage scoring. Our results reveal that judge reliability is highly task-dependent, with no single model dominating across settings. While models often detect that an error exists, they struggle to accurately localize it, indicating a substantial gap between detection and attribution. Moreover, all judges systematically overestimate reasoning completeness, assigning high coverage scores even when substantial portions of intermediate reasoning are missing. These findings expose fundamental limitations of LLM judges in process-level evaluation and highlight the need for more reliable and calibrated methods when using LLMs to assess reasoning quality.
Evaluating Multilingual Sentiment Classifiers Using an LLM-Annotated Wikipedia Benchmark
Milena Stróżyna | Włodzimierz Lewoniewski | Izabela Czumałowska
We present a multilingual study of sentiment evaluation on Wikipedia articles from various topics in five languages (German, English, Spanish, Polish, and Russian). In this paper, we compare three large language models (Gemini Pro 3.1, Claude Opus 4.6, and GPT 5.2), each queried three times per sentence, with two popular multilingual sentiment classifiers. This setup allows us to analyze not only inter-model differences but also intra-model stability as a proxy for confidence. To support systematic evaluation, we construct a benchmark dataset based on strict consensus across evaluators and analyze sentiment distributions across topics and languages. We show substantial variation in sentiment distributions, agreement, and consistency across models and languages. Our results suggest that sentiment evaluation on encyclopedic text remains an underexplored challenge for multilingual NLP.
Process Standardisation for Human Evaluation of NLP System Outputs
Craig Thomson | Javier González Corbelle | Anya Belz
Human evaluation of NLP systems has high knowledge and effort thresholds. Researchers are often expected to design and run evaluations without formal training, while also creating the required resources from scratch. Recent work has started to address the knowledge threshold, but reusable tools that reduce effort remain limited. In this paper, we take a first step toward automated human-evaluation experiment creation by (i) surveying the processes and data resources used in a representative sample of current human evaluations in NLP, and (ii) deriving a canonical process model from these survey results, which (iii) provides a basis for standardised experiment design and automated toolkit development. The survey shows that recent human-evaluation practices are highly aligned in process structure, making reusable automation feasible.
Language Modeling for the Future of Finance: A Survey into Metrics, Tasks, and Data Opportunities
Nikita Tatarinov | Siddhant Sukhani | Agam Shah | Sudheer Chava
Recent advances in language modeling have led to a growing number of papers related to finance in top-tier Natural Language Processing (NLP) venues. To systematically examine this trend, we review 374 NLP research papers published between 2017 and 2024 across 38 conferences and workshops, with a focused analysis of 221 papers that directly address finance-related tasks. We evaluate these papers across 11 quantitative and qualitative dimensions, with particular attention to evaluation practices, metric choices, dataset coverage, and reproducibility in a high-stakes applied LM domain. Our study identifies the following opportunities for NLP researchers: (i) expanding the scope of forecasting tasks; (ii) enriching evaluation with finance-specific metrics; (iii) leveraging multilingual and crisis-period datasets for robustness-oriented evaluation; and (iv) balancing PLMs with efficient or interpretable alternatives. We identify actionable directions supported by dataset and tool recommendations, with implications for both academic evaluation practices and industry deployment.
Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval, a large-scale dataset of 7K real user instructions for single-turn constrained text generation, exhibiting diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, extracted from natural user instructions. We categorize these constraints into eight high-level classes to capture their distribution and co-occurrence dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. WildIFEval clearly differentiates between small and large models, and demonstrates that all models have room for improvement on such tasks. Our analysis reveals that as constraint count grows, models’ overall success drops sharply while per-constraint success remains stable, indicating a capacity bottleneck in juggling multiple constraints, and that models struggle more with rigid form-based constraints than with softer content-based ones. We release our dataset to promote further research on instruction-following under complex, realistic conditions.
EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments
Zefang Liu | Yinzhu Quan
We introduce EconWebArena, a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. The benchmark comprises 360 curated tasks from 82 authoritative websites spanning domains such as macroeconomics, labor, finance, trade, and public policy. Each task challenges agents to navigate live websites, interpret structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. We construct the benchmark by prompting multiple large language models (LLMs) to generate candidate tasks, followed by rigorous human curation to ensure clarity, feasibility, and source reliability. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and the need for grounded web-based economic reasoning. We evaluate a diverse set of state-of-the-art multimodal LLMs as web agents, analyze failure cases, and conduct ablation studies to assess the impact of visual grounding, plan-based reasoning, and interaction design. Our results reveal substantial performance gaps and highlight persistent challenges in grounding, navigation, and multimodal understanding, positioning EconWebArena as a rigorous testbed for economic web intelligence.
ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual–Language Models through Procedural Plans
Ananya Sadana | Yash Kumar Lal | Jiawei Zhou
Understanding causal relationships across modalities is a core challenge for multimodal models operating in real-world environments. We introduce ISO-Bench, a benchmark for evaluating whether models can infer causal dependencies between visual observations and procedural text. Each example presents an image of a task step and a text snippet from a plan, with the goal of deciding whether the visual step occurs before or after the referenced text step. Evaluation results on ten frontier vision-language models show underwhelming performance: the best zero-shot F1 is only 0.57, and chain-of-thought reasoning yields only modest gains (up to 0.62 F1), largely behind humans (0.98 F1). Our analysis further highlights concrete directions for improving causal understanding in multimodal models.
Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media
Yuefeng Shi | Nedjma Ousidhoum | Jose Camacho-Collados
LLMs have demonstrated exceptional proficiency in a wide range of NLP tasks. However, a notable gap remains in practical data analysis scenarios, particularly when LLMs are required to process long sequences of unstructured documents, such as news feeds or, as specifically addressed in this paper, social media posts. To empirically assess the effectiveness of LLMs in this setting, we introduce a question-based evaluation framework comprising 470 manually curated questions designed to evaluate LLMs’ semantic understanding and reasoning abilities over aggregated text data. We apply our benchmark to diverse Twitter datasets covering various NLP tasks, including sentiment analysis, hate speech detection, and emotion recognition. Our results reveal that performance depends heavily on input scale and the complexity of the data sources, declining noticeably in multi-label or target-dependent scenarios. In addition, as task complexity increases, performance drops progressively from basic semantic existence identification to more demanding operations such as comparison, counting, and calculation. Furthermore, as the input size grows beyond 500 instances, we identify a common limitation across LLMs, particularly open-weights models: performance degrades substantially, especially on numerical tasks. These findings highlight critical architectural bottlenecks in current LLMs for performing rigorous quantitative analysis over large text collections.
Teaching Values to Machines: Simulating Human-Like Behavior in LLMs
Asaf Yehudai | Naama Rozen | Ariel Gera
Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure. In this work, we draw on established psychological value theory to induce human-like values in LLMs and assess their alignment with patterns observed in human studies. Using validated psychological questionnaires, we conduct large-scale experiments – over 5 million questions – to evaluate value structures and value–behavior relationships in leading LLMs and compare them to humans. Our findings reveal strong agreement between value-prompted LLMs and humans across both dimensions. Moreover, incorporating human value distributions enhances population-level simulations with value-induced LLMs. These findings highlight the potential of value-induced LLMs as effective, psychologically grounded tools for simulating human behavior.
MetaGraph: A Large-Scale Meta-Analysis of GenAI in Financial NLP (2022–2025)
Paolo Pedinotti | Peter Baumann | Nathan Jessurun | Leslie Barrett | Enrico Santus
Financial NLP has evolved rapidly since late 2022, outpacing narrative surveys. We introduce MetaGraph, a methodology for extracting typed knowledge graphs from scientific corpora using ontology-guided LLM extraction to enable structured, large-scale trend analysis. Applied to 681 papers on GenAI in Finance (2022–2025), MetaGraph reveals three phases: early LLM-driven expansion of tasks and datasets, growing emphasis on limitations and risk, and a shift toward modular, system-oriented methods (e.g., retrieval-augmented designs). We release the resulting resource and artifacts to support reproducible meta-analysis and future monitoring of the field.
When Users Are Happy but Agents Are Wrong: Multi-Dimensional Evaluation of Tool-Augmented Dialogue
Tanya Shourya | Yingfan Wang | Zhaoyi Joey Hou | Shamik Roy | Vinayshekhar Bannihatti Kumar | Rashmi Gangadharaiah
Evaluating conversational AI systems that use external tools is challenging, as errors can arise from complex interactions among user, agent, and tools. While existing evaluation methods assess either user satisfaction or agents’ tool-calling capabilities, they fail to capture critical errors in multi-turn tool-augmented dialogues—such as when agents misinterpret tool results yet appear satisfactory to users. We introduce TRACE, a benchmark of systematically synthesized tool-augmented conversations covering diverse error cases. Evaluation with state-of-the-art conversation evaluation frameworks reveals that all approaches remain far from ideal performance, demonstrating the fundamental difficulty of this benchmark.
Tool-Aware Planning for Contact-Center Analytics: Evaluating LLMs through Lineage-Guided Query Decomposition
Varun Nathan | Shreyas Guha | Ayush Kumar
We present a domain-grounded benchmark and evaluation framework for tool-aware plan generation in contact-center analytics, where answering a business-insights query requires decomposing it into executable steps over structured tools (Text2SQL over Snowflake), unstructured tools (RAG over transcripts), and LLM-based synthesis, with explicit depends_on relations for safe parallel execution. Our contributions are threefold: (i) a reference-based plan evaluation framework with two complementary views—a metric-wise evaluator spanning seven dimensions (e.g., tool–prompt alignment, query adherence) and a one-shot evaluator that compares a candidate plan against a reference plan; (ii) a lineage-driven data curation methodology that uses an iterative evaluator→optimizer loop to refine initial plans into high-quality plan lineages while reducing manual effort; and (iii) a large-scale study of 14 LLMs across model families and sizes on their ability to generate step-by-step, executable, tool-assigned plans, evaluated with and without lineage in the prompt. Empirically, LLMs continue to struggle on compound queries and on plans longer than four steps; the highest aggregate metric-wise score is 84.8 (Claude-3-7-Sonnet), while the strongest one-shot A+ rate (Extremely Good or Very Good) is only 49.75% (o3-mini). Lineage yields mixed overall gains but improves several strong models and often helps step executability. Overall, our results expose persistent weaknesses in tool understanding—especially tool–prompt alignment and tool-usage completeness—and show that shorter, simpler plans remain markedly easier. The benchmark, evaluation framework, and findings provide a practical path for assessing and improving agentic planning with tools in enterprise question-answering settings. An anonymized dataset with human-annotated reference plans, plan lineages, and per-planner outputs for all 14 planners is available at the anonymous repository linked in the paper.
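To make the role of explicit `depends_on` relations concrete, the sketch below groups plan steps into batches that could run concurrently. The plan schema (step ids mapping to a dict with a `depends_on` list) and the tool labels are hypothetical; the abstract does not describe the benchmark's actual plan format. The scheduling uses Python's standard `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter


def parallel_batches(plan: dict[str, dict]) -> list[list[str]]:
    """Group plan steps into batches whose members have no unmet dependencies."""
    sorter = TopologicalSorter(
        {step_id: step.get("depends_on", []) for step_id, step in plan.items()}
    )
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    batches: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are all done
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical three-step plan: two independent retrieval steps, then a synthesis step.
example_plan = {
    "s1": {"tool": "text2sql", "depends_on": []},
    "s2": {"tool": "rag", "depends_on": []},
    "s3": {"tool": "llm_synthesis", "depends_on": ["s1", "s2"]},
}
example_batches = parallel_batches(example_plan)
```

Here `s1` and `s2` land in the same batch because neither depends on the other, while `s3` waits for both.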
TSAQA: Time Series Analysis Question And Answering Benchmark
Baoyu Jing | Sanhorn Chen | Lecheng Zheng | Boyu Liu | Zihao Li | Jiaru Zou | Tianxin Wei | Zhining Liu | Zhichen Zeng | Ruizhong Qiu | Xiao Lin | Yuchen Yan | Dongqi Fu | Jingchao Ni | Jingrui He | Hanghang Tong
Time series data are integral to applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore time series question answering (QA), existing benchmarks still provide limited coverage of analytical capabilities under a standardized evaluation framework. We introduce TSAQA, a novel unified benchmark designed to broaden task coverage and evaluate diverse temporal analysis capabilities. TSAQA integrates 6 diverse tasks under a single framework, ranging from conventional analysis, including anomaly detection and classification, to advanced analysis, such as characterization, comparison, data transformation, and temporal relationship analysis. Spanning 210k samples across 13 domains, the dataset employs diverse formats, including true-or-false (TF), multiple-choice (MC), and a novel puzzling (PZ) format, to comprehensively assess time series analysis. Zero-shot evaluation shows that TSAQA remains challenging for current Large Language Models (LLMs): the best-performing commercial model, Gemini-2.5-Flash, achieves 65.08 average accuracy. Although instruction tuning improves open-source models’ performance, the best-performing model, LLaMA-3.1-8B, still shows significant room for improvement. We further evaluate language-capable time series foundation models (TSFMs), showing that TSAQA extends beyond general-purpose LLMs. The data are available at https://huggingface.co/datasets/TSAQA/TSAQA-Benchmark.
Who Endorsed It? Measuring Authority Bias Across Expertise Levels in Language Models
Priyanka Mary Mammen | Emil Joswin | Shankar Venkitachalam
Prior research demonstrates that the performance of language models on reasoning tasks can be influenced by suggestions, hints, and endorsements. However, the influence of endorsement source credibility remains underexplored. We investigate whether language models exhibit systematic bias based on the perceived expertise of the provider of the endorsement. Across 4 datasets spanning mathematical, legal, and medical reasoning, we evaluate 11 models using personas representing four expertise levels per domain. Our results reveal that models are increasingly susceptible to incorrect or misleading endorsements as source expertise increases, with higher-authority sources inducing not only accuracy degradation but also increased confidence in wrong answers. We also show that this authority bias is mechanistically encoded within the model and a model can be steered away from the bias, thereby improving its performance even when an expert gives a misleading endorsement.
Reference Games as a Testbed for the Alignment of Model Uncertainty and Clarification Requests
Manar Ali | Judith Sieker | Sina Zarrieß | Hendrik Buschmeier
In human conversation, both interlocutors play an active role in maintaining mutual understanding. When listeners are uncertain about what speakers mean, for example, they can request clarification. It is an open question for language models whether they can assume a similar listener role, recognizing and expressing their own uncertainty through clarification. We argue that reference games are a suitable testbed to approach this question as they are controlled, self-contained, and make clarification needs explicit and measurable. To test this, we evaluate three vision-language models comparing a baseline reference resolution task to an experiment where the models are instructed to request clarification when uncertain. The results suggest that even in such simple tasks, models often struggle to recognize internal uncertainty and translate it into adequate clarification behavior. This demonstrates the value of reference games as testbeds for interaction qualities of (vision and) language models.
Mapping Out the NLP Evaluation Landscape with a Standard Taxonomy of Quality Criteria
Anya Belz | Simon Mille | Craig Thomson
Prior research shows that when papers report results from system evaluations in terms of a quality criterion such as Fluency, answers to two questions are normally less clear than they should be: (i) was it really Fluency that was evaluated; and (ii) was the same aspect of quality evaluated as in other evaluations also claiming to evaluate Fluency. Answers to these questions are crucial if meaningful conclusions about the Fluency of systems, independently and as compared to others, are to be drawn. We map a combined total of 1,002 individual evaluations identified in three surveys of 310 NLP papers to the standardised QCET inventory of quality criterion names and definitions. Standardisation results in up to 76% reduction in evaluation criteria names, revealing a lot of spurious difference in evaluation naming. We argue that conclusions drawn from NLP system evaluations are only fully interpretable and comparable if grounding in a standard inventory of quality criterion names and definitions forms part of experiment design and reporting, and we propose a way of achieving this.
The critique of scalar benchmark rankings as proxies for model quality is now well-established (Raji et al., 2021; Wallach et al., 2025; Bean et al., 2025; Gehrmann et al., 2021). What the field still lacks is a shared structural vocabulary for comparing, combining, and contextualizing metric design choices. This paper provides that vocabulary: a four-primitive typology of representation (𝜙), comparison (D), aggregation (A), and context (C), under which existing metrics (BLEU, BERTScore, nDCG, LLM-as-judge, calibration scores, agentic outcome measures) are explicit parameterizations of a common form. This typology is paired with a measurement–decision split: metrics are noisy estimators of latent constructs, and model selection is context-dependent Pareto optimization over construct estimates, not over raw scores. The typology makes implicit metric assumptions comparable and debatable rather than hidden inside a single number.
Position: What Are We Measuring? Rethinking Evaluation in Natural Language Generation
Wajdi Zaghouani
The field of natural language generation has accumulated a rich ecosystem of automatic evaluation metrics, yet it lacks a coherent theory of what those metrics are actually measuring. Drawing on measurement theory from the quantitative social sciences, this paper argues that current NLG evaluation practices suffer from a fundamental construct validity problem: metrics are treated as proxies for output quality without explicit specification of the underlying constructs they are meant to operationalize. We examine four dominant evaluation paradigms (reference-based metrics, embedding-based metrics, LLM-as-judge, and human evaluation) and demonstrate that each conflates construct definition with operationalization. Building on a long psychometric tradition reaching back to Cronbach and Meehl (1955) and on recent NLP work that has begun to apply this tradition to bias measurement, dialogue evaluation, and benchmark design, we propose that the field adopt a measurement modeling perspective for NLG evaluation. We borrow the concepts of construct validity, reliability, and consequential validity as a foundation for more principled evaluation, and we outline a preliminary taxonomy of NLG quality constructs as a starting point for this work.
Evaluation methodologies for language models increasingly combine multiple signals—automated metrics, LLM-as-judge ratings, human assessments, and benchmark suite results. When these signals are aggregated via averaging, the resulting evaluation confidence can substantially exceed the reliability of the weakest signal: a phenomenon we call trust inflation in evaluation. We argue that evaluation scores should be treated as epistemic claims with three properties: formality (human evaluation provides stronger evidence than an automated metric), scope (a benchmark result applies to the tested distribution, not universally), and validity windows (benchmark results expire as contamination accumulates and distributions shift). Drawing on several converging research traditions—chain-of-thought analysis, possibilistic logic, and algebraic theory—that establish weakest-link aggregation as the conservative endpoint of a parameterized operator family controlled by a single pessimism parameter, and on concrete lessons from building an evaluation harness for agentic AI, we propose that evaluation results carry explicit metadata—formality tier, scope declaration, and expiration date—to make their epistemic status transparent. We illustrate the cost of mean aggregation on the public HELM leaderboard: across 54 frontier models on ten scenarios, the top-five models ranked by mean score and by weakest-link are completely disjoint.
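The gap between mean and weakest-link aggregation can be seen on a toy example. The sketch below uses made-up scores for three hypothetical models and compares only the plain mean with `min`; it does not reproduce the parameterized operator family or the HELM analysis described in the abstract.

```python
from statistics import mean
from typing import Callable, Sequence


def top_n(
    scores: dict[str, list[float]],
    n: int,
    aggregate: Callable[[Sequence[float]], float],
) -> list[str]:
    """Rank models by an aggregate over their per-scenario scores, best first."""
    ranked = sorted(scores, key=lambda model: aggregate(scores[model]), reverse=True)
    return ranked[:n]


# Made-up per-scenario scores for three hypothetical models.
scores = {
    "model_a": [0.95, 0.90, 0.40],  # strong on average, one weak scenario
    "model_b": [0.70, 0.72, 0.71],  # uniformly moderate
    "model_c": [0.80, 0.85, 0.55],
}

by_mean = top_n(scores, 1, mean)          # favors model_a
by_weakest_link = top_n(scores, 1, min)   # favors model_b
```

With these numbers the two aggregators pick different winners, since `model_a` has the best average but also the worst single-scenario score.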
Position: A Semiotic-Hermeneutic Approach to Evaluating Meaning in LLM Summaries via the Inductive Conceptual Rating Metric
Natalie Perez | Sreyoshi Bhaduri | Aman Chadha
Meaning in human language is relational and context-dependent, and it emerges, according to Saussure (1916), through a dynamic system of signs rather than fixed relationships between words and concepts. Insights from the study of semiotics and hermeneutics emphasize that meaning arises through interpretive processes shaped by context, which has historically posed challenges for computational processing and evaluation. Building on these perspectives, this article advances an interdisciplinary framework for evaluating meaning in machine-generated language and introduces the Inductive Conceptual Rating (ICR) metric, a qualitative approach grounded in inductive content analysis and reflective thematic analysis that assesses semantic accuracy and meaning alignment in generative artificial intelligence (GenAI) outputs beyond surface-level lexical and similarity metrics. The ICR metric is applied in an empirical study that compares thematic summaries generated by a large language model (LLM) with human-generated output across five datasets (N = 50-800). Results show that although models achieve high linguistic similarity scores, they consistently underperformed relative to human outputs in capturing recurring, contextually grounded meanings. This work concludes by discussing implications for meaning evaluation and future research.
Recent years have seen rapid growth in evaluation and benchmarking in NLP, driven by advances in large language models (LLMs). This growth has shifted evaluation from measuring generalization to tracking capability, often without reference to training assumptions. We argue that this creates a conceptual gap: results are frequently interpreted without considering what models could plausibly have learned, rendering many conclusions scientifically underdetermined. We propose an expectation-aware view, where the informativeness of evaluation depends on its relationship to training data, model design, and tasks. We further distinguish between evaluation for scientific understanding and capability tracking, and provide recommendations for aligning evaluation with its intended purpose in the LLM era.
The Shared Task on Reproducibility of Evaluations in NLP (ReproNLP) 2026: Overview and Results
Anya Belz | Craig Thomson | Javier González Corbelle
We present the 2026 Shared Task on Reproducibility of Evaluations in NLP (ReproNLP’26) which followed on from five predecessor shared tasks on reproducibility of evaluations, ReproNLP’25, ReproNLP’24, ReproNLP’23, ReproGen’22 and ReproGen’21. This shared task series forms part of an ongoing research programme designed to develop theory and practice of reproducibility assessment in NLP and machine learning, against a backdrop of increasing recognition of the importance of the topic across the two fields. We describe the ReproNLP’26 shared task, summarise results from the reproduction studies submitted, and provide additional comparative analysis of their results.
Do Nugget-Based Evaluation Patterns Generalize to List-QA?
MohammadJavad Ardestani | Ehsan Kamalloo | Davood Rafiei
Evaluating long-form answers from retrieval-augmented generation (RAG) systems remains challenging: human evaluation is expensive, while automatic metrics must reliably capture answer completeness. The AutoNuggetizer framework addresses this by decomposing evaluation into atomic facts (nuggets) and using LLMs for both nugget creation and assignment. The original study validated this approach on open-ended TREC RAG queries; however, it remains unclear whether the same cost-quality tradeoffs hold for structurally different tasks. We reproduce AutoNuggetizer on seven RAG systems over the QAMPARI list-QA benchmark, where answers consist of discrete entities and omissions are more directly measurable. Our results directionally reproduce the original findings: fully automatic evaluation preserves run-level rankings, assignment-only automation yields stronger agreement than end-to-end automation, and LLM-based assignment is highly concordant with human labels while being modestly stricter. These findings support the use of AutoNuggetizer for comparative evaluation beyond open-ended RAG, while also identifying systematic biases in automatic nugget creation and assignment.
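For list-style answers, the nugget idea reduces to checking which reference entities an answer covers. The sketch below is a deliberately simplified stand-in: it uses case-insensitive substring matching where AutoNuggetizer uses an LLM for nugget assignment, and the function name and scoring rule are assumptions for illustration rather than the framework's actual metric.

```python
def nugget_recall(answer: str, nuggets: list[str]) -> float:
    """Fraction of reference nuggets that appear in the answer (case-insensitive)."""
    if not nuggets:
        raise ValueError("need at least one nugget")
    text = answer.casefold()
    # Assumption: substring match stands in for LLM-based nugget assignment.
    matched = sum(1 for nugget in nuggets if nugget.casefold() in text)
    return matched / len(nuggets)
```

An answer that names two of four reference entities would score 0.5 under this rule, which makes omissions directly visible in the score.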
ReproNLP 2026: A Third Replication of the Human Evaluation of a QAG System for Children’s Storybooks
Marcel Mroczek | Chiara Albarello | Paul-Emmanuel Floch | Maciej Gawinecki
Reproducibility of human evaluations in Natural Language Processing remains a critical open challenge. This paper presents a third independent replication of the human evaluation from Yao et al. (2022), which assessed an automated Question-Answer Generation (QAG) system for children’s storybooks against a baseline system and human-authored ground truth, across three criteria (Readability, Question Relevance, and Answer Relevance), using five NLP-literate annotators. Our replication confirms the main findings of the original study: the QAG system outperforms the baseline on Readability and Question Relevance, and Ground Truth ranks highest across all criteria. System rankings are preserved across all three criteria, with the exception of a statistically non-significant difference in Answer Relevance. This holds true despite a severe drop in inter-annotator agreement for Readability. We further document several methodological concerns, some unreported in prior replications, including data quality issues and evaluation design limitations identified during our pilot study.
In the context of the ReproNLP’26 shared task, I report on a single-criterion reproduction study of a human evaluation experiment for neural referring expression generation models (Castro Ferreira et al., 2018a), which has already been reproduced once by Mahamood (2024) for the ReproHum 2024 shared task. The experiments reported on in this paper therefore seek to second the findings from both previous experiments.
ReproHum #0866-04: Variability in Human Judgments of Sociopolitical Acceptability Across Studies
Rui Fan | Guanyi Chen
Human evaluations are essential for assessing NLP systems, but their reproducibility can be limited when judgments involve socially sensitive constructs. This paper reproduces the perceived sociopolitical acceptability evaluation in (CITATION), where annotators judged whether model-generated writer-intent implications reflected mainstream or fringe viewpoints. Using the same 600 headline–belief pairs, we collected new annotations on Prolific and compared our results with both the original study and a prior reproduction. Our scores are lower than the original results. Under a 70% threshold, these findings do not support the original conclusion that most generations were socially acceptable. Overall, our results align more closely with the prior reproduction, while also showing substantial variability, especially for GPT2-large. We argue that this variability may arise from a combination of platform differences, task framing, topic effects, and changes in social context over time. These findings highlight the importance of reporting not only annotation results, but also the evaluation setting in which subjective social judgments are collected.
ReproHum #0031–01: Reproducing a Human Readability Evaluation for Question–Answer Generation Systems
Manuela Hürlimann | Mark Cieliebak
Human evaluations play a central role in assessing natural language processing systems, yet their robustness and reproducibility remain incompletely understood. This paper reports on a reproduction of the human readability evaluation from Yao et al. (2022) for question–answer generation (QAG) systems, conducted within the ReproHum project and the ReproNLP 2026 shared task (Belz et al., 2026). The original evaluation compared three QAG systems with respect to three criteria. We reproduced the evaluation of one of these criteria, readability, using a new group of five evaluators. We report descriptive results, inter-annotator agreement, system-level comparisons, and cross-study robustness metrics compared to the original study and two previous reproductions. Our results support all conclusions of the original evaluation and are largely consistent with two previous reproductions.
ReproHum #0033-05: Human Evaluation Report on "Generating Scientific Definitions with Controllable Complexity"
Ines Arous | Jackie Chi Kit Cheung
Human evaluation remains a central component of assessing NLG systems, especially for open-ended or creative generation tasks. Yet, the field still lacks standardized practices for designing and reporting such evaluations. In this paper, we present a reproduction study of the human evaluation conducted by August et al. for their method of generating scientific definitions with controllable complexity. By closely replicating their experimental setup, we find that our results partially align with the original findings, suggesting a moderate level of reproducibility.
We describe our attempt to reproduce the evaluation of a single human evaluation quality criterion from the paper “Reproducing a Recipe for Arbitrary Text Style Transfer with LLMs”. We describe the approach and the challenges involved in reproducing the human evaluation as done by the original authors. In particular, we report negative results obtained during the reproduction, and we compare our results with an earlier reproduction of the same experiment. Finally, we describe the insights we gained from attempting this particular reproduction and the barriers that remain to successful reproductions. We hope the results and insights presented will help the broader NLP research community improve both how human evaluations are conducted and how reproducible NLP experiments are in the future.