Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems

Mousumi Akter, Tahiya Chowdhury, Steffen Eger, Christoph Leiter, Juri Opitz, Erion Çano (Editors)


Anthology ID: 2025.eval4nlp-1
Month: December
Year: 2025
Address: Mumbai, India
Venues: Eval4NLP | WS
Publisher: Association for Computational Linguistics
URL: https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.eval4nlp-1/
ISBN: 979-8-89176-305-0
PDF: https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.eval4nlp-1.pdf

Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems
Mousumi Akter | Tahiya Chowdhury | Steffen Eger | Christoph Leiter | Juri Opitz | Erion Çano

Beyond Tokens and Into Minds: Future Directions for Human-Centered Evaluation in Machine Translation Post-Editing
Molly Apsel | Sunil Kothari | Manish Mehta | Vasudevan Sundarababu

Machine translation post-editing (MTPE) is central to evaluating and ensuring translation quality, particularly for low-resource languages (LRLs), where systems are more error-prone than for high-resource languages. Traditional token-based models segment text according to statistical patterns of their (primarily high-resource) training data, which can distort meaning, fragment words in morphologically rich languages, and complicate MTPE and evaluation. Current evaluation metrics also tend to emphasize surface-level similarity to reference texts, overlooking how humans actually approach translation tasks and creating issues when references are unavailable or a more abstract interpretation is needed. In this position paper, we argue that emerging architectures (Large Concept Models [LCMs] and Byte Latent Transformers [BLTs]) and insights from cognitive science open new possibilities for MTPE frameworks. LCMs represent meaning at the conceptual level, enabling evaluation of different translation approaches and the robustness of such models in MT. At the same time, BLTs operate below the token level, potentially easing post-editing across diverse language scripts. Drawing on cognitive theories of bilingualism and meaning representation, we outline hypotheses and research methods for evaluating post-editing data, translation quality, and interface design toward more robust, human-centered MT evaluation.

Measuring Visual Understanding in Telecom domain: Performance Metrics for Image-to-UML conversion using VLMs
H. G. Ranjani | Rutuja Prabhudesai

Telecom domain 3GPP documents are replete with images containing sequence diagrams. Advances in Vision-Language Models (VLMs) have eased the conversion of such images to machine-readable PlantUML (puml) formats. However, there is a gap in the evaluation of such conversions: existing works do not compare puml scripts for the various components. In this work, we propose performance metrics to measure the effectiveness of such conversions. A subset of sequence diagrams from 3GPP documents is chosen to be representative of actual domain-specific scenarios. We compare puml outputs from two VLMs, Claude Sonnet and GPT-4V, against manually created ground-truth representations. We use version control tools to capture differences and introduce standard performance metrics to measure accuracy along various components: participant identification, message flow accuracy, sequence ordering, and grouping construct preservation. We demonstrate the effectiveness of the proposed metrics in quantifying conversion errors across these components. The results show that nodes, edges, and messages are accurately captured. However, we observe that VLMs do not necessarily perform well on complex structures such as notes, boxes, and groups. Our experiments and performance metrics indicate a need for better representation of these components in training data for fine-tuned VLMs.
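
To make the component-level comparison concrete, here is a minimal sketch (not the authors' implementation) that extracts participants and message edges from two PlantUML scripts with simple regexes and scores them with precision, recall, and F1; the regex patterns and the file names are illustrative assumptions.

```python
# Minimal sketch: score a VLM-generated PlantUML sequence diagram against a
# manually created ground truth, per component (illustrative, not the paper's code).
import re

def parse_puml(text: str):
    """Extract participants and (sender, receiver, label) message edges from a puml script."""
    participants = set(re.findall(r'^\s*(?:participant|actor)\s+"?([\w ]+?)"?\s*$', text, re.M))
    edges = {(s.strip(), r.strip(), lbl.strip())
             for s, r, lbl in re.findall(r'^\s*"?([\w ]+?)"?\s*-+>>?\s*"?([\w ]+?)"?\s*:\s*(.+)$', text, re.M)}
    return {"participants": participants, "messages": edges}

def prf(pred: set, gold: set):
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

generated = parse_puml(open("generated.puml").read())      # hypothetical file names
reference = parse_puml(open("ground_truth.puml").read())
for component in ("participants", "messages"):
    print(component, prf(generated[component], reference[component]))
```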

Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation
Naila Shafirni Hidayat | Muhammad Dehan Al Kautsar | Alfan Farizki Wicaksono | Fajri Koto

The performance of large language models (LLMs) continues to improve, as reflected in rising scores on standard benchmarks. However, the lack of transparency around training data raises concerns about potential overlap with evaluation sets and the fairness of reported results. Although prior work has proposed methods for detecting data leakage, these approaches primarily focus on identifying outliers and have not been evaluated under controlled simulated leakage conditions. In this work, we compare existing leakage detection techniques, namely permutation and n-gram-based methods, under a continual pretraining setup that simulates real-world leakage scenarios, and additionally explore a lightweight method we call semi-half question. We further introduce two efficient extensions, permutation-R and permutation-Q. While semi-half offers a low-cost alternative, our analysis shows that the n-gram method consistently achieves the highest F1-Score, performing competitively with permutation-Q. We also refine these techniques to support instance-level detection and reduce computational overhead. Leveraging the best-performing method, we create cleaned versions of MMLU and HellaSwag, and re-evaluate several LLMs. Our findings present a practical path toward more reliable and transparent evaluations, and we recommend contamination checks as a standard practice before releasing benchmark results.
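
For intuition on the n-gram family of checks, the sketch below implements a simple surface n-gram overlap test between benchmark items and a training corpus; the choice of n, the threshold, and the corpus handling are illustrative assumptions, and the permutation-based and semi-half variants discussed in the paper are not shown.

```python
# Illustrative instance-level n-gram overlap check for simulated leakage
# (a simplified stand-in, not the paper's exact method).
def ngrams(text: str, n: int = 8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_ratio(item: str, corpus_ngrams: set, n: int = 8) -> float:
    item_ngrams = ngrams(item, n)
    return len(item_ngrams & corpus_ngrams) / max(len(item_ngrams), 1)

def flag_contaminated(benchmark_items, corpus_text, n=8, threshold=0.5):
    """Return indices of benchmark items whose n-gram overlap with the corpus exceeds the threshold."""
    corpus_ngrams = ngrams(corpus_text, n)
    return [i for i, item in enumerate(benchmark_items)
            if overlap_ratio(item, corpus_ngrams, n) >= threshold]
```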

Reliable Inline Code Documentation with LLMs: Fine-Grained Evaluation of Comment Quality and Coverage
Rohan Patil | Gaurav Tirodkar | Shubham Gatfane

Code documentation plays a vital role in enhancing collaboration, maintainability, and comprehension throughout the software development lifecycle. This becomes especially critical in legacy codebases, where missing or outdated comments hinder effective debugging and onboarding. Among documentation types, inline comments are particularly valuable for conveying program logic and supporting code reuse. With the growing capabilities of large language models (LLMs), their application to tasks such as code understanding and summarization has gained significant attention in the NLP community. However, the specific task of generating high-quality inline code comments using LLMs remains relatively under-explored. In this work, we conduct a systematic evaluation of several state-of-the-art LLMs to assess their effectiveness in producing meaningful and context-aware inline documentation. To this end, we curate a dataset of well-documented code snippets and propose a fine-grained evaluation framework that assesses both the quality and sufficiency of generated comments at the statement level. We further investigate the impact of prompting strategies and offer a comparative analysis across models ranging from large foundational LLMs to smaller, code-specialized variants, within the domain of inline code documentation. Our findings offer actionable insights that can guide the development of effective and scalable systems for automated inline code documentation.
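
To illustrate what a statement-level coverage score could look like, the sketch below computes a naive comment-coverage ratio for Python source, counting a statement as covered if its line carries a comment or directly follows one; this criterion is an assumption for illustration, not the paper's rubric.

```python
# Naive statement-level comment coverage for Python code (illustrative only).
import io
import tokenize

def comment_coverage(source: str) -> float:
    comment_lines = {tok.start[0]
                     for tok in tokenize.generate_tokens(io.StringIO(source).readline)
                     if tok.type == tokenize.COMMENT}
    code_lines = [i + 1 for i, line in enumerate(source.splitlines())
                  if line.strip() and not line.strip().startswith("#")]
    covered = sum(1 for ln in code_lines if ln in comment_lines or ln - 1 in comment_lines)
    return covered / len(code_lines) if code_lines else 0.0

print(comment_coverage("x = 1  # init counter\n\ny = x + 1\n"))  # -> 0.5
```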

Fair Play in the Newsroom: Actor-Based Filtering Gender Discrimination in Text Corpora
Stefanie Urchs | Veronika Thurner | Matthias Aßenmacher | Christian Heumann | Stephanie Thiemichen

Language corpora are the foundation of most natural language processing research, yet they often reproduce structural inequalities. One such inequality is gender discrimination in how actors are represented, which can distort analyses and perpetuate discriminatory outcomes. This paper introduces a user-centric, actor-level pipeline for detecting and mitigating gender discrimination in large-scale text corpora. By combining discourse-aware analysis with metrics for sentiment, syntactic agency, and quotation styles, our method enables both fine-grained auditing and exclusion-based balancing. Applied to the taz2024full corpus of German newspaper articles (1980–2024), the pipeline yields a more gender-balanced dataset while preserving core dynamics of the source material. Our findings show that structural asymmetries can be reduced through systematic filtering, though subtler biases in sentiment and framing remain. We release the tools and reports to support further research in discourse-based fairness auditing and equitable corpus construction.
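
As a rough illustration of exclusion-based balancing at the actor level, the sketch below greedily drops the most male-skewed articles until a target gender ratio of represented actors is reached; the per-article gender counts, the target ratio, and the greedy strategy are assumptions for illustration, not the authors' pipeline.

```python
# Illustrative greedy exclusion-based balancing (not the authors' pipeline).
def balance_by_actor_gender(articles, target_ratio=1.0):
    """articles: dicts with 'female_actors' and 'male_actors' counts per article.
    Drop the most male-skewed articles until the corpus-level female/male actor
    ratio reaches target_ratio (or no articles remain)."""
    kept = sorted(articles, key=lambda a: a["male_actors"] - a["female_actors"])
    while kept:
        females = sum(a["female_actors"] for a in kept)
        males = sum(a["male_actors"] for a in kept)
        if males == 0 or females / males >= target_ratio:
            break
        kept.pop()  # remove the currently most male-skewed article
    return kept
```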

Between the Drafts: An Evaluation Framework for Identifying Quality Improvement and Stylistic Differences in Scientific Texts
Danqing Chen | Ingo Weber | Felix Dietrich

This study explores the potential of a lightweight, open-source Large Language Model (LLM), demonstrating how its integration with Retrieval-Augmented Generation (RAG) can support cost-effective evaluation of revision quality and writing style differentiation. By retrieving reference documents from a carefully chosen and constructed corpus of peer-reviewed conference proceedings, our framework leverages few-shot in-context learning to track manuscript revisions and venue-specific writing styles. We demonstrate that the LLM-based evaluation aligns closely with human revision histories—consistently recognizing quality improvements across revision stages and distinguishing writing styles associated with different conference venues. These findings highlight how a carefully designed evaluation framework, integrated with adequate, representative data, can advance automated assessment of scientific writing.

“The dentist is an involved parent, the bartender is not”: Revealing Implicit Biases in QA with Implicit BBQ
Aarushi Wagh | Saniya Srivastava

Existing benchmarks evaluating biases in large language models (LLMs) primarily rely on explicit cues, declaring protected attributes such as religion, race, or gender by name. However, real-world interactions often contain implicit biases, inferred subtly through names, cultural cues, or traits. This critical oversight creates a significant blind spot in fairness evaluation. We introduce ImplicitBBQ, a benchmark extending the Bias Benchmark for QA (BBQ) with implicitly cued protected attributes across 6 categories. Our evaluation of GPT-4o on ImplicitBBQ reveals a troubling performance disparity relative to explicit BBQ prompts, with accuracy declining by up to 7% in the “sexual orientation” subcategory and consistent declines across most other categories. This indicates that current LLMs contain implicit biases undetected by explicit benchmarks. ImplicitBBQ offers a crucial tool for nuanced fairness evaluation in NLP.

SynClaimEval: A Framework for Evaluating the Utility of Synthetic Data in Long-Context Claim Verification
Mohamed Elaraby | Jyoti Prakash Maheswari

Large Language Models (LLMs) with extended context windows promise direct reasoning over long documents, reducing the need for chunking or retrieval. Constructing annotated resources for training and evaluation, however, remains costly. Synthetic data offers a scalable alternative, and we introduce SynClaimEval, a framework for evaluating synthetic data utility in long-context claim verification—a task central to hallucination detection and fact-checking. Our framework examines three dimensions: (i) input characteristics, by varying context length and testing generalization to out-of-domain benchmarks; (ii) synthesis logic, by controlling claim complexity and error type variation; and (iii) explanation quality, measuring the degree to which model explanations provide evidence consistent with predictions. Experiments across benchmarks show that long-context synthesis can improve verification in base instruction-tuned models, particularly when augmenting existing human-written datasets. Moreover, synthesis enhances explanation quality, even when verification scores don’t improve, underscoring its potential to strengthen both performance and explainability.

Evaluation of Generated Poetry
David Mareček | Kateřina Motalík Hodková | Tomáš Musil | Rudolf Rosa

We propose a range of automated metrics for the evaluation of generated poetry. The metrics measure various aspects of poetry: rhyming, metre, syntax, semantics, and the amount of unknown words. In a case study, we implement the metrics for Czech, apply them to poetry generated by several automated systems as well as human-written poetry, and correlate them with human judgment. We find that most of the proposed metrics correlate well with the corresponding human evaluation, but semantically oriented metrics are much better predictors of the overall impression than metrics evaluating formal properties.
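
As one example of how such a formal metric can be computed and related to human judgment, the sketch below scores rhyming as suffix agreement between adjacent line endings and correlates poem-level scores with human ratings using Spearman's rho; the suffix length, the adjacent-pair rhyme scheme, and the placeholder data are simplifying assumptions, not the paper's Czech-specific implementation.

```python
# Toy rhyme metric plus correlation with human judgments (illustrative only).
from scipy.stats import spearmanr

def rhyme_score(poem: str, suffix_len: int = 3) -> float:
    """Fraction of adjacent line pairs whose final words share a suffix."""
    endings = [line.split()[-1].lower() for line in poem.splitlines() if line.split()]
    pairs = list(zip(endings, endings[1:]))
    if not pairs:
        return 0.0
    return sum(a[-suffix_len:] == b[-suffix_len:] for a, b in pairs) / len(pairs)

poems = ["...", "..."]        # placeholder: generated and human-written poems
human_ratings = [3.5, 4.2]    # placeholder: overall-impression ratings
rho, p_value = spearmanr([rhyme_score(text) for text in poems], human_ratings)
print(rho, p_value)
```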

TitleTrap: Probing Presentation Bias in LLM-Based Scientific Reviewing
Shurui Du

Large language models (LLMs) are now used in scientific peer review, but their judgments can still be influenced by how information is presented. We study how the style of a paper’s title affects the way LLMs score scientific work. To control for content variation, we build the TitleTrap benchmark using abstracts generated by a language model for common research topics in computer vision and NLP. Each abstract is paired with three titles: a branded colon style, a plain descriptive style, and an interrogative style, while the abstract text remains fixed. We ask GPT-4o and Claude to review these title–abstract pairs under the same instructions. Our results show that title style alone can change the scores: branded titles often receive higher ratings, while interrogative titles sometimes lead to lower assessments of rigor. These findings reveal a presentation bias in LLM-based peer review and suggest the need for better methods to reduce such bias and support fairer automated evaluation.
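
To show how such a presentation-bias probe can be aggregated, the sketch below groups review scores by title style and reports per-style means and the gap to the plain style; the records and field names are hypothetical, and the actual LLM querying step is out of scope here.

```python
# Aggregate hypothetical reviewer scores by title style (illustrative only).
from collections import defaultdict
from statistics import mean

records = [  # hypothetical scores from an LLM reviewer on TitleTrap-style items
    {"style": "branded", "score": 6.5}, {"style": "plain", "score": 6.0},
    {"style": "interrogative", "score": 5.5}, {"style": "branded", "score": 7.0},
    {"style": "plain", "score": 6.2}, {"style": "interrogative", "score": 5.8},
]

by_style = defaultdict(list)
for rec in records:
    by_style[rec["style"]].append(rec["score"])

means = {style: mean(scores) for style, scores in by_style.items()}
for style, m in means.items():
    print(f"{style}: mean={m:.2f}, gap vs plain={m - means['plain']:+.2f}")
```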

Beyond the Rubric: Cultural Misalignment in LLM Benchmarks for Sexual and Reproductive Health
Sumon Kanti Dey | Manvi S | Zeel Mehta | Meet Shah | Unnati Agrawal | Suhani Jalota | Azra Ismail

Large Language Models (LLMs) have been positioned as having the potential to expand access to health information in the Global South, yet their evaluation remains heavily dependent on benchmarks designed around Western norms. We present insights from a preliminary benchmarking exercise with a chatbot for sexual and reproductive health (SRH) for an underserved community in India. We evaluated the chatbot using HealthBench, a benchmark for conversational health models by OpenAI. We extracted 637 SRH queries from the dataset and evaluated on the 330 single-turn conversations among them. Responses were evaluated using HealthBench’s rubric-based automated grader, which rated responses consistently low. However, qualitative analysis by trained annotators and public health experts revealed that many responses were actually culturally appropriate and medically accurate. We highlight recurring issues, particularly a Western bias in legal framing and norms (e.g., breastfeeding in public), dietary assumptions (e.g., whether fish is safe to eat during pregnancy), and costs (e.g., insurance models). Our findings demonstrate the limitations of current benchmarks in capturing the effectiveness of systems built for different cultural and healthcare contexts. We argue for the development of culturally adaptive evaluation frameworks that meet quality standards while recognizing the needs of diverse populations.

Non-Determinism of “Deterministic” LLM System Settings in Hosted Environments
Berk Atıl | Sarp Aykent | Alexa Chittams | Lisheng Fu | Rebecca J. Passonneau | Evan Radcliffe | Guru Rajan Rajagopal | Adam Sloan | Tomasz Tudrej | Ferhan Ture | Zhe Wu | Lixinyu Xu | Breck Baldwin

LLM (large language model) users of hosted providers commonly notice that outputs can vary for the same inputs under settings expected to be deterministic. While it is difficult to get exact statistics, recent reports on specialty news sites and discussion boards suggest that among users in all communities, the majority of LLM usage today is through cloud-based APIs. Yet the questions of how pervasive non-determinism is, and how much it affects performance results, have not to our knowledge been systematically investigated. We apply five API-based LLMs configured to be deterministic to eight diverse tasks across 10 runs. Experiments reveal accuracy variations of up to 15% across runs, with a gap of up to 70% between best possible performance and worst possible performance. No LLM consistently delivers the same outputs or accuracies, regardless of task. We speculate about the sources of non-determinism such as input buffer packing across multiple jobs. To better quantify our observations, we introduce metrics focused on quantifying determinism, TARr@N for the total agreement rate at N runs over raw output, and TARa@N for total agreement rate of parsed-out answers. Our code and data will be publicly available at https://github.com/Anonymous.
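
From the definitions given in the abstract, the agreement metrics can be sketched as below: TARr@N is the fraction of items whose raw outputs are identical across all N runs, and TARa@N is the same computed over parsed-out answers; the parsing function here is a placeholder assumption.

```python
# Sketch of the total agreement rate metrics as described in the abstract.
def tar_at_n(runs_per_item, parse=None):
    """runs_per_item: list of lists, the N raw outputs collected for each item.
    Returns the fraction of items whose N outputs all agree
    (TARr@N if parse is None, TARa@N if parse extracts the final answer)."""
    def canon(output):
        return parse(output) if parse else output
    agree = sum(1 for runs in runs_per_item if len({canon(o) for o in runs}) == 1)
    return agree / len(runs_per_item) if runs_per_item else 0.0

# Hypothetical usage: 3 items, N = 2 runs each.
runs = [["A", "A"], ["B", "b"], ["C", "C"]]
print(tar_at_n(runs))                   # TARr@2 = 2/3
print(tar_at_n(runs, parse=str.upper))  # TARa@2 = 1.0 with a toy answer parser
```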

InFiNITE: Indian Financial Narrative Inference Tasks & Evaluations
Sohom Ghosh | Arnab Maji | Sudip Kumar Naskar

This paper introduces Indian Financial Narrative Inference Tasks and Evaluations (InFiNITE), a comprehensive framework for analyzing India’s financial narratives through three novel inference tasks. Firstly, we present multi-modal earnings call analysis by integrating transcripts, presentation visuals, and market indicators via the Multi-Modal Indian Earnings Calls (MiMIC) dataset, enabling holistic prediction of post-call stock movements. Secondly, our Budget-Assisted Sectoral Impact Ranking (BASIR) dataset aids in systematically decoding government fiscal narratives by classifying budget excerpts into 81 economic sectors and evaluating their post-announcement equity performance. Thirdly, we introduce Bharat IPO Rating (BIR) datasets to redefine Initial Public Offering (IPO) evaluation through prospectus analysis, classifying potential investments into four recommendation categories (Apply, May Apply, Neutral, Avoid). By unifying textual, visual, and quantitative modalities across corporate, governmental, and public investment domains, InFiNITE addresses critical gaps in Indian financial narrative analysis. The framework's open-source datasets, covering earnings calls, union budgets, and IPO prospectuses, establish India-specific benchmark resources for computational economics research under permissive licenses. For investors, InFiNITE enables data-driven identification of capital allocation opportunities and IPO risks, while policymakers gain structured insights to assess the impact of Indian fiscal communication. By releasing these datasets publicly, we aim to facilitate research in computational economics and financial text analysis, particularly for the Indian market.

Test Set Quality in Multilingual LLM Evaluation
Chalamalasetti Kranti | Gabriel Bernier-Colborne | Yvan Gauthier | Sowmya Vajjala

Several multilingual benchmark datasets have been developed in a semi-automatic manner in the recent past to measure progress and understand the state-of-the-art in the multilingual capabilities of Large Language Models (LLMs). However, little attention has been paid to the quality of the datasets themselves, despite previous work identifying errors even in fully human-annotated test sets. In this paper, we manually analyze recent multilingual evaluation sets in two languages, French and Telugu, identifying several errors in the datasets in the process. We compare the performance difference across several LLMs with the original and revised versions of the datasets and identify large differences (almost 10% in some cases) in both languages. Based on these results, we argue that test sets should not be considered immutable and should be revisited, checked for correctness, and potentially versioned. We conclude with recommendations for both dataset creators and consumers on addressing dataset quality issues.
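
A minimal way to quantify the reported gap is to score each model on both versions of a test set and report the accuracy difference, as sketched below; the toy predictions and labels are hypothetical.

```python
# Compare model accuracy on original vs. revised test sets (illustrative only).
def accuracy(predictions, gold):
    assert len(predictions) == len(gold) and gold
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def version_gap(preds_original, gold_original, preds_revised, gold_revised):
    """Accuracy difference between the revised and original versions of a test set."""
    return accuracy(preds_revised, gold_revised) - accuracy(preds_original, gold_original)

# Hypothetical predictions and labels for one model.
gap = version_gap(["A", "B", "C"], ["A", "C", "C"],
                  ["A", "C", "C"], ["A", "C", "C"])
print(f"accuracy change after revision: {gap:+.2%}")  # +33.33% in this toy example
```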