Proceedings of the 18th International Natural Language Generation Conference

Lucie Flek, Shashi Narayan, Lê Hồng Phương, Jiahuan Pei (Editors)


Anthology ID:
2025.inlg-main
Month:
October
Year:
2025
Address:
Hanoi, Vietnam
Venue:
INLG
SIG:
SIGGEN
Publisher:
Association for Computational Linguistics
URL:
https://preview.aclanthology.org/ingest-luhme/2025.inlg-main/
DOI:
Bib Export formats:
BibTeX
PDF:
https://preview.aclanthology.org/ingest-luhme/2025.inlg-main.pdf

pdf bib
Proceedings of the 18th International Natural Language Generation Conference
Lucie Flek | Shashi Narayan | Lê Hồng Phương | Jiahuan Pei

pdf bib
Enhancing Coherence and Interestingness in Knowledge-Grounded Dialogue Generation
Hiroki Onozeki | Michimasa Inaba

Open-domain dialogue systems have been increasingly applied in various situations, with a growing need to improve user engagement. One effective approach is to generate responses based on interesting external knowledge using knowledge-grounded response generation models. However, relying solely on interestingness can lead to incoherent responses, potentially diminishing user engagement. This paper proposes a novel method for generating engaging responses while maintaining contextual coherence. Our approach leverages a pre-trained knowledge-grounded response generation model and modifies the knowledge selection process to enhance response coherence and interestingness without requiring additional training. First, knowledge candidates with high contextual relevance are retrieved. These candidates are then reranked based on their interestingness and used to generate the responses. Finally, the method detects dialogue breakdowns and regenerates responses as necessary to ensure coherence. We conducted experiments using the Wizard of Wikipedia dataset and two state-of-the-art response generation models. The results indicate that the proposed method improves both response coherence and interestingness.

pdf bib
Evaluating LLM-Generated Versus Human-Authored Responses in Role-Play Dialogues
Dongxu Lu | Johan Jeuring | Albert Gatt

Evaluating large language models (LLMs) in long-form, knowledge-grounded role-play dialogues remains challenging. This study compares LLM-generated and human-authored responses in multi-turn professional training simulations through human evaluation (N = 38) and automated LLM-as-a-judge assessment. Human evaluation revealed significant degradation in LLM-generated response quality across turns, particularly in naturalness, context maintenance and overall quality, while human-authored responses progressively improved. In line with this finding, participants also indicated a consistent preference for human-authored dialogue. These human judgements were validated by our automated LLM-as-a-judge evaluation, where GEMINI 2.0 FLASH achieved strong alignment with human evaluators on both zero-shot pairwise preference and stochastic 6-shot construct ratings, confirming the widening quality gap between LLM and human responses over time. Our work contributes a multi-turn benchmark exposing LLM degradation in knowledge-grounded role-play dialogues and provides a validated hybrid evaluation framework to guide the reliable integration of LLMs in training simulations.

pdf bib
Human ratings of LLM response generation in pair-programming dialogue
Cecilia Domingo | Paul Piwek | Svetlana Stoyanchev | Michel Wermelinger | Kaustubh Adhikari | Rama Sanand Doddipatla

We take first steps in exploring whether Large Language Models (LLMs) can be adapted to dialogic learning practices, specifically pair programming — LLMs have primarily been implemented as programming assistants, not fully exploiting their dialogic potential. We used new dialogue data from real pair-programming interactions between students, prompting state-of-the-art LLMs to assume the role of a student when generating a response that continues the real dialogue. We asked human annotators to rate human and AI responses on the criteria through which we operationalise the LLMs’ suitability for educational dialogue: Coherence, Collaborativeness, and whether they appeared human. Results show model differences, with Llama-generated responses being rated similarly to human answers on all three criteria. Thus, for at least one of the models we investigated, LLM utterance-level response generation appears to be suitable for pair-programming dialogue.

pdf bib
Do My Eyes Deceive Me? A Survey of Human Evaluations of Hallucinations in NLG
Patricia Schmidtova | Eduardo Calò | Simone Balloccu | Dimitra Gkatzia | Rudali Huidrom | Mateusz Lango | Fahime Same | Vilém Zouhar | Saad Mahamood | Ondrej Dusek

Hallucinations are one of the most pressing challenges for large language models (LLMs). While numerous methods have been proposed to detect and mitigate them automatically, human evaluation continues to serve as the gold standard. However, these human evaluations of hallucinations show substantial variation in definitions, terminology, and evaluation practices. In this paper, we survey 64 studies involving human evaluation of hallucination published between 2019 and 2024, to investigate how hallucinations are currently defined and assessed. Our analysis reveals a lack of consistency in definitions and exposes several concerning methodological shortcomings. Crucial details, such as evaluation guidelines, user interface design, inter-annotator agreement metrics, and annotator demographics, are frequently under-reported or omitted altogether.

pdf bib
Truth or Twist? Optimal Model Selection for Reliable Label Flipping Evaluation in LLM-based Counterfactuals
Qianli Wang | Van Bach Nguyen | Nils Feldhus | Luis Felipe Villa-Arenas | Christin Seifert | Sebastian Möller | Vera Schmitt

Counterfactual examples are widely employed to enhance the performance and robustness of large language models (LLMs) through counterfactual data augmentation (CDA). However, the selection of the judge model used to evaluate label flipping, the primary metric for assessing the validity of generated counterfactuals for CDA, yields inconsistent results. To decipher this, we define four types of relationships between the counterfactual generator and judge models: being the same model, belonging to the same model family, being independent models, and having a distillation relationship. Through extensive experiments involving two state-of-the-art LLM-based methods, three datasets, four generator models, and 15 judge models, complemented by a user study (n = 90), we demonstrate that judge models with an independent, non-fine-tuned relationship to the generator model provide the most reliable label flipping evaluations. Relationships between the generator and judge models, which are closely aligned with the user study for CDA, result in better model performance and robustness. Nevertheless, we find that the gap between the most effective judge models and the results obtained from the user study remains considerably large. This suggests that a fully automated pipeline for CDA may be inadequate and requires human intervention.

pdf bib
Assessing Semantic Consistency in Data‐to‐Text Generation: A Meta-Evaluation of Textual, Semantic and Model-Based Metrics
Rudali Huidrom | Michela Lorandi | Simon Mille | Craig Thomson | Anya Belz

Ensuring semantic consistency between semantic-triple inputs and generated text is crucial in data‐to‐text generation, but continues to pose challenges both during generation and in evaluation. To determine how accurately semantic consistency can currently be assessed, we meta-evaluate 29 different evaluation methods in terms of their ability to predict human semantic-consistency ratings. The evaluation methods include embeddings‐based, overlap‐based, and edit‐distance metrics, as well as learned regressors and a prompted ‘LLM‐as‐judge’ protocol. We meta-evaluate on two datasets: the WebNLG 2017 human evaluation dataset, and a newly created WebNLG-style dataset that none of the methods can have seen during training. We find that none of the traditional textual similarity metrics or the pre-Transformer model-based metrics are suitable for the task of semantic consistency assessment. LLM-based methods perform well on the whole, but their best correlations with human judgments still lag behind those seen in other text generation tasks.
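
As a concrete illustration of the meta-evaluation step described above, the minimal sketch below correlates candidate-metric scores with human semantic-consistency ratings using Pearson and Spearman coefficients; the metric names and numbers are hypothetical placeholders, not results or code from the paper.

```python
# Illustrative sketch, not the authors' code: correlate each candidate metric's
# scores with human semantic-consistency ratings of the same system outputs.
from scipy.stats import pearsonr, spearmanr

def meta_evaluate(metric_scores: dict, human_ratings: list) -> dict:
    """Return (Pearson r, Spearman rho) per metric against human ratings."""
    results = {}
    for name, scores in metric_scores.items():
        r, _ = pearsonr(scores, human_ratings)
        rho, _ = spearmanr(scores, human_ratings)
        results[name] = (round(r, 3), round(rho, 3))
    return results

# Hypothetical scores for three outputs that humans rated on a 1-5 consistency scale.
print(meta_evaluate({"overlap_metric": [0.31, 0.52, 0.44],
                     "llm_judge": [2.0, 4.5, 3.5]},
                    human_ratings=[1.0, 4.5, 3.0]))
```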

pdf bib
FreshTab: Sourcing Fresh Data for Table-to-Text Generation Evaluation
Kristýna Onderková | Ondrej Platek | Zdeněk Kasner | Ondrej Dusek

Table-to-text generation (insight generation from tables) is a challenging task that requires precision in analyzing the data. In addition, the evaluation of existing benchmarks is affected by contamination of Large Language Model (LLM) training data as well as domain imbalance. We introduce FreshTab, which generates table-to-text benchmarks on the fly from Wikipedia, to combat the LLM data contamination problem and enable domain-sensitive evaluation. While non-English table-to-text datasets are limited, FreshTab collects datasets in different languages on demand (we experiment with German, Russian and French in addition to English). We find that insights generated by LLMs from recent tables collected by our method score clearly worse on automatic metrics, but this is not reflected in LLM-based and human evaluations. Domain effects are visible in all evaluations, showing that a domain-balanced benchmark is more challenging.

pdf bib
KDA: Knowledge Distillation Adapter for Cross-Lingual Transfer
Ta-Bao Nguyen | Nguyen-Phuong Phan | Tung Le | Huy Tien Nguyen

State-of-the-art cross-lingual transfer often relies on massive multilingual models, but their prohibitive size and computational cost limit their practicality for low-resource languages. An alternative is to adapt powerful, task-specialized monolingual models, but this presents challenges in bridging the vocabulary and structural gaps between languages. To address this, we propose KDA, a Knowledge Distillation Adapter framework that efficiently adapts a fine-tuned, high-resource monolingual model to a low-resource target language. KDA utilizes knowledge distillation to transfer the source model’s task-solving capabilities to the target language in a parameter-efficient manner. In addition, we introduce a novel adapter architecture that integrates source-language token embeddings while learning new positional embeddings, directly mitigating cross-lingual representational mismatches. Our empirical results on zero-shot transfer for Vietnamese Sentiment Analysis demonstrate that KDA significantly outperforms existing methods, offering a new, effective, and computationally efficient pathway for cross-lingual transfer.

pdf bib
ViNumFCR: A Novel Vietnamese Benchmark for Numerical Reasoning Fact Checking on Social Media News
Nhi Ngoc Phuong Luong | Anh Thi Lan Le | Tin Van Huynh | Kiet Van Nguyen | Ngan Nguyen

In the digital era, the internet provides rapid and convenient access to vast amounts of information. However, much of this information remains unverified, particularly with the increasing prevalence of falsified numerical data, leading to public confusion and negative societal impacts. To address this issue, we developed ViNumFCR, the first dataset dedicated to fact-checking numerical information in Vietnamese. It comprises over 10,000 samples collected and constructed from online newspapers across 12 different topics. We assessed the performance of various fact-checking models, including Pretrained Language Models and Large Language Models, alongside retrieval techniques for gathering supporting evidence. Experimental results demonstrate that the XLM-R_Large model achieved the highest accuracy of 90.05% on the fact-checking task, while the combined SBERT + BM25 model attained a precision of over 97% on the evidence retrieval task. Additionally, we conducted an in-depth analysis of the linguistic features of the dataset to understand the factors influencing model performance. The ViNumFCR dataset is publicly available to support further research.

pdf bib
Dual Debiasing: Remove Stereotypes and Keep Factual Gender for Fair Language Modeling and Translation
Tomasz Limisiewicz | David Mareček | Tomáš Musil

Mitigation of biases, such as language models’ reliance on gender stereotypes, is a crucial endeavor required for the creation of reliable and useful language technology. The crucial aspect of debiasing is to ensure that the models preserve their versatile capabilities, including their ability to solve language tasks and equitably represent various genders. To address these issues, we introduce Dual Debiasing Algorithm through Model Adaptation (2DAMA). Novel Dual Debiasing enables robust reduction of stereotypical bias while preserving desired factual gender information encoded by language models. We show that 2DAMA effectively reduces gender bias in language models for English and is one of the first approaches facilitating the mitigation of their stereotypical tendencies in translation. The proposed method’s key advantage is the preservation of factual gender cues, which are useful in a wide range of natural language processing tasks.

pdf bib
Mining Contextualized Visual Associations from Images for Creativity Understanding
Ananya Sahu | Amith Ananthram | Kathleen McKeown

Understanding another person’s creative output requires a shared language of association. However, when training vision-language models such as CLIP, we rely on web-scraped datasets containing short, predominantly literal, alt-text. In this work, we introduce a method for mining contextualized associations for salient visual elements in an image that can scale to any unlabeled dataset. Given an image, we can use these mined associations to generate high-quality creative captions at increasing degrees of abstraction. With our method, we produce a new dataset of visual associations and 1.7m creative captions for the images in MSCOCO. Human evaluation confirms that these captions remain visually grounded while exhibiting recognizably increasing abstraction. Moreover, fine-tuning a visual encoder on this dataset yields meaningful improvements in zero-shot image-text retrieval in two creative domains: poetry and metaphor visualization. We release our dataset, our generation code and our models for use by the broader community.

pdf bib
Analysing Reference Production of Large Language Models
Chengzhao Wu | Guanyi Chen | Fahime Same | Tingting He

This study investigates how large language models (LLMs) produce referring expressions (REs) and to what extent their behaviour aligns with human patterns. We evaluate LLM performance in two settings: slot filling, the conventional task of referring expression generation, where REs are generated within a fixed context, and language generation, where REs are analysed within fully generated texts. Using the WebNLG corpus, we assess how well LLMs capture human variation in reference production and analyse their behaviour by examining the influence of several factors known to affect human reference production, including referential form, syntactic position, recency, and discourse status. Our findings show that (1) task framing significantly affects LLMs’ reference production; (2) while LLMs are sensitive to some of these factors, their referential behaviour consistently diverges from human use; and (3) larger model size does not necessarily yield more human-like variation. These results underscore key limitations in current LLMs’ ability to replicate human referential choices.

pdf bib
Live Football Commentary (LFC): A Large‐Scale Dataset for Building Football Commentary Generation Models
Taiga Someya | Tatsuya Ishigaki | Hiroya Takamura

Live football commentary brings the atmosphere and excitement of matches to fans in real time, but producing it requires costly professional announcers. We address this challenge by formulating commentary generation from player- and ball-tracking coordinates as a new language-generation task. To facilitate research on this problem we compile the Live Football Commentary (LFC) dataset, 12,440 time-stamped Japanese utterances aligned with tracking data for 40 J1 League matches (approx. 60 h). We benchmark three LLM-based baselines that receive the tracking data (i) as plain text, (ii) as pitch-map images, or (iii) in both modalities. Human evaluation shows that the text encoding already outperforms image and multimodal variants in both accuracy and relevance, indicating that current LLMs exploit structured coordinates more effectively than raw visuals. We release the LFC transcripts and evaluation code to establish a public test bed and spur future work on tracking-based commentary generation, saliency detection, and cross-modal integration.

pdf bib
Exploring the Power of Large Language Models for Vietnamese Implicit Sentiment Analysis
Huy Gia Luu | Dang Van Thin

We present the first benchmark for implicit sentiment analysis (ISA) in Vietnamese, aimed at evaluating large language models (LLMs) on their ability to interpret implicit sentiment, accompanied by ViISA, a dataset specifically constructed for this task. We assess a variety of open-source and closed-source LLMs using state-of-the-art (SOTA) prompting techniques. While LLMs achieve strong recall, they often misclassify implicit cues such as sarcasm and exaggeration, resulting in low precision. Through detailed error analysis, we highlight key challenges and suggest improvements to Chain-of-Thought prompting via more contextually aligned demonstrations.

pdf bib
Towards Trustworthy Lexical Simplification: Exploring Safety and Efficiency with Small LLMs
Akio Hayakawa | Stefan Bott | Horacio Saggion

Despite their strong performance, large language models (LLMs) face challenges in real-world application of lexical simplification (LS), particularly in privacy-sensitive and resource-constrained environments. Moreover, since vulnerable user groups (e.g., people with disabilities) are one of the key target groups of this technology, it is crucial to ensure the safety and correctness of the output of LS systems. To address these issues, we propose an efficient framework for LS systems that utilizes small LLMs deployable in local environments. Within this framework, we explore knowledge distillation with synthesized data and in-context learning as baselines. Our experiments in five languages evaluate model outputs both automatically and manually. Our manual analysis reveals that while knowledge distillation boosts automatic metric scores, it also introduces a safety trade-off by increasing harmful simplifications. Importantly, we find that the model’s output probability is a useful signal for detecting harmful simplifications. Leveraging this, we propose a filtering strategy that suppresses harmful simplifications while largely preserving beneficial ones. This work establishes a benchmark for efficient and safe LS with small LLMs. It highlights the key trade-offs between performance, efficiency, and safety, and demonstrates a promising approach for safe real-world deployment.

pdf bib
Evaluating LLMs’ Ability to Understand Numerical Time Series for Text Generation
Mizuki Arai | Tatsuya Ishigaki | Masayuki Kawarada | Yusuke Miyao | Hiroya Takamura | Ichiro Kobayashi

Data-to-text generation tasks often involve processing numerical time series as input, such as financial statistics or meteorological data. Although large language models (LLMs) are a powerful approach to data-to-text, we still lack a comprehensive understanding of how well they actually understand time-series data. We therefore introduce a benchmark with 18 evaluation tasks to assess LLMs’ ability to interpret numerical time series, categorized into: 1) event detection—identifying maxima and minima; 2) computation—averaging and summation; 3) pairwise comparison—comparing values over time; and 4) inference—imputation and forecasting. Our experiments reveal five key findings: 1) even state-of-the-art LLMs struggle with complex multi-step reasoning; 2) tasks that require extracting values or performing computations within a specified range of the time series significantly reduce accuracy; 3) instruction tuning offers inconsistent improvements for numerical interpretation; 4) reasoning-based models outperform standard LLMs in complex numerical tasks; and 5) LLMs perform interpolation better than forecasting. These results establish a clear baseline and serve as a wake-up call for anyone aiming to blend fluent language with trustworthy numeric precision in time-series scenarios.
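
To make the task categories concrete, here is a minimal sketch (an assumption, not the benchmark's actual construction code) of how event-detection, computation, and pairwise-comparison probes with programmatically computed gold answers can be derived from a numeric series.

```python
# Illustrative sketch: simple probes over a numeric time series, with gold
# answers computed directly from the data rather than taken from the paper.
series = [102.4, 101.9, 103.7, 105.2, 104.8, 106.1]  # e.g. daily closing prices

probes = [
    ("On which index does the series reach its maximum?",
     series.index(max(series))),
    ("What is the average of the series, rounded to two decimals?",
     round(sum(series) / len(series), 2)),
    ("Is the value at index 3 greater than the value at index 5? (yes/no)",
     "yes" if series[3] > series[5] else "no"),
]

for question, gold in probes:
    # An LLM's answer to `question` (given the series in the prompt) would be
    # compared against `gold` here.
    print(question, "->", gold)
```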

pdf bib
Can GPT models Follow Human Summarization Guidelines? A Study for Targeted Communication Goals
Yongxin Zhou | Fabien Ringeval | François Portet

This study investigates the ability of GPT models (ChatGPT, GPT-4 and GPT-4o) to generate dialogue summaries that adhere to human guidelines. Our evaluation involved experimenting with various prompts to guide the models in complying with guidelines on two datasets: DialogSum (English social conversations) and DECODA (French call center interactions). Human evaluation, based on summarization guidelines, served as the primary assessment method, complemented by extensive quantitative and qualitative analyses. Our findings reveal a preference for GPT-generated summaries over those from task-specific pre-trained models and reference summaries, highlighting GPT models’ ability to follow human guidelines despite occasionally producing longer outputs and exhibiting divergent lexical and structural alignment with references. The discrepancy between ROUGE, BERTScore, and human evaluation underscores the need for more reliable automatic evaluation metrics.

pdf bib
References Matter: Investigating the Impact of Reference Set Variation on Summarization Evaluation
Silvia Casola | Yang Janet Liu | Siyao Peng | Oliver Kraus | Albert Gatt | Barbara Plank

Human language production exhibits remarkable richness and variation, reflecting diverse communication styles and intents. However, this variation is often overlooked in summarization evaluation. While having multiple reference summaries is known to improve correlation with human judgments, the impact of the reference set on reference-based metrics has not been systematically investigated. This work examines the sensitivity of widely used reference-based metrics in relation to the choice of reference sets, analyzing three diverse multi-reference summarization datasets: SummEval, GUMSum, and DUC2004. We demonstrate that many popular metrics exhibit significant instability. This instability is particularly concerning for n-gram-based metrics like ROUGE, where model rankings vary depending on the reference sets, undermining the reliability of model comparisons. We also collect human judgments on LLM outputs for genre-diverse data and examine their correlation with metrics to supplement existing findings beyond newswire summaries, finding weak-to-no correlation. Taken together, we recommend incorporating reference set variation into summarization evaluation to enhance consistency alongside correlation with human judgments, especially when evaluating LLMs.

pdf bib
OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs
Ivan Kartac | Mateusz Lango | Ondrej Dusek

Large Language Models (LLMs) have demonstrated great potential as evaluators of NLG systems, allowing for high-quality, reference-free, and multi-aspect assessments. However, existing LLM-based metrics suffer from two major drawbacks: reliance on proprietary models to generate training data or perform evaluations, and a lack of fine-grained, explanatory feedback. We introduce OpeNLGauge, a fully open-source, reference-free NLG evaluation metric that provides accurate explanations based on individual error spans. OpeNLGauge is available as a two-stage ensemble of larger open-weight LLMs, or as a small fine-tuned evaluation model, with confirmed generalizability to unseen tasks, domains and aspects. Our extensive meta-evaluation shows that OpeNLGauge achieves competitive correlation with human judgments, outperforming state-of-the-art models on certain tasks while maintaining full reproducibility and providing explanations more than twice as accurate.

pdf bib
Statistical Multicriteria Evaluation of LLM-Generated Text
Esteban Garces Arias | Hannah Blocher | Julian Rodemann | Matthias Assenmacher | Christoph Jansen

Assessing the quality of LLM-generated text remains a fundamental challenge in natural language processing. Current evaluation approaches often rely on isolated metrics or simplistic aggregations that fail to capture the nuanced trade-offs between coherence, diversity, fluency, and other relevant indicators of text quality. In this work, we adapt a recently proposed framework for statistical inference based on Generalized Stochastic Dominance (GSD) that addresses three critical limitations in existing benchmarking methodologies: the inadequacy of single-metric evaluation, the incompatibility between cardinal automatic metrics and ordinal human judgments, and the lack of inferential statistical guarantees. The GSD-front approach enables simultaneous evaluation across multiple quality dimensions while respecting their different measurement scales, building upon partial orders of decoding strategies, thus avoiding arbitrary weighting of the involved metrics. By applying this framework to evaluate common decoding strategies against human-generated text, we demonstrate its ability to identify statistically significant performance differences while accounting for potential deviations from the i.i.d. assumption of the sampling design.

pdf bib
Incorporating Formulaicness in the Automatic Evaluation of Naturalness: A Case Study in Logic-to-Text Generation
Eduardo Calò | Guanyi Chen | Elias Stengel-Eskin | Albert Gatt | Kees van Deemter

Data-to-text natural language generation (NLG) models may produce outputs that closely mirror the structure of their input. We introduce formulaicness as a measure of the output-to-input structural resemblance, proposing it as an enhancement for reference-less naturalness evaluation. Focusing on logic-to-text generation, we construct a dataset and train a regressor to predict formulaicness scores. We collect human judgments on naturalness and examine how incorporating formulaicness into existing metrics affects alignment with these judgments.

pdf bib
Surprisal reveals diversity gaps in image captioning and different scorers change the story
Nikolai Ilinykh | Simon Dobnik

We quantify linguistic diversity in image captioning with surprisal variance – the spread of token-level negative log-probabilities within a caption set. On the MSCOCO test set, we compare five state-of-the-art vision-and-language LLMs, decoded with greedy and nucleus sampling, to human captions. Measured with a caption-trained n-gram LM, humans display roughly twice the surprisal variance of models, but rescoring the same captions with a general-language model reverses the pattern. Our analysis introduces the surprisal-based diversity metric for image captioning. We show that relying on a single scorer can completely invert conclusions; thus, robust diversity evaluation must report surprisal under several scorers.
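
The sketch below illustrates the surprisal-variance idea in its simplest form, using an add-one-smoothed unigram scorer as a stand-in for the caption-trained n-gram LM; it is an assumption-laden toy example, not the authors' implementation.

```python
# Illustrative sketch: token-level surprisal under a caption-trained unigram
# scorer, and the variance of those surprisals within a caption set.
import math
from collections import Counter
from statistics import pvariance

def train_unigram(captions: list) -> dict:
    counts = Counter(tok for c in captions for tok in c.lower().split())
    # Store totals alongside counts; add-one smoothing handles unseen tokens.
    return {"_total": sum(counts.values()), "_vocab": len(counts), **counts}

def surprisal(token: str, lm: dict) -> float:
    p = (lm.get(token, 0) + 1) / (lm["_total"] + lm["_vocab"] + 1)
    return -math.log(p)

def surprisal_variance(caption_set: list, lm: dict) -> float:
    values = [surprisal(t, lm) for c in caption_set for t in c.lower().split()]
    return pvariance(values)

train = ["a dog runs on the beach", "a man rides a bike", "two cats on a sofa"]
lm = train_unigram(train)
print(surprisal_variance(["a dog on the beach", "an old hound sprints by surf"], lm))
```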

pdf bib
Natural Language Translation of Formal Proofs through Informalization of Proof Steps and Recursive Summarization along Proof Structure
Seiji Hattori | Takuya Matsuzaki | Makoto Fujiwara

This paper proposes a natural language translation method for machine-verifiable formal proofs that leverages the informalization (verbalization of formal language proof steps) and summarization capabilities of LLMs. For evaluation, it was applied to formal proof data created in accordance with natural language proofs taken from an undergraduate-level textbook, and the quality of the generated natural language proofs was analyzed in comparison with the original natural language proofs. Furthermore, we demonstrate that this method can output highly readable and accurate natural language proofs by applying it to an existing formal proof library of the Lean proof assistant.

pdf bib
German4All – A Dataset and Model for Readability-Controlled Paraphrasing in German
Miriam Anschütz | Thanh Mai Pham | Eslam Nasrallah | Maximilian Müller | Cristian-George Craciun | Georg Groh

The ability to paraphrase texts across different complexity levels is essential for creating accessible texts that can be tailored toward diverse reader groups. Thus, we introduce German4All, the first large-scale German dataset of aligned readability-controlled, paragraph-level paraphrases. It spans five readability levels and comprises over 25,000 samples. The dataset is automatically synthesized using GPT-4 and rigorously evaluated through both human and LLM-based judgments. Using German4All, we train an open-source, readability-controlled paraphrasing model that achieves state-of-the-art performance in German text simplification, enabling more nuanced and reader-specific adaptations. We open-source both the dataset and the model to encourage further research on multi-level paraphrasing.

pdf bib
Enhancing Named Entity Translation from Classical Chinese to Vietnamese in Traditional Vietnamese Medicine Domain: A Hybrid Masking and Dictionary-Augmented Approach
Nhu Vo Quynh Pham | Uyen Bao Nguyen Phuc | Long Hong Buu Nguyen | Dien Dinh

Vietnam’s traditional medical texts were historically written in Classical Chinese using Sino-Vietnamese pronunciations. As the Vietnamese language transitioned to a Latin-based national script and interest in integrating traditional medicine with modern healthcare grows, accurate translation of these texts has become increasingly important. However, the diversity of terminology and the complexity of translating medical entities into modern contexts pose significant challenges. To address this, we propose a method that fine-tunes large language models (LLMs) using augmented data and a Hybrid Entity Masking and Replacement (HEMR) strategy to improve named entity translation. We also introduce a parallel named entity translation dataset specifically curated for traditional Vietnamese medicine. Our evaluation across multiple LLMs shows that the proposed approach achieves a translation accuracy of 71.91%, demonstrating its effectiveness. These results underscore the importance of incorporating named entity awareness into translation systems, particularly in low-resource and domain-specific settings like traditional Vietnamese medicine.

pdf bib
Fine-Tuning, Prompting and RAG for Knowledge Graph-to-Russian Text Generation. How do these Methods generalise to Out-of-Distribution Data?
Anna Nikiforovskaya | William Eduardo Soto Martinez | Evan Parker Kelly Chapple | Claire Gardent

Prior work on Knowledge Graph-to-Text generation has mostly evaluated models on in-domain test sets and/or with English as the target language. In contrast, we focus on Russian and we assess how various generation methods perform on out-of-domain, unseen data. Previous studies have shown that enriching the input with target-language verbalisations of entities and properties substantially improves the performance of fine-tuned models for Russian. We compare multiple variants of two contemporary paradigms — LLM prompting and Retrieval-Augmented Generation (RAG) — and investigate alternative ways to integrate such external knowledge into the generation process. Using automatic metrics and human evaluation, we find that on unseen data the fine-tuned model consistently underperforms, revealing limited generalisation capacity; that prompting, while it outperforms RAG by a small margin on most datasets, generates less fluent text; and, conversely, that RAG generates text that is less faithful to the input. Overall, both LLM prompting and RAG outperform fine-tuning across all unseen test sets. The code for this paper is available at https://github.com/Javanochka/KG-to-text-fine-tuning-prompting-rag

pdf bib
FinStat2SQL: A Text2SQL Pipeline for Financial Statement Analysis
Hung Quang Nguyen | Anh Phuong Trinh | Hung Phan Quoc Mai | Phong Tuan Trinh

Despite recent advances in LLMs, Text2SQL remains challenging for complex, domain-specific queries such as finance, where database designs and reporting standards vary widely. We introduce FinStat2SQL, a lightweight pipeline enabling natural language queries over financial statements, tailored to local standards like VAS. Our multi-agent setup combines large and small language models for entity extraction, SQL generation, and self-correction, and includes a fully automatic pipeline for synthetic data generation. Leveraging this synthetic data, we fine-tuned a 7B model that achieves 61.33% accuracy with sub-4s latency on consumer hardware, outperforming GPT-4o-mini on SQL generation. FinStat2SQL provides a scalable, cost-efficient solution for financial analysis. We made our source code publicly available at: https://github.com/hung20gg/chatbot_financial_statement.

pdf bib
From Prototypical to Relational: How LLMs Navigate Complex Analogies
Mayukh Das | Wolf-Tilo Balke

We introduce a comprehensive benchmark to assess the analogical reasoning capabilities of large language models (LLMs) on complex analogy tasks that go beyond conventional formats with single correct answers. Unlike standard benchmarks that assume a singular ground truth, our framework presents a four-way multiple-choice analogy task in which all target options are semantically plausible. Leveraging concept pairs from Wikidata and AnalogyKB, we construct analogy instances enriched with multiple overlapping relational structures, where the relations are mined with RAG and ranked in salience through a GPT-4-assisted Max-Diff survey. To enable systematic evaluation, we propose three complementary semantic measures, i.e., ranked relational overlap, context embedding similarity, and prototypicality, each grounded in established literature on analogical reasoning. Our experiments span a range of LLMs, evaluated under zero-shot, few-shot, and knowledge-enhanced prompting conditions. While models such as GPT-4 perform well on embedding-based and prototypicality-based measures, they consistently underperform when tasked with capturing fine-grained relational mappings. These results reveal that, despite their impressive surface-level semantic fluency, current LLMs exhibit notable limitations in structured relational reasoning.

pdf bib
Automated and Context-Aware Code Documentation Leveraging Advanced LLMs
Swapnil Sharma Sarker | Tanzina Taher Ifty

Code documentation is essential to improve software maintainability and comprehension. The tedious nature of manual code documentation has led to much research on automated documentation generation. Existing automated approaches primarily focused on code summarization, leaving a gap in template-based documentation generation (e.g., Javadoc), particularly with publicly available Large Language Models (LLMs). Furthermore, progress in this area has been hindered by the lack of a Javadoc-specific dataset that incorporates modern language features, provides broad framework/library coverage, and includes necessary contextual information. This study aims to address these gaps by developing a tailored dataset and assessing the capabilities of publicly available LLMs for context-aware, template-based Javadoc generation. In this work, we present a novel, context-aware dataset for Javadoc generation that includes critical structural and semantic information from modern Java codebases. We evaluate five open-source LLMs (including LLaMA-3.1, Gemma-2, Phi-3, Mistral, Qwen-2.5) using zero-shot, few-shot, and fine-tuned setups and provide a comparative analysis of their performance. Our results demonstrate that LLaMA 3.1 performs consistently well and is a reliable candidate for practical, automated Javadoc generation, offering a viable alternative to proprietary systems.

pdf bib
LogitRouter: a novel Attention variant for reducing Myopic Routing in Mixture of Experts
Felipe Rodriguez | Marcelo Mendoza

Mixture of Experts (MoEs) have emerged as strong alternatives to traditional transformers, offering significant advantages in terms of training and inference efficiency. At the core of this architecture lies the router, responsible for selecting which experts are activated for each token. However, despite these advances, routing mechanisms continue to face stability challenges that the basic architecture fails to fully address. One such issue is Myopic Routing, where each token determines its route independently, without considering the routing decisions made for other tokens. To address this limitation, the LogitAttention mechanism is introduced—a variant of traditional attention—and, building upon it, the LogitRouter, a novel routing architecture that incorporates contextual information about the routing of other tokens. Due to budget constraints, a set of simple experiments is designed to obtain preliminary evidence of performance trends. These experiments are empirically validated on established benchmarks such as BoolQ, MMLU, and ARC. Finally, the work concludes with an in-depth discussion of architectural variants, applicability, limitations, and future directions, which aims to support continued research in this area.

pdf bib
Restaurant Menu Categorization at Scale: LLM-Guided Hybrid Clustering
Seemab Latif | Ashar Mehmood | Selim Turki | Huma Ameer | Ivan Gorban | Faysal Fateh

Inconsistent naming of menu items across merchants presents a major challenge for businesses that rely on large-scale menu item catalogs. It hinders downstream tasks like pricing analysis, menu item deduplication, and recommendations. To address this, we propose the Cross-Platform Semantic Alignment Framework (CPSAF), a hybrid approach that integrates DBSCAN-based clustering with SIGMA (Semantic Item Grouping and Menu Abstraction), a Large Language Model based refinement module. SIGMA employs in-context learning with a large language model to generate generic menu item names and categories. We evaluate our framework on a proprietary dataset comprising over 700,000 unique menu items. Experiments involve tuning DBSCAN parameters and applying SIGMA to refine clusters. Performance is assessed using both structural metrics (cluster count, coverage) and semantic metrics (intra- and inter-cluster similarity), along with manual qualitative inspection. CPSAF improves intra-cluster similarity from 0.88 to 0.98 and reduces singleton clusters by 33%, demonstrating its effectiveness in recovering soft semantic drift.
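
The following is a minimal sketch of the clustering stage and the intra-cluster similarity check named above (DBSCAN over item embeddings with a cosine metric); the parameters and random embeddings are placeholders, and the SIGMA refinement step is not shown.

```python
# Illustrative sketch, not CPSAF itself: DBSCAN over menu-item embeddings and
# mean intra-cluster cosine similarity as a simple semantic quality check.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_similarity

def cluster_items(embeddings, eps=0.15, min_samples=2):
    return DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(embeddings)

def mean_intra_cluster_similarity(embeddings, labels) -> float:
    sims = []
    for lbl in set(labels) - {-1}:          # -1 marks DBSCAN noise points
        members = embeddings[labels == lbl]
        if len(members) > 1:
            sim = cosine_similarity(members)
            # Average over off-diagonal pairs only.
            sims.append((sim.sum() - len(members)) / (len(members) * (len(members) - 1)))
    return float(np.mean(sims)) if sims else 0.0

emb = np.random.rand(50, 384)               # stand-in for sentence embeddings
labels = cluster_items(emb)
print(mean_intra_cluster_similarity(emb, labels))
```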

pdf bib
Taming the Titans: A Survey of Efficient LLM Inference Serving
Ranran Zhen | Juntao Li | Yixin Ji | Zhenlin Yang | Tong Liu | Qingrong Xia | Xinyu Duan | Zhefeng Wang | Baoxing Huai | Min Zhang

Large Language Models (LLMs) for Generative AI have achieved remarkable progress, evolving into sophisticated and versatile tools widely adopted across various domains and applications. However, the substantial memory overhead caused by their vast number of parameters, combined with the high computational demands of the attention mechanism, poses significant challenges in achieving low latency and high throughput for LLM inference services. Recent advancements, driven by groundbreaking research, have significantly accelerated progress in this field. This paper provides a comprehensive survey of these methods, covering fundamental instance-level approaches, in-depth cluster-level strategies, and emerging scenarios. At the instance level, we review model placement, request scheduling, decoding length prediction, storage management, and the disaggregation paradigm. At the cluster level, we explore GPU cluster deployment, multi-instance load balancing, and cloud service solutions. Additionally, we discuss specific tasks, modules, and auxiliary methods in emerging scenarios. Finally, we outline potential research directions to further advance the field of LLM inference serving.

pdf bib
Are Multi-Agents the new Pipeline Architecture for Data-to-Text Systems?
Chinonso Cynthia Osuji | Brian Timoney | Mark Andrade | Thiago Castro Ferreira | Brian Davis

Large Language Models (LLMs) have achieved remarkable results in natural language generation, yet challenges remain in data-to-text (D2T) tasks, particularly in controlling output, ensuring transparency, and maintaining factual consistency with the input. We introduce the first LLM-based multi-agent framework for D2T generation, coordinating specialized agents to produce high-quality, interpretable outputs. Our system combines the reasoning and acting abilities of ReAct agents, the self-correction of Reflexion agents, and the quality assurance of Guardrail agents, all directed by an Orchestrator agent that assigns tasks to three specialists—content ordering, text structuring, and surface realization—and iteratively refines outputs based on Guardrail feedback. This closed-loop design enables precise control and dynamic optimization, yielding text that is coherent, accurate, and grounded in the input data. On a relatively simple dataset like WebNLG, our framework performs competitively with end-to-end systems, highlighting its promise for more complex D2T scenarios.

pdf bib
Generating Impact and Critique Explanations of Predictions made by a Goal Recognizer
Jair da Silva Ferreira Junior | Ingrid Zukerman | Enes Makalic | Cecile L. Paris | Mor Vered

In this paper, we generate two types of explanations, Impact and Critique, of predictions made by a Goal Recognizer (GR) – a system that infers agents’ goals from observations. Impact explanations describe the top-predicted goal(s) and the main observations that led to these predictions. Critique explanations augment these explanations with evidence that challenges the GR’s predictions if so warranted. Our user study compares users’ goal-recognition accuracy for Impact and Critique explanations, and users’ views about these explanations, under three prediction-correctness conditions: correct, partially correct and incorrect. Our results show that (1) users stick with a GR’s predictions, even when a Critique explanation highlights its flaws; yet (2) Critique explanations are deemed better than Impact explanations in most respects.

pdf bib
PRICoT: Principle Retrieval and Injection from Inference Successes and Failures for CoT Improvement
Yudai Yamazaki | Naoto Takeda | Yasutaka Nishimura | Kazushi Ikeda

In-Context Learning (ICL) approaches, such as Zero-Shot and Few-Shot prompting, allow Large Language Models (LLMs) to tackle reasoning tasks without additional fine-tuning. However, Zero-Shot prompting often struggles with more complex tasks, whereas Few-Shot prompting demands considerable manual effort and domain expertise to design effective prompts. Although existing work has attempted to alleviate these issues by extracting reasoning rules from carefully crafted, task-specific representative examples, creating or obtaining such examples can be impractical in real-world scenarios. In this paper, we propose a novel approach that enhances the inference accuracy by injecting reasoning principles extracted from QA data, without relying on representative Few-Shot exemplars. This offers a lightweight yet adaptive way to boost accuracy on complex reasoning tasks, while avoiding manual effort and the high exploration costs typical of prior methods. Experiments on benchmarks show that, using GPT-4o, our method outperforms similarity-based Few-Shot and Zero-Shot prompting methods on challenging benchmarks such as GPQA-diamond, achieving an absolute accuracy improvement of up to 2% in scenarios where carefully crafted Few-Shot examples are unavailable.

pdf bib
Cognitive Flow: An LLM-Automated Framework for Quantifying Reasoning Distillation
José Matos | Catarina Silva | Hugo Goncalo Oliveira

The ability of large language models (LLMs) to reason effectively is crucial for a wide range of applications, from complex decision-making to scientific research. However, it remains unclear how well reasoning capabilities are transferred or preserved when LLMs undergo Knowledge Distillation (KD), a process that typically reduces model size while attempting to retain performance. In this study, we explore the effects of model distillation on the reasoning abilities of various reasoning language models (RLMs). We introduce Cognitive Flow, a novel framework that systematically extracts meaning and maps states in Chain-of-Thought (CoT) processes, offering new insights into model reasoning and enabling quantitative comparisons across RLMs. Using this framework, we investigate the impact of KD on CoTs produced by RLMs. We target DeepSeek-R1-671B and its distilled 70B, 32B and 14B versions, as well as QwQ-32B from the Qwen series. We evaluate the models on three subsets of mathematical reasoning tasks with varying complexity from the MMLU benchmark. Our findings demonstrate that while distillation can effectively replicate a similar reasoning style under specific conditions, it struggles with simpler problems, revealing a significant divergence in the observable thought process and a potential limitation in the transfer of a robust and adaptable problem-solving capability.

pdf bib
How (un)faithful are explainable LLM-based NLG metrics?
Alex Terentowicz | Mateusz Lango | Ondrej Dusek

Explainable NLG metrics are becoming a popular research topic; however, the faithfulness of the explanations they provide is typically not evaluated. In this work, we propose a testbed for assessing the faithfulness of span-based metrics by performing controlled perturbations of their explanations and observing changes in the final score. We show that several popular LLM evaluators do not consistently produce faithful explanations.

pdf bib
Counterfactual Simulatability of LLM Explanations for Generation Tasks
Marvin Limpijankit | Yanda Chen | Melanie Subbiah | Nicholas Deas | Kathleen McKeown

LLMs can be unpredictable, as even slight alterations to the prompt can cause the output to change in unexpected ways. Thus, the ability of models to accurately explain their behavior is critical, especially in high-stakes settings. Counterfactual simulatability measures how well an explanation allows users to infer the model’s output on related counterfactuals and has been previously studied for yes/no question answering. We provide a general framework for extending this method to generation tasks, using news summarization and medical suggestion as example use cases. We find that while LLM explanations do enable users to better predict their outputs on counterfactuals in the summarization setting, there is significant room for improvement for medical suggestion. Furthermore, our results suggest that evaluating counterfactual simulatability may be more appropriate for skill-based tasks as opposed to knowledge-based tasks.

pdf bib
SWI: Speaking with Intent in Large Language Models
Yuwei Yin | Eunjeong Hwang | Giuseppe Carenini

Intent, typically clearly formulated and planned, functions as a cognitive framework for communication and problem-solving. This paper introduces the concept of Speaking with Intent (SWI) in large language models (LLMs), where the explicitly generated intent encapsulates the model’s underlying intention and provides high-level planning to guide subsequent analysis and action. By emulating deliberate and purposeful thoughts in the human mind, SWI is hypothesized to enhance the reasoning capabilities and generation quality of LLMs. Extensive experiments on text summarization, multi-task question answering, and mathematical reasoning benchmarks consistently demonstrate the effectiveness and generalizability of Speaking with Intent over direct generation without explicit intent. Further analysis corroborates the generalizability of SWI under different experimental settings. Moreover, human evaluations verify the coherence, effectiveness, and interpretability of the intent produced by SWI. The promising results in enhancing LLMs with explicit intents pave a new avenue for boosting LLMs’ generation and reasoning abilities with cognitive notions.

pdf bib
Forecasting Conversation Derailments Through Generation
Yunfan Zhang | Kathleen McKeown | Smaranda Muresan

Forecasting conversation derailment can be useful in real-world settings such as online content moderation, conflict resolution, and business negotiations. However, despite language models’ success at identifying offensive speech present in conversations, they struggle to forecast future conversation derailments. In contrast to prior work that predicts conversation outcomes solely based on the past conversation history, our approach samples multiple future conversation trajectories conditioned on existing conversation history using a fine-tuned LLM. It predicts the conversation outcome based on the consensus of these trajectories. We also experimented with leveraging socio-linguistic attributes, which reflect turn-level conversation dynamics, as guidance when generating future conversations. Our method of sampling future conversation trajectories surpasses state-of-the-art results on English conversation derailment prediction benchmarks and demonstrates significant accuracy gains in ablation studies.
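
A minimal sketch of the trajectory-sampling-and-consensus idea follows; `sample_continuation` and `classify_derailment` are hypothetical stand-ins (here, trivial placeholders) for the fine-tuned generator and an outcome classifier, so this illustrates only the voting logic, not the paper's models.

```python
# Illustrative sketch: sample several future continuations of a conversation
# and take a majority vote on whether they derail.
import random
from collections import Counter

def sample_continuation(history: list) -> list:
    return history + ["<generated future turns>"]           # placeholder generator

def classify_derailment(conversation: list) -> bool:
    return random.random() < 0.5                             # placeholder classifier

def forecast_derailment(history: list, n_samples: int = 5) -> bool:
    votes = Counter(classify_derailment(sample_continuation(history))
                    for _ in range(n_samples))
    return votes[True] > votes[False]

print(forecast_derailment(["A: you're wrong", "B: calm down"]))
```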

pdf bib
Can LLMs Help Encoder Models Maintain Both High Accuracy and Consistency in Temporal Relation Classification?
Adiel Meir | Kfir Bar

Temporal relation classification (TRC) demands both accuracy and temporal consistency in event timeline extraction. Encoder-based models achieve high accuracy but introduce inconsistencies because they rely on pairwise classification, while LLMs leverage global context to generate temporal graphs, improving consistency at the cost of accuracy. We assess LLM prompting strategies for TRC and their effectiveness in assisting encoder models with cycle resolution. Results show that while LLMs improve consistency, they struggle with accuracy and do not outperform a simple confidence-based cycle resolution approach. Our code is publicly available at: https://github.com/MatufA/timeline-extraction.
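
For concreteness, the sketch below implements one plausible form of the confidence-based cycle resolution mentioned above: repeatedly removing the lowest-confidence edge from any remaining cycle in the pairwise temporal graph. It is an assumed reconstruction, not the code in the linked repository.

```python
# Illustrative sketch: drop the least-confident temporal edge from each cycle
# until the event graph returned by a pairwise classifier is acyclic.
import networkx as nx

def resolve_cycles(edges) -> nx.DiGraph:
    """edges: iterable of (event_before, event_after, classifier_confidence)."""
    g = nx.DiGraph()
    for src, dst, conf in edges:
        g.add_edge(src, dst, conf=conf)
    while True:
        try:
            cycle = nx.find_cycle(g)
        except nx.NetworkXNoCycle:
            return g
        # Remove the least-confident edge participating in the detected cycle.
        u, v = min(cycle, key=lambda e: g.edges[e[0], e[1]]["conf"])[:2]
        g.remove_edge(u, v)

g = resolve_cycles([("e1", "e2", 0.9), ("e2", "e3", 0.8), ("e3", "e1", 0.4)])
print(list(g.edges))  # the low-confidence e3->e1 edge is removed
```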

pdf bib
Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents
Fuyu Xing | Zimu Wang | Wei Wang | Haiyang Zhang

The proliferation of multimedia content necessitates the development of effective Multimedia Event Extraction (M²E²) systems. Though Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility in the M²E² task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M²E² dataset. Our evaluations cover text-only, image-only, and cross-media subtasks, assessed under both few-shot prompting and fine-tuning settings. Our key findings highlight the following valuable insights: (1) Few-shot LVLMs perform notably better on visual tasks but struggle significantly with textual tasks; (2) Fine-tuning LVLMs with LoRA substantially enhances model performance; and (3) LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis to reveal persistent challenges in areas such as semantic precision, localization, and cross-modal grounding, which remain critical obstacles for advancing M²E² capabilities.

pdf bib
QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback
Taku Mikuriya | Tatsuya Ishigaki | Masayuki Kawarada | Shunya Minami | Tadashi Kadowaki | Yohichi Suzuki | Soshun Naito | Shunya Takada | Takumi Kato | Tamotsu Basseda | Reo Yamada | Hiroya Takamura

Large language models (LLMs) have increasingly been applied to automatic programming code generation. This task can be viewed as a language generation task that bridges natural language, human knowledge, and programming logic. However, it remains underexplored in domains that require interaction with hardware devices, such as quantum programming, where human coders write Python code that is executed on a quantum computer. To address this gap, we introduce QCoder Benchmark, an evaluation framework that assesses LLMs on quantum programming with feedback from simulated hardware devices. Our benchmark offers two key features. First, it supports evaluation using a quantum simulator environment beyond conventional Python execution, allowing feedback of domain-specific metrics such as circuit depth, execution time, and error classification, which can be used to guide better generation. Second, it incorporates human-written code submissions collected from real programming contests, enabling both quantitative comparisons and qualitative analyses of LLM outputs against human-written codes. Our experiments reveal that even advanced models like GPT-4o achieve only around 18.97% accuracy, highlighting the difficulty of the benchmark. In contrast, reasoning-based models such as o3 reach up to 78% accuracy, outperforming averaged success rates of human-written codes (39.98%). We release the QCoder Benchmark dataset and public evaluation API to support further research.

pdf bib
When LLMs Can’t Help: Real-World Evaluation of LLMs in Nutrition
Karen Jia-Hui Li | Simone Balloccu | Ondrej Dusek | Ehud Reiter

The increasing trust in large language models (LLMs), especially in the form of chatbots, is often undermined by the lack of their extrinsic evaluation. This holds particularly true in nutrition, where randomised controlled trials (RCTs) are the gold standard, and experts demand them for evidence-based deployment. LLMs have shown promising results in this field, but these are limited to intrinsic setups. We address this gap by running the first RCT involving LLMs for nutrition. We augment a rule-based chatbot with two LLM-based features: (1) message rephrasing for conversational variety and engagement, and (2) nutritional counselling through a fine-tuned model. In our seven-week RCT (n=81), we compare chatbot variants with and without LLM integration. We measure effects on dietary outcome, emotional well-being, and engagement. Despite our LLM-based features performing well in intrinsic evaluation, we find that they did not yield consistent benefits in real-world deployment. These results highlight critical gaps between intrinsic evaluations and real-world impact, emphasising the need for interdisciplinary, human-centred approaches.

pdf bib
Who’s Laughing Now? An Overview of Computational Humour Generation and Explanation
Tyler Loakman | William Thorne | Chenghua Lin

The creation and perception of humour is a fundamental human trait, positioning its computational understanding as one of the most challenging tasks in natural language processing (NLP). As an abstract, creative, and frequently context-dependent construct, humour requires extensive reasoning to understand and create, making it a pertinent task for assessing the common-sense knowledge and reasoning abilities of modern large language models (LLMs). In this work, we survey the landscape of computational humour as it pertains to the generative tasks of creation and explanation. We observe that, despite the task of understanding humour bearing all the hallmarks of a foundational NLP task, work on generating and explaining humour beyond puns remains sparse, while state-of-the-art models continue to fall short of human capabilities. We bookend our literature survey by motivating the importance of computational humour processing as a subdiscipline of NLP and presenting an extensive discussion of future directions for research in the area that takes into account the subjective and ethically ambiguous nature of humour.

pdf bib
Input Matters: Evaluating Input Structure’s Impact on LLM Summaries of Sports Play-by-Play
Barkavi Sundararajan | Somayajulu Sripada | Ehud Reiter

A major concern when deploying LLMs in accuracy-critical domains such as sports reporting is that the generated text may not faithfully reflect the input data. We quantify how input structure affects hallucinations and other factual errors in LLM-generated summaries of NBA play-by-play data, across three formats: row-structured, JSON and unstructured. We manually annotated 3,312 factual errors across 180 game summaries produced by two models, Llama-3.1-70B and Qwen2.5-72B. Input structure has a strong effect: JSON input reduces error rates by 69% for Llama and 65% for Qwen compared to unstructured input, while row-structured input reduces errors by 54% for Llama and 51% for Qwen. A two-way repeated-measures ANOVA shows that input structure accounts for over 80% of the variance in error rates, with Tukey HSD post hoc tests confirming statistically significant differences between all input formats.
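
To make the compared input formats concrete, the sketch below renders one hypothetical play-by-play event as JSON, as a delimited row, and as unstructured prose; the field names are assumptions rather than the authors' exact schema.

```python
# Illustrative sketch: the same play-by-play event in the three input encodings
# compared in the study (JSON, row-structured, unstructured).
import json

event = {"clock": "10:42", "period": 1, "team": "BOS",
         "player": "Jayson Tatum", "action": "3PT jump shot", "made": True}

json_input = json.dumps(event, indent=2)
row_input = " | ".join(f"{k}={v}" for k, v in event.items())
unstructured_input = (f"{event['clock']} Q{event['period']}: {event['player']} "
                      f"({event['team']}) {'makes' if event['made'] else 'misses'} "
                      f"a {event['action']}.")

for name, text in [("JSON", json_input), ("row", row_input),
                   ("unstructured", unstructured_input)]:
    print(f"--- {name} ---\n{text}\n")
```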

pdf bib
Scaling Up Data-to-Text Generation to Longer Sequences: A New Dataset and Benchmark Results for Generation from Large Triple Sets
Chinonso Cynthia Osuji | Simon Mille | Ornait O’Connell | Thiago Castro Ferreira | Anya Belz | Brian Davis

The ability of LLMs to write coherent, faithful long texts from structured data inputs remains relatively uncharted, in part because nearly all public data-to-text datasets contain only short input-output pairs. To address these gaps, we benchmark six LLMs, a rule‐based system and human-written texts on a new long-input dataset in English and Irish via LLM-based evaluation. We find substantial differences between models and languages.

pdf bib
Annotating Hallucinations in Question-Answering using Rewriting
Xu Liu | Guanyi Chen | Kees van Deemter | Tingting He

Hallucinations pose a persistent challenge in open-ended question answering (QA). Traditional annotation methods, such as span-labelling, suffer from inconsistency and limited coverage. In this paper, we propose a rewriting-based framework as a new perspective on hallucinations in open-ended QA. We report on an experiment in which annotators are instructed to rewrite LLM-generated answers directly to ensure factual accuracy, with edits automatically recorded. Using the Chinese portion of the Mu-SHROOM dataset, we conduct a controlled rewriting experiment, comparing fact-checking tools (Google vs. GPT-4o), and analysing how tool choice, annotator background, and question openness influence rewriting behaviour. We find that rewriting leads to more hallucinations being identified, with higher inter-annotator agreement, than span-labelling.
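
A minimal sketch of how rewriting-based annotation can record edits automatically is shown below, using word-level difflib opcodes; this is an illustrative assumption, not the annotation interface used in the paper.

```python
# Illustrative sketch: record the edits an annotator makes when rewriting an
# LLM-generated answer, using word-level opcodes from difflib.
import difflib

def record_edits(original: str, rewritten: str) -> list:
    matcher = difflib.SequenceMatcher(a=original.split(), b=rewritten.split())
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            edits.append({"op": op,
                          "original": " ".join(original.split()[i1:i2]),
                          "rewritten": " ".join(rewritten.split()[j1:j2])})
    return edits

answer = "The Eiffel Tower was completed in 1899 in Paris."
fixed = "The Eiffel Tower was completed in 1889 in Paris."
print(record_edits(answer, fixed))  # one 'replace' edit flagging the wrong year
```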

pdf bib
Effectiveness of Chain-of-Thought in Distilling Reasoning Capability from Large Language Models
Cong Thanh Do | Rama Sanand Doddipatla | Kate Knill

Chain-of-Thought (CoT) prompting is a widely used method to improve the reasoning capability of Large Language Models (LLMs). More recently, CoT has been leveraged in Knowledge Distillation (KD) to transfer reasoning capability from a larger LLM to a smaller one. This paper examines the role of CoT in distilling the reasoning capability from larger LLMs to smaller LLMs using white-box KD, analyzing its effectiveness in improving the performance of the distilled models for various natural language reasoning and understanding tasks. We conduct white-box KD experiments using LLMs from the Qwen and Llama2 families, employing CoT data from the CoT-Collection dataset. The distilled models are then evaluated on natural language reasoning and understanding tasks from the BIG-Bench-Hard (BBH) benchmark, which presents complex challenges for smaller LLMs. Experimental results demonstrate the role of CoT in improving white-box KD effectiveness, enabling the distilled models to achieve better average performance in natural language reasoning and understanding tasks from BBH.

pdf bib
Face the Facts! Evaluating RAG-based Pipelines for Professional Fact-Checking
Daniel Russo | Stefano Menini | Jacopo Staiano | Marco Guerini

Natural Language Processing and Generation systems have recently shown the potential to complement and streamline the costly and time-consuming job of professional fact-checkers. In this work, we lift several constraints of current state-of-the-art pipelines for automated fact-checking based on the Retrieval-Augmented Generation (RAG) paradigm. Our goal is to benchmark, following professional fact-checking practices, RAG-based methods for the generation of verdicts - i.e., short texts discussing the veracity of a claim - evaluating them on stylistically complex claims and heterogeneous, yet reliable, knowledge bases. Our findings show a complex landscape, where, for example, LLM-based retrievers outperform other retrieval techniques, though they still struggle with heterogeneous knowledge bases; larger models excel in verdict faithfulness, while smaller models provide better context adherence, with human evaluations favouring zero-shot and one-shot approaches for informativeness, and fine-tuned models for emotional alignment.