Michela Lorandi
Fixing paper assignments
Systematic reviews in medicine play a critical role in evidence-based decision-making by aggregating findings from multiple studies. A central bottleneck in automating this process is extracting numeric evidence and determining study-level conclusions for specific outcomes and comparisons. Prior work has framed this problem as a textual inference task by retrieving relevant content fragments and inferring conclusions from them. However, such approaches often rely on shallow textual cues and fail to capture the underlying numeric reasoning behind expert assessments. In this work, we conceptualise the problem as one of quantitative reasoning. Rather than inferring conclusions from surface text, we extract structured numerical evidence (e.g., event counts or standard deviations) and apply domain-knowledge-informed logic to derive outcome-specific conclusions. We develop a numeric reasoning system composed of a numeric data extraction model and an effect-estimate component, enabling more accurate and interpretable inference aligned with domain-expert principles. We train the numeric data extraction model using different strategies, including supervised fine-tuning (SFT) and reinforcement learning (RL) with a new value reward model. When evaluated on the CochraneForest benchmark, our best-performing approach, which uses RL to train a small-scale number extraction model, yields up to a 21% absolute improvement in F1 score over retrieval-based systems and outperforms general-purpose LLMs of over 400B parameters by up to 9%. Our results demonstrate the promise of reasoning-driven approaches for automating systematic evidence synthesis.
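To make the effect-estimate step above concrete, the sketch below shows the kind of domain logic such a component might apply; it is not the authors' system, just a minimal illustration of deriving a risk ratio and 95% confidence interval from extracted event counts for one dichotomous outcome, with all counts and the decision rule hypothetical.

```python
import math

def risk_ratio(events_tx, total_tx, events_ctl, total_ctl):
    """Risk ratio with a 95% CI for a single dichotomous outcome.

    Illustrative only: real systematic-review tooling applies continuity
    corrections, pooling across studies, and many additional checks.
    """
    rr = (events_tx / total_tx) / (events_ctl / total_ctl)
    # Standard error of log(RR) for a single 2x2 table.
    se_log_rr = math.sqrt(
        1 / events_tx - 1 / total_tx + 1 / events_ctl - 1 / total_ctl
    )
    lo = math.exp(math.log(rr) - 1.96 * se_log_rr)
    hi = math.exp(math.log(rr) + 1.96 * se_log_rr)
    return rr, (lo, hi)

# Hypothetical extracted counts: 12/100 events in treatment, 24/98 in control.
rr, ci = risk_ratio(12, 100, 24, 98)
# A simple outcome-level conclusion rule: the CI excludes 1 -> evidence of an effect.
conclusion = (
    "favours treatment" if ci[1] < 1
    else ("favours control" if ci[0] > 1 else "inconclusive")
)
print(f"RR={rr:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f}) -> {conclusion}")
```

In practice, a system would also pool such estimates across studies (e.g. with a random-effects model) before drawing an outcome-level conclusion.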
In this paper, we present our reproduction of part of the human evaluation originally carried out by Gu et al. (2022), as part of Track B of ReproNLP 2025. Four human annotators were asked to rank two candidate summaries according to their overall quality, given a reference summary shown alongside the two candidate summaries at evaluation time. We describe the original experiment and provide details about the steps we followed to carry out the reproduction experiment, including the implementation of some missing pieces of code. Our results, in particular the high coefficients of variation and low inter-annotator agreement, suggest a low level of reproducibility in the original experiment despite identical pairwise ranks. However, given the very small sample size (two systems, one rating), we remain cautious about drawing definitive conclusions.
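As a worked illustration of the reproducibility measures mentioned above, the following sketch (our addition, not the code used in the reproduction) computes a small-sample-corrected coefficient of variation between an original score and its reproduction; the scores are hypothetical.

```python
import statistics

def corrected_cv(measurements):
    """Coefficient of variation (%) across repeated measurements of the same
    quantity (e.g. an original score and its reproduction), with the
    small-sample correction factor (1 + 1/(4n)) sometimes used in quantified
    reproducibility assessment. Illustrative only."""
    n = len(measurements)
    mean = statistics.mean(measurements)
    sd = statistics.stdev(measurements)  # unbiased sample standard deviation
    return (1 + 1 / (4 * n)) * (sd / mean) * 100

# Hypothetical example: an original system score of 0.62 and a reproduced score of 0.55.
print(f"CV* = {corrected_cv([0.62, 0.55]):.2f}")
```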
Ensuring semantic consistency between semantic-triple inputs and generated text is crucial in data-to-text generation, but continues to pose challenges both during generation and in evaluation. In order to assess how accurately semantic consistency can currently be assessed, we meta-evaluate 29 different evaluation methods in terms of their ability to predict human semantic-consistency ratings. The evaluation methods include embedding-based, overlap-based, and edit-distance metrics, as well as learned regressors and a prompted 'LLM-as-judge' protocol. We meta-evaluate on two datasets: the WebNLG 2017 human evaluation dataset, and a newly created WebNLG-style dataset that none of the methods can have seen during training. We find that none of the traditional textual similarity metrics or the pre-Transformer model-based metrics are suitable for the task of semantic consistency assessment. LLM-based methods perform well on the whole, but their best correlations with human judgments still lag behind those seen in other text generation tasks.
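A minimal sketch of the meta-evaluation setup described above, assuming per-output metric scores and human semantic-consistency ratings are available; the numbers below are hypothetical, and the sketch shows only how correlations with human judgments might be computed.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-output scores: human semantic-consistency ratings and one metric's scores.
human = [5.0, 3.5, 4.0, 2.0, 4.5, 1.5]
metric = [0.91, 0.64, 0.70, 0.35, 0.82, 0.40]

# Meta-evaluation of a single metric: how well does it track the human ratings?
r, _ = pearsonr(metric, human)
rho, _ = spearmanr(metric, human)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```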
Wikipedia has systematic gaps in its coverage of under-resourced languages as well as of under-represented groups (for example, women). This paper presents a new tool to support efforts to fill these gaps by automatically generating article stubs in English, French and Irish, and by facilitating post-editing and uploading to Wikipedia. A rule-based generator and an LLM are used to generate two alternative articles from user-selected DBpedia or Wikidata knowledge graphs, enabling the often more fluent but more error-prone LLM-generated article to be content-checked against the more reliable but less fluent rule-generated article. The tool's code is available at https://github.com/dcu-nlg/wiki-gen-demo and it is currently deployed at http://ec2-18-224-151-90.us-east-2.compute.amazonaws.com:3000/.
The performance of NLP methods for severely under-resourced languages cannot currently hope to match the state of the art in NLP methods for well-resourced languages. We explore the extent to which pretrained large language models (LLMs) can bridge this gap, via the example of data-to-text generation for Irish, Welsh, Breton and Maltese. We test LLMs on these under-resourced languages and English, in a range of scenarios. We find that LLMs easily set the state of the art for the under-resourced languages by substantial margins, as measured by both automatic and human evaluations. For all our languages, human evaluation shows performance on a par with humans for our best systems, but BLEU scores collapse compared to English, casting doubt on the metric’s suitability for evaluating non-task-specific systems. Overall, our results demonstrate the great potential of LLMs to bridge the performance gap for under-resourced languages.
Rerunning a metric-based evaluation should be more straightforward, and results should be closer, than in a human-based evaluation, especially where code and model checkpoints are made available by the original authors. However, as this brief report of our efforts to rerun a metric-based evaluation of a set of multi-aspect controllable text generation (CTG) techniques shows, such reruns of evaluations do not always produce results that are the same as the original results, and can reveal errors in the original work.
In this paper, we investigate how different semantic, or content-related, errors made by different types of data-to-text systems differ in terms of number and type. In total, we examine 15 systems: three rule-based and 12 neural systems, including two large language models used without training or fine-tuning. All systems were tested on the English WebNLG dataset version 3.0. We use a semantic error taxonomy and the brat annotation tool to obtain word-span error annotations on a sample of system outputs. The annotations enable us to establish how many semantic errors different (types of) systems make and what specific types of errors they make, and thus to get an overall understanding of semantic strengths and weaknesses among various types of NLG systems. Among our main findings, we observe that symbolic (rule- and template-based) systems make fewer semantic errors overall, non-LLM neural systems have better fluency and data coverage but make more semantic errors, while LLM-based systems require improvement particularly in addressing superfluous content.
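As an illustration of how counts per error type might be derived from such annotations, the sketch below parses brat standoff (.ann) files; the directory layout and error-type labels are hypothetical, not the paper's annotation files.

```python
from collections import Counter
from pathlib import Path

def count_error_types(ann_dir):
    """Count word-span error annotations per type from brat standoff (.ann) files.

    Assumes text-bound annotation lines of the form
    'T1<TAB>ErrorType start end<TAB>covered text'.
    """
    counts = Counter()
    for ann_file in Path(ann_dir).glob("*.ann"):
        for line in ann_file.read_text(encoding="utf-8").splitlines():
            if line.startswith("T"):
                error_type = line.split("\t")[1].split(" ")[0]
                counts[error_type] += 1
    return counts

# e.g. count_error_types("annotations/rule_based_system/")
```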
Wikipedia is known to have systematic gaps in its coverage that correspond to under-resourced languages as well as underrepresented groups. This paper presents a new tool to support efforts to fill in these gaps by automatically generating draft articles and facilitating post-editing and uploading to Wikipedia. A rule-based generator and an input-constrained LLM are used to generate two alternative articles, enabling the often more fluent, but error-prone, LLM-generated article to be content-checked against the more reliable, but less fluent, rule-generated article.
LLMs have been used with impressive success in various tasks, including data-to-text generation. However, one concern when LLMs are compared to alternative methods is data contamination: for many datasets, the data used in training these models may have included publicly available test sets. In this paper, we explore the performance of LLMs using newly constructed datasets in the context of data-to-text generation for English, Chinese, German, Russian, Spanish, Korean, Hindi, Swahili, and Arabic. We carried out a testing phase to evaluate a range of prompt types and a fine-tuning technique on Mistral 7B and Falcon 40B. We then fully evaluated the most promising system for each scenario: (i) LLM prompting in English followed by translation, and (ii) LLM PEFT-tuning in English followed by translation. We find that fine-tuning Mistral outperforms all other tested systems and achieves performance close to GPT-3.5. Among the prompting approaches, few-shot prompting with dynamic selection of examples achieves the best results. The human evaluation to be carried out by the shared-task organisers will provide further insight into performance on the new datasets. In conclusion, we observe that fine-tuning an open-source LLM can achieve performance close to a state-of-the-art closed-source LLM while using considerably fewer resources.
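For readers unfamiliar with PEFT-tuning, the sketch below shows what a LoRA-based setup for Mistral 7B could look like with the Hugging Face peft library; the hyperparameters and model identifier are illustrative assumptions, not the configuration used in the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Illustrative LoRA (PEFT) setup; hyperparameters are hypothetical.
base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)  # used to prepare training examples
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the low-rank adapter weights are trained
```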
The WebNLG task consists of mapping a knowledge graph to a text verbalising the content of that graph. The 2017 WebNLG edition required participating systems to generate English text from a set of DBpedia triples, while the 2020 WebNLG+ challenge additionally included generation into Russian and semantic parsing of English and Russian texts. In contrast, WebNLG 2023 focuses on four under-resourced languages which are severely under-represented in research on text generation, namely Breton, Irish, Maltese and Welsh. In addition, WebNLG 2023 once again includes Russian. In this paper, we present the organisation of the shared task (data, timeline, evaluation), briefly describe the participating systems and summarise their results.
LLMs are great at tasks involving English, which dominates in their training data. We explore their ability to address tasks involving languages that are severely under-represented in their training data. More specifically, we do this in the context of data-to-text generation for Irish, Maltese, Welsh and Breton. During the prompt-engineering phase we tested GPT-3.5 and 4 with a range of prompt types and formats on a small sample of example input/output pairs. We then fully evaluated the two most promising prompts in two scenarios: (i) direct generation into the under-resourced languages, and (ii) generation into English followed by translation into the under-resourced languages. We find that few-shot prompting works better for direct generation into under-resourced languages, but that the difference disappears when pivoting via English. The few-shot + translation system variants were submitted to the WebNLG 2023 shared task where they outperformed all other systems by substantial margins in all languages on all automatic metrics. We conclude that good performance can be achieved with state-of-the-art LLMs out of the box for under-resourced languages. However, best results (for Welsh) of BLEU 25.12, ChrF++ 0.55, and TER 0.64 are well below the lowest-ranked English system at WebNLG’20 with BLEU 0.391, ChrF++ 0.579, and TER 0.564.
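To illustrate the few-shot prompting setup described above, the sketch below assembles a prompt from in-context (triples, text) pairs; the wording, example data and helper function are hypothetical, not the exact prompts used in the paper.

```python
def build_fewshot_prompt(examples, input_triples, language="Irish"):
    """Assemble a few-shot prompt for triple-to-text generation.

    `examples` is a list of (triples, verbalisation) pairs; everything here is
    a hypothetical illustration of few-shot prompting.
    """
    parts = [f"Verbalise the following RDF triples as fluent {language} text.\n"]
    for triples, text in examples:
        parts.append("Triples: " + " ; ".join(" | ".join(t) for t in triples))
        parts.append(f"Text: {text}\n")
    parts.append("Triples: " + " ; ".join(" | ".join(t) for t in input_triples))
    parts.append("Text:")
    return "\n".join(parts)

# Hypothetical usage with one in-context example:
demo = [([("Alan_Turing", "birthPlace", "Maida_Vale")], "Rugadh Alan Turing i Maida Vale.")]
prompt = build_fewshot_prompt(demo, [("Dublin", "country", "Ireland")])
print(prompt)
```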
Recent advances in the development of large Pretrained Language Models, such as GPT, BERT and Bloom, have achieved remarkable performance on a wide range of different NLP tasks. However, when used for text generation tasks, these models still have limitations when it comes to controlling the content and style of the generated text, often producing content that is incorrect, irrelevant, or inappropriate in the context of a given task. In this survey paper, we explore methods for controllable text generation with a focus on sentiment control. We systematically collect papers from the ACL Anthology, create a categorisation scheme based on different control techniques and controlled attributes, and use the scheme to categorise and compare methods. The result is a detailed and comprehensive overview of state-of-the-art techniques for sentiment-controlled text generation, categorised on the basis of how the control is implemented and what attributes are controlled, providing a clear idea of their relative strengths and weaknesses.
Human Evaluation (HE) of automatically generated responses is necessary for the advancement of human-machine dialogue research. Current automatic evaluation measures are poor surrogates, at best. There are no agreed-upon HE protocols and it is difficult to develop them. As a result, researchers either perform non-replicable, non-transparent and inconsistent procedures or, worse, limit themselves to automated metrics. We propose to standardize the human evaluation of response generation models by publicly sharing a detailed protocol. The proposal includes the task design, annotator recruitment, task execution, and annotation reporting. Such a protocol and process can be used as-is, as-a-whole, in-part, or modified and extended by the research community. We validate the protocol by evaluating two conversationally fine-tuned state-of-the-art models (GPT-2 and T5) for the complex task of personalized response generation. We invite the community to use this protocol - or its future community-amended versions - as a transparent, replicable, and comparable approach to HE of generated responses.