Alessandra Pascale
2026
FactCorrector: A Graph-Inspired Approach to Long-Form Factuality Correction of Large Language Models
Javier Carnerero-Cano | Massimiliano Pronesti | Radu Marinescu | Tigran T. Tchrakian | James Barry | Jasmina Gajcin | Yufang Hou | Alessandra Pascale | Elizabeth M. Daly
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) are widely used in knowledge-intensive applications but often generate factually incorrect responses. A promising approach to rectify these flaws is correcting LLMs using feedback. Therefore, in this paper, we introduce FactCorrector, a new post-hoc correction method that adapts across domains without retraining and leverages structured feedback about the factuality of the original response to generate a correction. To support rigorous evaluations of factuality correction methods, we also develop the VELI5 benchmark, a novel dataset containing systematically injected factual errors and ground-truth corrections. Experiments on VELI5 and several popular long-form factuality datasets show that the FactCorrector approach significantly improves factual precision while preserving relevance, outperforming strong baselines. We release our code at https://ibm.biz/factcorrector.
Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation
Adam Dejl | James Barry | Alessandra Pascale | Javier Carnerero-Cano
Findings of the Association for Computational Linguistics: ACL 2026
Despite demonstrating remarkable performance across a wide range of tasks, large language models (LLMs) have also been found to frequently produce outputs that are incomplete or selectively omit key information. In sensitive domains, such omissions can result in significant harm comparable to that posed by factual inaccuracies, including hallucinations. In this study, we address the challenge of evaluating the comprehensiveness of LLM-generated texts, focusing on the detection of missing information or underrepresented viewpoints. We investigate three automated evaluation metrics: (1) an NLI-based method that decomposes texts into atomic statements and uses natural language inference (NLI) to identify missing facts, (2) a Q&A-based metric that extracts question-answer pairs and compares responses across sources, and (3) an end-to-end approach that directly identifies missing content using LLMs. Our experiments demonstrate the surprising effectiveness of the simple end-to-end metric compared to more complex metrics, though at the cost of reduced robustness, interpretability and result granularity. We further assess the comprehensiveness of responses from several popular open-weight LLMs when answering user queries based on multiple sources.
2025
FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models
Radu Marinescu | Debarun Bhattacharjya | Junkyu Lee | Tigran T. Tchrakian | Javier Carnerero-Cano | Yufang Hou | Elizabeth M. Daly | Alessandra Pascale
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) have achieved remarkable success in generative tasks, yet they often fall short in ensuring the factual accuracy of their outputs, thus limiting their reliability in real-world applications where correctness is critical. In this paper, we present FactReasoner, a novel neuro-symbolic factuality assessment framework that employs probabilistic reasoning to evaluate the truthfulness of long-form generated responses. FactReasoner decomposes a response into atomic units, retrieves relevant contextual information from external knowledge sources, and models the logical relationships (e.g., entailment, contradiction) between these units and their contexts using probabilistic encodings. It then estimates the posterior probability that each atomic unit is supported by the retrieved evidence. Our experiments on both labeled and unlabeled benchmark datasets demonstrate that FactReasoner often outperforms state-of-the-art prompt-based methods in terms of factual precision and recall.
Query-driven Document-level Scientific Evidence Extraction from Biomedical Studies
Massimiliano Pronesti | Joao H Bettencourt-Silva | Paul Flanagan | Alessandra Pascale | Oisín Redmond | Anya Belz | Yufang Hou
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Extracting scientific evidence from biomedical studies for clinical research questions (e.g., Does stem cell transplantation improve quality of life in patients with medically refractory Crohn’s disease compared to placebo?) is a crucial step in synthesising biomedical evidence. In this paper, we focus on the task of document-level scientific evidence extraction for clinical questions with conflicting evidence. To support this task, we create a dataset called CochraneForest leveraging forest plots from Cochrane systematic reviews. It comprises 202 annotated forest plots, associated clinical research questions, full texts of studies, and study-specific conclusions. Building on CochraneForest, we propose URCA (Uniform Retrieval Clustered Augmentation), a retrieval-augmented generation framework designed to tackle the unique challenges of evidence extraction. Our experiments show that URCA outperforms the best existing methods by up to 10.3% in F1 score on this task. However, the results also underscore the complexity of CochraneForest, establishing it as a challenging testbed for advancing automated evidence synthesis systems.
2020
HBCP Corpus: A New Resource for the Analysis of Behavioural Change Intervention Reports
Francesca Bonin | Martin Gleize | Ailbhe Finnerty | Candice Moore | Charles Jochim | Emma Norris | Yufang Hou | Alison J. Wright | Debasis Ganguly | Emily Hayes | Silje Zink | Alessandra Pascale | Pol Mac Aonghusa | Susan Michie
Proceedings of the Twelfth Language Resources and Evaluation Conference
Due to the fast pace at which research reports in behaviour change are published, researchers, consultants and policymakers would benefit from more automatic ways to process these reports. Automatic extraction of the reports’ intervention content, population, settings and results is essential for synthesising and summarising the literature. However, to the best of our knowledge, no unique resource exists at the moment to facilitate this synthesis. In this paper, we describe the construction of a corpus of published behaviour change intervention evaluation reports aimed at smoking cessation. We also describe and release the annotation of 57 entities, which can be used as an off-the-shelf data resource for tasks such as entity recognition. Both the corpus and the annotation dataset are being made available to the community.
Co-authors
- Yufang Hou 4
- Javier Carnerero-Cano 3
- James Barry 2
- Elizabeth M. Daly 2
- Radu Marinescu 2
- Massimiliano Pronesti 2
- Tigran T. Tchrakian 2
- Anya Belz 1
- Joao H Bettencourt-Silva 1
- Debarun Bhattacharjya 1
- Francesca Bonin 1
- Adam Dejl 1
- Ailbhe Finnerty 1
- Paul Flanagan 1
- Jasmina Gajcin 1
- Debasis Ganguly 1
- Martin Gleize 1
- Emily Hayes 1
- Charles Jochim 1
- Junkyu Lee 1
- Pol Mac Aonghusa 1
- Susan Michie 1
- Candice Moore 1
- Emma Norris 1
- Oisín Redmond 1
- Alison J. Wright 1
- Silje Zink 1