Miriam Wanner
2025
Core: Robust Factual Precision with Informative Sub-Claim Identification
Zhengping Jiang | Jingyu Zhang | Nathaniel Weir | Seth Ebner | Miriam Wanner | Kate Sanders | Daniel Khashabi | Anqi Liu | Benjamin Van Durme
Findings of the Association for Computational Linguistics: ACL 2025
Hallucinations pose a challenge to the application of large language models (LLMs), motivating the development of metrics to evaluate factual precision. We observe that popular metrics using the Decompose-Then-Verify framework, such as FActScore, can be manipulated by adding obvious or repetitive subclaims to artificially inflate scores. This observation motivates our new customizable plug-and-play subclaim selection component, Core, which filters individual subclaims according to their uniqueness and informativeness. We show that many popular factual precision metrics augmented by Core are substantially more robust across a wide range of knowledge domains. We release an evaluation framework supporting easy and modular use of Core and various decomposition strategies, and we recommend its adoption by the community. We also release an expansion of the FActScore biography dataset to facilitate further studies of decomposition-based factual precision evaluation.
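To make the inflation problem concrete, below is a minimal Python sketch (not the authors' implementation) of a Decompose-Then-Verify precision score and a Core-style selection pass. The deduplication and informativeness heuristics here (SequenceMatcher similarity, a token-length floor) are illustrative stand-ins for Core's actual selection criteria, and the toy knowledge base stands in for a trusted reference corpus.

# Hedged sketch: shows how repetitive subclaims inflate a
# Decompose-Then-Verify precision score, and how a Core-style
# filter (deduplication + informativeness) guards against it.
from difflib import SequenceMatcher

def precision(subclaims, verify):
    """Fraction of subclaims judged supported -- the FActScore shape."""
    if not subclaims:
        return 0.0
    return sum(verify(c) for c in subclaims) / len(subclaims)

def core_style_filter(subclaims, sim_threshold=0.8, min_tokens=4):
    """Keep subclaims that are (a) not near-duplicates of an already
    kept claim and (b) long enough to plausibly be informative."""
    kept = []
    for claim in subclaims:
        if len(claim.split()) < min_tokens:
            continue  # drop trivially short, uninformative claims
        if any(SequenceMatcher(None, claim, k).ratio() > sim_threshold
               for k in kept):
            continue  # drop near-duplicates of a selected claim
        kept.append(claim)
    return kept

# Toy knowledge base standing in for a trusted reference.
SUPPORTED = {
    "Marie Curie won the Nobel Prize in Physics.",
    "Marie Curie was born in Warsaw.",
}
verify = lambda c: c in SUPPORTED

# An adversarial decomposition padded with a repeated, obviously true claim.
claims = [
    "Marie Curie won the Nobel Prize in Physics.",
    "Marie Curie was born in Warsaw.",
    "Marie Curie discovered oganesson.",  # unsupported
    "Marie Curie was born in Warsaw.",    # repetition inflates the score
    "Marie Curie was born in Warsaw.",
]

print(f"raw precision:      {precision(claims, verify):.2f}")  # 0.80
print(f"filtered precision: "
      f"{precision(core_style_filter(claims), verify):.2f}")   # 0.67

Dropping the padded repetitions removes the artificial boost: the unsupported claim carries its proper weight in the filtered score.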
2024
A Closer Look at Claim Decomposition
Miriam Wanner | Seth Ebner | Zhengping Jiang | Mark Dredze | Benjamin Van Durme
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)
As generated text becomes more commonplace, it is increasingly important to evaluate how well-supported such text is by external knowledge sources. Many approaches for evaluating textual support rely on some method for decomposing text into its individual subclaims which are scored against a trusted reference. We investigate how various methods of claim decomposition—especially LLM-based methods—affect the result of an evaluation approach such as the recently proposed FActScore, finding that it is sensitive to the decomposition method used. This sensitivity arises because such metrics attribute overall textual support to the model that generated the text even though error can also come from the metric’s decomposition step. To measure decomposition quality, we introduce an adaptation of FActScore, which we call DecompScore. We then propose an LLM-based approach to generating decompositions inspired by Bertrand Russell’s theory of logical atomism and neo-Davidsonian semantics and demonstrate its improved decomposition quality over previous methods.
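As a rough illustration of the DecompScore idea, the sketch below grades a decomposition by the fraction of its subclaims entailed by the original sentence. The entailment judge here is a toy word-containment check standing in for a real NLI model or LLM judge, and the exact scoring in the paper may differ (e.g., counting rather than averaging entailed subclaims).

# Hedged sketch of a DecompScore-style check: grade a decomposition
# by asking whether each generated subclaim is entailed by the
# original sentence. `entails` is a placeholder for an NLI model.
from typing import Callable, List

def decomp_score(original: str, subclaims: List[str],
                 entails: Callable[[str, str], bool]) -> float:
    """Fraction of subclaims entailed by the original sentence.
    A low score flags a decomposer that introduces content the
    source sentence never asserted."""
    if not subclaims:
        return 0.0
    return sum(entails(original, c) for c in subclaims) / len(subclaims)

# Toy entailment judge: word containment stands in for a real NLI model.
toy_entails = lambda premise, hypothesis: all(
    w in premise.lower() for w in hypothesis.lower().rstrip(".").split()
)

sentence = "Russell developed logical atomism in the early twentieth century."
good = ["Russell developed logical atomism."]
bad = ["Russell developed logical atomism.", "Russell taught at Harvard."]

print(decomp_score(sentence, good, toy_entails))  # 1.0
print(decomp_score(sentence, bad, toy_entails))   # 0.5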
2022
Revisiting the Effects of Leakage on Dependency Parsing
Nathaniel Krasner | Miriam Wanner | Antonios Anastasopoulos
Findings of the Association for Computational Linguistics: ACL 2022
Recent work by Søgaard (2020) showed that, treebank size aside, overlap between training and test graphs (termed leakage) explains more of the observed variation in dependency parsing performance than other explanations. In this work we revisit this claim, testing it on more models and languages. We find that it only holds for zero-shot cross-lingual settings. We then propose a more fine-grained measure of such leakage which, unlike the original measure, not only explains but also correlates with observed performance variation. Code and data are available here: https://github.com/miriamwanner/reu-nlp-project
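A minimal sketch of the basic leakage quantity discussed above: the share of test-set dependency trees whose delexicalized structure also appears in the training set. This illustrates the coarse overlap measure attributed to Søgaard (2020), not the finer-grained measure the paper proposes, and the tree encoding is an assumption made for illustration.

# Hedged sketch of graph leakage: the fraction of test dependency
# trees whose exact delexicalized structure was seen in training.
from typing import List, Tuple

# A tree is a tuple of (head_index, dependency_label) pairs, one per
# token, i.e. already delexicalized (no word forms).
Tree = Tuple[Tuple[int, str], ...]

def leakage(train: List[Tree], test: List[Tree]) -> float:
    """Fraction of test trees whose structure occurs in training."""
    seen = set(train)
    return sum(t in seen for t in test) / len(test)

train_trees = [
    ((2, "det"), (2, "nsubj"), (0, "root")),
    ((0, "root"), (1, "obj")),
]
test_trees = [
    ((2, "det"), (2, "nsubj"), (0, "root")),  # leaked: seen in training
    ((2, "nsubj"), (0, "root"), (2, "obj")),  # novel structure
]

print(f"leakage: {leakage(train_trees, test_trees):.2f}")  # 0.50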