Can Large Language Models Accurately Generate Answer Keys for Health-related Questions?

Davis Bartels, Deepak Gupta, Dina Demner-Fushman


Abstract
The evaluation of text generated by LLMs remains a challenge for question answering, retrieval-augmented generation (RAG), summarization, and many other natural language processing tasks. Evaluating the factuality of LLM-generated responses is particularly important in medical question answering, where the stakes are high. One method of evaluating the factuality of text is through the use of information nuggets (answer keys). Nuggets are short texts, each representing an atomic fact, that an assessor can use to make a binary decision as to whether the fact is contained in an answer. Although manual nugget extraction is expensive and time-consuming, recent RAG shared task evaluations have explored automating the nuggetization of text with LLMs. In this work, we explore several approaches to nugget generation for medical question answering and evaluate their alignment with expert human nugget generation. We find that providing an example and extracting nuggets from an answer is the best approach to nuggetization. Overall, we found the ability of LLMs to distill atomic facts to be limited; Llama 3.3 performed best among the models we tested.
Anthology ID:
2025.acl-short.28
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
354–368
URL:
https://preview.aclanthology.org/landing_page/2025.acl-short.28/
Cite (ACL):
Davis Bartels, Deepak Gupta, and Dina Demner-Fushman. 2025. Can Large Language Models Accurately Generate Answer Keys for Health-related Questions? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 354–368, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Can Large Language Models Accurately Generate Answer Keys for Health-related Questions? (Bartels et al., ACL 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.acl-short.28.pdf