Do Nugget-Based Evaluation Patterns Generalize to List-QA?

MohammadJavad Ardestani, Ehsan Kamalloo, Davood Rafiei


Abstract
Evaluating long-form answers from retrieval-augmented generation (RAG) systems remains challenging: human evaluation is expensive, while automatic metrics must reliably capture answer completeness. The AutoNuggetizer framework addresses this by decomposing evaluation into atomic facts (nuggets) and using LLMs for both nugget creation and assignment. The original study validated this approach on open-ended TREC RAG queries; however, it remains unclear whether the same cost-quality tradeoffs hold for structurally different tasks. We reproduce AutoNuggetizer on seven RAG systems over the QAMPARI list-QA benchmark, where answers consist of discrete entities and omissions are more directly measurable. Our results directionally reproduce the original findings: fully automatic evaluation preserves run-level rankings, assignment-only automation yields stronger agreement than end-to-end automation, and LLM-based assignment is highly concordant with human labels while being modestly stricter. These findings support the use of AutoNuggetizer for comparative evaluation beyond open-ended RAG, while also identifying systematic biases in automatic nugget creation and assignment.
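As a rough illustration of the two quantities the abstract refers to, the sketch below scores an answer by the fraction of nuggets it covers and then measures how well two evaluators agree on the ordering of runs. It is not the AutoNuggetizer implementation: the function names, the unweighted recall score, and the choice of Kendall's tau as the rank-agreement measure are all assumptions made for illustration, and the toy labels are invented rather than taken from the paper.

```python
from itertools import combinations


def nugget_recall(assigned):
    """Fraction of nuggets judged as covered by one answer (hypothetical scoring rule)."""
    return sum(assigned) / len(assigned) if assigned else 0.0


def run_score(per_query_assignments):
    """Mean nugget recall of one run over all of its queries."""
    scores = [nugget_recall(a) for a in per_query_assignments]
    return sum(scores) / len(scores) if scores else 0.0


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length lists of run scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two runs")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Invented toy labels: for each run, one list of booleans per query,
# where True means the nugget was judged as covered by the answer.
human_labels = {
    "run_a": [[True, True, False], [True, False]],
    "run_b": [[True, False, False], [False, False]],
    "run_c": [[True, True, True], [True, True]],
}
llm_labels = {
    "run_a": [[True, False, False], [True, False]],
    "run_b": [[False, False, False], [False, False]],
    "run_c": [[True, True, True], [True, False]],
}

runs = sorted(human_labels)
human_scores = [run_score(human_labels[r]) for r in runs]
llm_scores = [run_score(llm_labels[r]) for r in runs]
print("run-level rank agreement:", kendall_tau(human_scores, llm_scores))
```

In this toy setup the automatic labels are stricter than the human ones, so absolute scores drop while the ordering of runs can still be preserved, which is the kind of pattern the abstract describes.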
Anthology ID:
2026.gem-main.84
Volume:
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
1071–1081
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.84/
Cite (ACL):
MohammadJavad Ardestani, Ehsan Kamalloo, and Davood Rafiei. 2026. Do Nugget-Based Evaluation Patterns Generalize to List-QA?. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 1071–1081, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Do Nugget-Based Evaluation Patterns Generalize to List-QA? (Ardestani et al., GEM 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.84.pdf