MohammadJavad Ardestani

2026

Do Nugget-Based Evaluation Patterns Generalize to List-QA?
MohammadJavad Ardestani | Ehsan Kamalloo | Davood Rafiei
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)

Evaluating long-form answers from retrieval-augmented generation (RAG) systems remains challenging: human evaluation is expensive, while automatic metrics must reliably capture answer completeness. The AutoNuggetizer framework addresses this by decomposing evaluation into atomic facts (nuggets) and using LLMs for both nugget creation and assignment. The original study validated this approach on open-ended TREC RAG queries; however, it remains unclear whether the same cost-quality tradeoffs hold for structurally different tasks. We reproduce AutoNuggetizer on seven RAG systems over the QAMPARI list-QA benchmark, where answers consist of discrete entities and omissions are more directly measurable. Our results directionally reproduce the original findings: fully automatic evaluation preserves run-level rankings, assignment-only automation yields stronger agreement than end-to-end automation, and LLM-based assignment is highly concordant with human labels while being modestly stricter. These findings support the use of AutoNuggetizer for comparative evaluation beyond open-ended RAG, while also identifying systematic biases in automatic nugget creation and assignment.

Co-authors

Ehsan Kamalloo 1
Davood Rafiei 1

Venues

GEM 1
WS 1
