SAGE: A Search-AuGmented Evaluation of Large Language Models on Free-Form QA

Sher Badshah; Ali Emami; Hassan Sajjad

Abstract

As Large Language Models (LLMs) become increasingly used for question-answering (QA), relying on static, pre-annotated references for evaluation poses significant challenges in cost, scalability, and completeness. Meanwhile, using LLMs themselves as evaluators without external grounding remains unreliable for objective tasks, as they systematically over-accept incorrect answers, fabricate supporting rationales, and degrade sharply on questions that fall outside their training data. We propose Search-AuGmented Evaluation (SAGE), a framework to assess LLM outputs without fixed ground-truth answers. Unlike conventional metrics that compare to static references or depend solely on LLM-as-a-judge knowledge, SAGE acts as an agent that actively retrieves and synthesizes external evidence. It iteratively generates web queries, collects information, summarizes findings, and refines subsequent searches through reflection. By reducing dependence on static reference-driven evaluation protocols, SAGE offers a scalable and adaptive alternative for evaluating the factuality of LLMs. Experimental results on multiple free-form QA benchmarks show that SAGE achieves substantial to perfect agreement with human evaluations.
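The loop described in the abstract (generate a query, search, summarize, reflect, repeat) can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the abstract does not specify the prompts, the search backend, the stopping rule, or how the final verdict is produced, so every function name, the `max_rounds` parameter, and the toy stand-ins for the LLM and search engine are assumptions made here for the sake of a runnable example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Verdict:
    correct: bool
    notes: List[str] = field(default_factory=list)    # summaries gathered per round
    queries: List[str] = field(default_factory=list)  # queries issued per round


def evaluate_answer(
    question: str,
    candidate: str,
    search: Callable[[str], List[str]],
    generate_query: Callable[[str, str, List[str]], str],
    summarize: Callable[[str, List[str]], str],
    reflect: Callable[[str, str, List[str]], bool],
    judge: Callable[[str, str, List[str]], bool],
    max_rounds: int = 3,  # hypothetical budget; not taken from the paper
) -> Verdict:
    """Judge a candidate answer without a fixed reference answer."""
    notes: List[str] = []
    queries: List[str] = []
    for _ in range(max_rounds):
        query = generate_query(question, candidate, notes)  # may refine on earlier notes
        queries.append(query)
        results = search(query)                             # collect external evidence
        notes.append(summarize(question, results))          # condense findings
        if reflect(question, candidate, notes):             # True once evidence suffices
            break
    return Verdict(judge(question, candidate, notes), notes, queries)


# --- Toy stand-ins (assumptions): a real system would call an LLM and a web search API.
CORPUS = {"capital of australia": ["Canberra is the capital city of Australia."]}


def toy_search(query: str) -> List[str]:
    return CORPUS.get(query.lower(), [])


def toy_generate_query(question: str, candidate: str, notes: List[str]) -> str:
    # First round uses a fixed query; later rounds fall back to question + candidate.
    return "capital of Australia" if not notes else f"{question} {candidate}"


def toy_summarize(question: str, results: List[str]) -> str:
    return " ".join(results)


def toy_reflect(question: str, candidate: str, notes: List[str]) -> bool:
    return any(notes)  # stop as soon as one non-empty summary exists


def toy_judge(question: str, candidate: str, notes: List[str]) -> bool:
    return any(candidate.lower() in note.lower() for note in notes)


if __name__ == "__main__":
    q = "What is the capital of Australia?"
    for answer in ("Canberra", "Sydney"):
        v = evaluate_answer(q, answer, toy_search, toy_generate_query,
                            toy_summarize, toy_reflect, toy_judge)
        print(answer, v.correct, v.queries)
```

Passing the search and LLM steps in as callables keeps the control flow separate from any particular model or search provider, which also makes the loop easy to exercise with stubs like the ones above.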

Anthology ID: 2026.acl-long.66
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 1466–1491
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.66/
Cite (ACL): Sher Badshah, Ali Emami, and Hassan Sajjad. 2026. SAGE: A Search-AuGmented Evaluation of Large Language Models on Free-Form QA. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1466–1491, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): SAGE: A Search-AuGmented Evaluation of Large Language Models on Free-Form QA (Badshah et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.66.pdf
Checklist: 2026.acl-long.66.checklist.pdf