Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA

Sher Badshah, Hassan Sajjad


Abstract
The emergence of Large Language Models (LLMs) as chat assistants capable of generating human-like conversations has amplified the need for robust evaluation methods, particularly for open-ended tasks. Conventional metrics such as Exact Match (EM) and F1, while useful, are inadequate for capturing the full semantics and contextual depth of such generative outputs. We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs as judges. Through experiments on free-form question-answering tasks, we demonstrate that combining multiple models improves the reliability and accuracy of evaluations, especially in tasks where a single model may struggle. The results indicate a strong correlation with human evaluations, establishing the proposed method as a reliable alternative to traditional metrics.
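
A minimal sketch of the idea described in the abstract: several judge LLMs each compare a model's free-form answer against a gold reference and return a verdict, and the per-judge verdicts are combined. The prompt wording, the query_llm placeholder, the judge model names, and the majority-vote aggregation below are illustrative assumptions rather than details taken from the paper; the paper itself (pages 251-267) specifies the actual protocol.

from collections import Counter

# Illustrative judge prompt; the paper's actual prompt is not reproduced on this page.
JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Does the candidate answer convey the same meaning as the reference?\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)

def query_llm(model: str, prompt: str) -> str:
    """Placeholder: wire this to an LLM provider of your choice."""
    raise NotImplementedError

def reference_guided_verdict(
    question: str,
    reference: str,
    candidate: str,
    judges=("judge-model-a", "judge-model-b", "judge-model-c"),  # hypothetical names
) -> str:
    """Collect a verdict from each judge LLM and return the majority vote."""
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    verdicts = []
    for model in judges:
        reply = query_llm(model, prompt).strip().upper()
        verdicts.append("CORRECT" if reply.startswith("CORRECT") else "INCORRECT")
    # Majority vote; an odd number of judges guarantees a strict majority.
    return Counter(verdicts).most_common(1)[0][0]

Using an odd number of judges is one simple way to avoid ties in a binary verdict; other aggregation schemes (e.g., weighting judges by agreement with humans) are equally compatible with this structure.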
Anthology ID:
2025.winlp-main.37
Volume:
Proceedings of the 9th Widening NLP Workshop
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Chen Zhang, Emily Allaway, Hua Shen, Lesly Miculicich, Yinqiao Li, Meryem M'hamdi, Peerat Limkonchotiwat, Richard He Bai, Santosh T.y.s.s., Sophia Simeng Han, Surendrabikram Thapa, Wiem Ben Rim
Venues:
WiNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
251–267
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.winlp-main.37/
Cite (ACL):
Sher Badshah and Hassan Sajjad. 2025. Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA. In Proceedings of the 9th Widening NLP Workshop, pages 251–267, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA (Badshah & Sajjad, WiNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.winlp-main.37.pdf