JudgeAgent: Beyond Static Benchmarks for Knowledge-Driven and Dynamic LLM Evaluation

Zhichao Shi, Xuhui Jiang, Chengjin Xu, Cangli Yao, Shengjie Ma, Yinghan Shen, Zixuan Li, Jian Guo, Yuanzhuo Wang


Abstract
Current evaluation methods for large language models (LLMs) rely primarily on static benchmarks, which present two major challenges: limited knowledge coverage and a fixed difficulty that is mismatched with the evaluated LLMs. These limitations lead to superficial assessments of LLM knowledge, thereby impeding targeted model optimization. To bridge this gap, we propose JudgeAgent, a knowledge-driven and dynamic evaluation framework for LLMs. To address limited knowledge coverage, JudgeAgent leverages LLM agents equipped with context graphs to systematically traverse knowledge structures for question generation. Furthermore, to mitigate data contamination and difficulty mismatch, it adopts a difficulty-adaptive, multi-turn interview mechanism. Together, these components allow JudgeAgent to achieve comprehensive evaluations and facilitate more effective improvement of LLMs. Empirical results demonstrate that JudgeAgent enables more comprehensive evaluations and facilitates effective model iterations, highlighting the potential of this knowledge-driven and dynamic evaluation paradigm. The source code is available at https://github.com/DataArcTech/JudgeAgent.
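To make the idea of a difficulty-adaptive, multi-turn interview over a context graph more concrete, here is a minimal sketch. It is not the authors' implementation: the toy graph `GRAPH`, the helpers `walk` and `interview`, the `answer_fn` callback, and the choice to equate difficulty with the number of hops walked in the graph are all assumptions made purely for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy context graph: nodes are knowledge items,
# edges link related items. Not taken from the paper.
GRAPH: Dict[str, List[str]] = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}


def walk(graph: Dict[str, List[str]], start: str, hops: int) -> List[str]:
    """Follow the first outgoing edge up to `hops` times; return the visited path."""
    path = [start]
    for _ in range(hops):
        neighbors = graph[path[-1]]
        if not neighbors:
            break  # dead end: stop early
        path.append(neighbors[0])
    return path


def interview(
    graph: Dict[str, List[str]],
    start: str,
    answer_fn: Callable[[List[str]], bool],
    turns: int = 3,
    min_level: int = 1,
    max_level: int = 3,
) -> List[Tuple[int, List[str], bool]]:
    """Run a multi-turn interview, adapting difficulty after each answer.

    Assumption: difficulty level L means a question built from a path
    with L - 1 hops. `answer_fn` stands in for asking the evaluated
    model and judging its answer; it returns True if the answer is correct.
    """
    level = 2  # start in the middle of the difficulty range
    log: List[Tuple[int, List[str], bool]] = []
    for _ in range(turns):
        path = walk(graph, start, hops=level - 1)
        correct = answer_fn(path)
        log.append((level, path, correct))
        # Raise difficulty after a correct answer, lower it after a wrong one.
        level = min(max_level, level + 1) if correct else max(min_level, level - 1)
    return log


if __name__ == "__main__":
    # Stand-in "model" that only handles questions spanning at most two nodes.
    for level, path, correct in interview(GRAPH, "A", lambda p: len(p) <= 2):
        print(level, path, correct)
```

In this sketch the difficulty oscillates around the point where the stand-in model starts to fail, which is the general behaviour one would expect from an adaptive interview; the actual question generation and judging in JudgeAgent are carried out by LLM agents and are not reproduced here.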
Anthology ID:
2026.findings-acl.634
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
13004–13030
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.634/
Cite (ACL):
Zhichao Shi, Xuhui Jiang, Chengjin Xu, Cangli Yao, Shengjie Ma, Yinghan Shen, Zixuan Li, Jian Guo, and Yuanzhuo Wang. 2026. JudgeAgent: Beyond Static Benchmarks for Knowledge-Driven and Dynamic LLM Evaluation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 13004–13030, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
JudgeAgent: Beyond Static Benchmarks for Knowledge-Driven and Dynamic LLM Evaluation (Shi et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.634.pdf
Checklist:
 2026.findings-acl.634.checklist.pdf