From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs
Jessica M. Lundin, Usman Nasir Nakakana, Guillaume Chabot-Couture
Abstract
Rigorous evaluation of domain-specific language models requires benchmarks that are comprehensive, contamination-resistant, and maintainable. Static, manually curated datasets do not satisfy these properties. We present a graph-based evaluation harness that transforms structured clinical guidelines into a queryable knowledge graph and dynamically instantiates evaluation queries via graph traversal. The framework provides three guarantees: (1) complete coverage of guideline relationships; (2) surface-form contamination resistance through combinatorial variation; and (3) validity inherited from expert-authored graph structure. Applied to the WHO IMCI guidelines, the harness generates clinically grounded multiple-choice questions spanning symptom recognition, treatment, severity classification, and follow-up care. Evaluation across five language models reveals systematic capability gaps. Models perform well on symptom recognition but show lower accuracy on treatment protocols and clinical management decisions. The framework supports continuous regeneration of evaluation data as guidelines evolve and generalizes to domains with structured decision logic. This provides a scalable foundation for evaluation infrastructure.
Data and Code Availability
The WHO IMCI handbook is publicly available (WHO, 2014). Our graph construction, question generation code, and generated question dataset are available at https://github.com/jessicalundin/graph_testing_harness.
- Anthology ID: 2026.evaleval-1.34
- Volume: Proceedings of the Workshop on Evaluating Evaluations (EvalEval)
- Month: July
- Year: 2026
- Address: San Diego, CA
- Editors: Mubashara Akhtar, Jan Batzner, Leshem Choshen, Avijit Ghosh, Usman Gohar, Jennifer Mickel, Ichhya Pant, Zeerak Talat, Michelle Lin
- Venues: EvalEval | WS
- Publisher: Association for Computational Linguistics
- Pages: 211–220
- URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.evaleval-1.34/
- Cite (ACL): Jessica M. Lundin, Usman Nasir Nakakana, and Guillaume Chabot-Couture. 2026. From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs. In Proceedings of the Workshop on Evaluating Evaluations (EvalEval), pages 211–220, San Diego, CA. Association for Computational Linguistics.
- Cite (Informal): From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs (Lundin et al., EvalEval 2026)
- PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.evaleval-1.34.pdf
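
The abstract describes instantiating multiple-choice questions by traversing a guideline knowledge graph. The following is a minimal sketch of that general idea, not the authors' implementation: the toy graph, the relation names (`has_symptom`, `treated_with`), the question templates, and the function `generate_questions` are all hypothetical placeholders, and the node labels are deliberately non-clinical. It emits one question per edge, with distractors drawn from other targets of the same relation type.

```python
import random
from collections import defaultdict

# Hypothetical toy graph as (source, relation, target) triples.
# Labels are placeholders, not content from the WHO IMCI guidelines.
EDGES = [
    ("condition_A", "has_symptom", "symptom_1"),
    ("condition_B", "has_symptom", "symptom_2"),
    ("condition_C", "has_symptom", "symptom_3"),
    ("condition_D", "has_symptom", "symptom_4"),
    ("condition_A", "treated_with", "treatment_1"),
    ("condition_B", "treated_with", "treatment_2"),
    ("condition_C", "treated_with", "treatment_3"),
    ("condition_D", "treated_with", "treatment_4"),
]

# Hypothetical question templates, one per relation type.
TEMPLATES = {
    "has_symptom": "Which symptom is associated with {source}?",
    "treated_with": "Which treatment is indicated for {source}?",
}


def generate_questions(edges, n_distractors=3, seed=0):
    """Build one multiple-choice question per edge of the graph."""
    rng = random.Random(seed)
    targets_by_relation = defaultdict(set)
    targets_by_source = defaultdict(set)
    for source, relation, target in edges:
        targets_by_relation[relation].add(target)
        targets_by_source[(source, relation)].add(target)

    questions = []
    for source, relation, target in edges:
        # Distractors share the relation type but are not valid for this source.
        valid = targets_by_source[(source, relation)]
        pool = sorted(targets_by_relation[relation] - valid)
        if len(pool) < n_distractors:
            continue  # too few distractors to form a question
        options = rng.sample(pool, n_distractors) + [target]
        rng.shuffle(options)
        questions.append({
            "question": TEMPLATES[relation].format(source=source),
            "options": options,
            "answer": target,
            "relation": relation,
        })
    return questions


questions = generate_questions(EDGES)
```

In this sketch, changing `seed` varies the distractor choice and option order while the answer key stays tied to the graph edge, which illustrates in a simplified way how surface forms can vary without changing the underlying relationship being tested.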