Process Evaluation for Agentic Systems

Milan Gritta, Debjit Paul, Xiaoguang Li, Lifeng Shang, Jun Wang, Gerasimos Lampouras


Abstract
The significance of tasks entrusted to LLM-based assistants (agents) and the associated societal risks are increasing each year. Agents are being explored in critical domains such as medicine, finance, law, infrastructure, and other sensitive applications that require system transparency and high user trust. The quality of these agents is typically evaluated by accuracy, sometimes extended to partial correctness. In this position paper, we argue that this focus on outcomes is insufficient, as it can obscure risky agent behaviours such as skipping critical steps, hallucinating tool use, relying on outdated parametric knowledge, and other means of bypassing recommended processes. Our core position is that a holistic agent evaluation must include process evaluation, especially for critical applications. We conduct a small-scale study to assess the feasibility of automatic process evaluation, present a compliance score, analyse use cases of bad and good behaviours, and offer recommendations for more holistic evaluation.
Anthology ID: 2026.findings-eacl.140
Volume: Findings of the Association for Computational Linguistics: EACL 2026
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 2678–2692
URL: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.140/
Cite (ACL): Milan Gritta, Debjit Paul, Xiaoguang Li, Lifeng Shang, Jun Wang, and Gerasimos Lampouras. 2026. Process Evaluation for Agentic Systems. In Findings of the Association for Computational Linguistics: EACL 2026, pages 2678–2692, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Process Evaluation for Agentic Systems (Gritta et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.140.pdf
Checklist: 2026.findings-eacl.140.checklist.pdf