Process Evaluation for Agentic Systems
Milan Gritta, Debjit Paul, Xiaoguang Li, Lifeng Shang, Jun Wang, Gerasimos Lampouras
Abstract
The significance of tasks entrusted to LLM-based assistants (agents) and the associated societal risks are increasing each year. Agents are being explored in critical domains such as medicine, finance, law, infrastructure, and other sensitive applications that require system transparency and high user trust. The quality of these agents is typically evaluated by accuracy, sometimes extended to partial correctness. In this position paper, we argue that this focus on outcomes is insufficient as it can obscure risky agent behaviours such as skipping critical steps, hallucinating tool use, relying on outdated parametric knowledge and other means of bypassing recommended processes. Our core position is that a holistic agent evaluation must include process evaluation, especially for critical applications. We conduct a small-scale study to assess the feasibility of automatic process evaluation, present a compliance score, analyse use cases of bad and good behaviours, and offer recommendations for more holistic evaluation.
- Anthology ID:
- 2026.findings-eacl.140
- Volume:
- Findings of the Association for Computational Linguistics: EACL 2026
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Marquez
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 2678–2692
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.140/
- Cite (ACL):
- Milan Gritta, Debjit Paul, Xiaoguang Li, Lifeng Shang, Jun Wang, and Gerasimos Lampouras. 2026. Process Evaluation for Agentic Systems. In Findings of the Association for Computational Linguistics: EACL 2026, pages 2678–2692, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- Process Evaluation for Agentic Systems (Gritta et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.140.pdf