PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

Kazuki Kawamura, Satoshi Waki, Kei Tateno


Abstract
Multi-agent LLM workflows, which are AI systems composed of multiple role-specialized LLM calls, often outperform single prompts, but they are notoriously difficult to debug and refine. Failures can originate from subtle mistakes in intermediate artifacts that silently propagate downstream, forcing developers to read long traces and guess which agent to edit. We present PROTEA, a unified UI that closes the loop for offline, test-case–driven improvement of multi-agent workflows, enabling developers to diagnose and fix errors efficiently without manually inspecting long traces. PROTEA executes a workflow, scores intermediate artifacts with configurable evaluators, and overlays per-node states and rationales on the workflow graph to localize likely bottlenecks. Because intermediate reference outputs are difficult to prepare in complex systems, PROTEA performs backward node evaluation: it infers each node’s expected output from terminal supervision and graph context, then compares it with the observed node output. For a selected node, it proposes a targeted prompt patch as an editable diff, then automatically re-runs and re-evaluates the workflow to show before/after output diffs and score trajectories within the same interface. Using PROTEA, users can visually pinpoint system-wide bottlenecks at a glance, streamline remediation via semi-automated prompt patching, and immediately compare outcomes before and after a correction within a unified loop.
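The backward node evaluation described in the abstract can be illustrated with a small sketch. This is not PROTEA's implementation: the names `Node`, `execute`, `token_overlap`, `backward_evaluate`, and `infer_expected` are hypothetical, the token-overlap evaluator and the 0.8 threshold are arbitrary choices, and the expected outputs that a real system would presumably infer with an LLM are hard-coded here in a lookup table so that the example is deterministic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Node:
    name: str
    run: Callable[[Dict[str, str]], str]  # maps upstream outputs to this node's output
    parents: List[str]


def execute(nodes: List[Node], task: str) -> Dict[str, str]:
    """Run nodes in the given (topological) order, keeping every intermediate artifact."""
    outputs = {"input": task}
    for node in nodes:
        outputs[node.name] = node.run({p: outputs[p] for p in node.parents})
    return outputs


def token_overlap(expected: str, observed: str) -> float:
    """Toy evaluator: fraction of expected tokens that appear in the observed output."""
    exp, obs = set(expected.lower().split()), set(observed.lower().split())
    return len(exp & obs) / len(exp) if exp else 1.0


def backward_evaluate(nodes, outputs, infer_expected, terminal_reference, threshold=0.8):
    """Walk the graph from the terminal node backward, scoring each node against
    an expected output inferred from the terminal reference (assumed helper)."""
    report = {}
    for node in reversed(nodes):
        expected = infer_expected(node, terminal_reference, outputs)
        score = token_overlap(expected, outputs[node.name])
        state = "ok" if score >= threshold else "suspect"
        report[node.name] = {"score": score, "state": state}
    return report


# Hypothetical three-node workflow; "extract" simulates a faulty prompt.
nodes = [
    Node("normalize", lambda up: up["input"].lower(), ["input"]),
    Node("extract", lambda up: "capital: lyon", ["normalize"]),
    Node("summarize", lambda up: "the capital is " + up["extract"].split(": ")[1], ["extract"]),
]

terminal_reference = "the capital is paris"

# Stand-in for inference of per-node expectations (hard-coded assumption).
EXPECTED = {
    "normalize": "paris is the capital of france",
    "extract": "capital: paris",
    "summarize": terminal_reference,
}


def infer_expected(node, reference, outputs):
    return EXPECTED[node.name]


outputs = execute(nodes, "Paris is the capital of France")
report = backward_evaluate(nodes, outputs, infer_expected, terminal_reference)

# The earliest suspect node in execution order is the likeliest origin of the failure.
first_suspect = next(n.name for n in nodes if report[n.name]["state"] == "suspect")
print(report)
print("likely bottleneck:", first_suspect)
```

In this toy run both `extract` and `summarize` fall below the threshold, because the faulty extraction propagates downstream, while `normalize` passes. Taking the earliest suspect node in execution order is one simple way to pick a candidate for a prompt patch.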
Anthology ID: 2026.acl-demo.3
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Greg Durrett, Ping Jian
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 27–35
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-demo.3/
Cite (ACL): Kazuki Kawamura, Satoshi Waki, and Kei Tateno. 2026. PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 27–35, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows (Kawamura et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-demo.3.pdf