ComfyFlow: Benchmarking LLMs for AIGC Workflow Generation
Zhenran Xu, Yiyu Wang, Yunxin Li, Muyang Ye, Yangxue, Kai Chen, Longyue Wang, Weihua Luo, Baotian Hu, Min Zhang
Abstract
Large language models (LLMs) have shown promising advancements in tackling human-level tasks, wherein generating workflows for collaborative AI systems remains a critical and challenging step. To explore this frontier, we introduce ComfyFlow, a comprehensive benchmark to evaluate current LLMs’ ability to generate executable and instruction-following AIGC workflows in ComfyUI. The dataset includes 400 diverse visual generation tasks across 20 categories, supported by 10K training examples constructed from knowledge bases, which contain detailed annotations for 2,480 nodes and 3,298 workflows. We establish a systematic evaluation protocol that quantifies performance across multiple dimensions, ranging from basic format validity to multi-level hallucination rates. Our extensive evaluations show that: 1) ComfyFlow presents a substantial challenge even for top-tier proprietary LLMs such as GPT-5.1 and the Claude series; 2) Open-source models achieve new state-of-the-art results after post-training, yet struggle with long-horizon planning as the number of nodes increases; 3) Different post-training strategies offer complementary benefits in following instructions and mitigating hallucinations. By establishing both a challenging benchmark and a principled evaluation scheme, ComfyFlow lays the foundation for developing more intelligent and reliable collaborative AI systems.
- Anthology ID:
- 2026.findings-acl.140
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 2903–2916
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.140/
- Cite (ACL):
- Zhenran Xu, Yiyu Wang, Yunxin Li, Muyang Ye, Yangxue, Kai Chen, Longyue Wang, Weihua Luo, Baotian Hu, and Min Zhang. 2026. ComfyFlow: Benchmarking LLMs for AIGC Workflow Generation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 2903–2916, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- ComfyFlow: Benchmarking LLMs for AIGC Workflow Generation (Xu et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.140.pdf