ComfyFlow: Benchmarking LLMs for AIGC Workflow Generation

Zhenran Xu; Yiyu Wang; Yunxin Li; Muyang Ye; Yangxue; Kai Chen; Longyue Wang; Weihua Luo; Baotian Hu; Min Zhang

Abstract

Large language models (LLMs) have shown promising advancements in tackling human-level tasks, yet generating workflows for collaborative AI systems remains a critical and challenging step. To explore this frontier, we introduce ComfyFlow, a comprehensive benchmark to evaluate current LLMs’ ability to generate executable and instruction-following AIGC workflows in ComfyUI. The dataset includes 400 diverse visual generation tasks across 20 categories, supported by 10K training examples constructed from knowledge bases, which contain detailed annotations for 2,480 nodes and 3,298 workflows. We establish a systematic evaluation protocol that quantifies performance across multiple dimensions, ranging from basic format validity to multi-level hallucination rates. Our extensive evaluations show that: 1) ComfyFlow presents a substantial challenge even for top-tier proprietary LLMs such as GPT-5.1 and the Claude series; 2) open-source models achieve new state-of-the-art results after post-training, yet struggle with long-horizon planning as the number of nodes increases; 3) different post-training strategies offer complementary benefits in following instructions and mitigating hallucinations. By establishing both a challenging benchmark and a principled evaluation scheme, ComfyFlow lays the foundation for developing more intelligent and reliable collaborative AI systems.

Anthology ID: 2026.findings-acl.140
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 2903–2916
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.140/
Cite (ACL): Zhenran Xu, Yiyu Wang, Yunxin Li, Muyang Ye, Yangxue, Kai Chen, Longyue Wang, Weihua Luo, Baotian Hu, and Min Zhang. 2026. ComfyFlow: Benchmarking LLMs for AIGC Workflow Generation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 2903–2916, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): ComfyFlow: Benchmarking LLMs for AIGC Workflow Generation (Xu et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.140.pdf
Checklist: 2026.findings-acl.140.checklist.pdf
