Large Language Models for IT Automation Tasks: Are We There Yet?

Md. Mahadi Hassan, John Salvador, Akond Ashfaque Ur Rahman, Santu Karmaker


Abstract
LLMs show promise in code generation, yet their effectiveness for IT automation tasks, particularly for tools like Ansible, remains understudied. Existing benchmarks rely primarily on synthetic tasks that fail to capture the needs of practitioners who use IT automation tools. We present ExITBench (Execution-based IT Automation Benchmark), a benchmark of 126 diverse tasks (e.g., configuring servers and managing files) in which each task captures state reconciliation, a core property of IT automation tools. ExITBench evaluates LLMs’ ability to generate functional Ansible automation scripts via dynamic execution in controlled environments. We evaluate 14 open-source and 3 proprietary LLMs and find that GPT-4.1-Mini achieves the best pass@10 rate of 23.9%, while Claude-3.5-Sonnet achieves the best pass@1 performance. To explain the low performance, we analyze 1,517 execution failures across the evaluated LLMs and identify two prevalent semantic error categories: failures in state-reconciliation reasoning (42.117% combined, from variable (12.287%), host (10.363%), path (10.511%), and template (8.956%) issues) and deficiencies in module-specific execution knowledge (26.203% combined, from attribute & parameter (17.617%) and module (8.586%) errors). Our findings reveal key limitations in LLMs’ ability to address state reconciliation and apply specialized module knowledge, indicating that reliable IT automation with LLM-based agents needs major advances in state reasoning and domain-specific execution.
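To make the notion of state reconciliation concrete, the following is a minimal Python sketch of the general idea: compare the current state of a host with a desired state and change only what differs, so that a second run is a no-op. It is an illustrative assumption, not ExITBench's harness or Ansible's implementation; the `reconcile` function, the dictionary-based state model, and the example keys are all hypothetical.

```python
def reconcile(current: dict, desired: dict) -> list:
    """Move `current` toward `desired` in place and return the actions taken.

    Hypothetical toy model: state is a flat dict of resource -> value.
    Only entries that differ from the desired state are changed.
    """
    actions = []
    for key, value in desired.items():
        if key not in current or current[key] != value:
            actions.append(("set", key, value))
            current[key] = value
    return actions


# Hypothetical host state and target state, for illustration only.
host = {"nginx": "absent", "/etc/motd": "old banner"}
target = {"nginx": "present", "/etc/motd": "welcome"}

first = reconcile(host, target)   # changes are needed on the first run
second = reconcile(host, target)  # nothing left to change on the second run

print(first)
print(second)
```

The second call returning an empty list is the property of interest: a script that reasons correctly about state converges and then reports no further changes.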
Anthology ID:
2026.findings-acl.560
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
11534–11573
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.560/
Cite (ACL):
Md. Mahadi Hassan, John Salvador, Akond Ashfaque Ur Rahman, and Santu Karmaker. 2026. Large Language Models for IT Automation Tasks: Are We There Yet?. In Findings of the Association for Computational Linguistics: ACL 2026, pages 11534–11573, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Large Language Models for IT Automation Tasks: Are We There Yet? (Hassan et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.560.pdf
Checklist:
 2026.findings-acl.560.checklist.pdf