Large Language Models for IT Automation Tasks: Are We There Yet?
Md. Mahadi Hassan, John Salvador, Akond Ashfaque Ur Rahman, Santu Karmaker
Abstract
LLMs show promise in code generation, yet their effectiveness for IT automation tasks, particularly for tools like Ansible, remains understudied. Existing benchmarks rely primarily on synthetic tasks that fail to capture the needs of practitioners who use IT automation tools. We present ExITBench (Execution-based IT Automation Benchmark), a benchmark of 126 diverse tasks (e.g., configuring servers and managing files) in which each task captures state reconciliation, a core property of IT automation tools. ExITBench evaluates LLMs’ ability to generate functional Ansible automation scripts via dynamic execution in controlled environments. We evaluate 14 open-source and 3 proprietary LLMs and find that GPT-4.1-Mini achieves the best pass@10 rate of 23.9%, while Claude-3.5-Sonnet achieves the best pass@1 performance. To explain the low performance, we analyze 1,517 execution failures across the evaluated LLMs and identify two prevalent semantic error categories: failures in state-reconciliation reasoning (42.117% combined from variable (12.287%), host (10.363%), path (10.511%), and template (8.956%) issues) and deficiencies in module-specific execution knowledge (26.203% combined from attribute & parameter (17.617%) and module (8.586%) errors). Our findings reveal key limitations in LLMs’ ability to address state reconciliation and apply specialized module knowledge, indicating that reliable IT automation with LLM-based agents needs major advances in state reasoning and domain-specific execution.
- Anthology ID:
- 2026.findings-acl.560
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 11534–11573
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.560/
- Cite (ACL):
- Md. Mahadi Hassan, John Salvador, Akond Ashfaque Ur Rahman, and Santu Karmaker. 2026. Large Language Models for IT Automation Tasks: Are We There Yet?. In Findings of the Association for Computational Linguistics: ACL 2026, pages 11534–11573, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Large Language Models for IT Automation Tasks: Are We There Yet? (Hassan et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.560.pdf