AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark

Lan Li; Liri Fang; Bertram Ludäscher; Vetle I Torvik

doi:10.18653/v1/2025.findings-emnlp.410

Abstract

Data cleaning is a time-consuming and error-prone manual process, even with modern workflow tools like OpenRefine. Here, we present AutoDCWorkflow, an LLM-based pipeline for automatically generating data-cleaning workflows. The pipeline takes a raw table coupled with a data analysis purpose, and generates a sequence of OpenRefine operations designed to produce a minimal, clean table sufficient to address the purpose. Six operations address common data quality issues including format inconsistencies, type errors, and duplicates. To evaluate AutoDCWorkflow, we create a benchmark with metrics assessing answer, data, and workflow quality for 142 purposes using 96 tables across six topics. The evaluation covers three key dimensions: (1) **Purpose Answer**: can the cleaned table produce a correct answer? (2) **Column (Value)**: how closely does it match the ground truth table? (3) **Workflow (Operations)**: to what extent does the generated workflow resemble the human-curated ground truth? Experiments show that Llama 3.1, Mistral, and Gemma 2 significantly enhance data quality, outperforming the baseline across all metrics. Gemma 2-27B consistently generates high-quality tables and answers, while Gemma 2-9B excels in producing workflows that resemble human annotations.
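To make the idea of a workflow as "a sequence of operations" concrete, the following is a minimal sketch in plain Python. It is not the AutoDCWorkflow implementation and does not call OpenRefine: the operation names (`keep_columns`, `trim`, `upper`, `drop_duplicates`), the sample table, and the purpose are all hypothetical, chosen only to illustrate how an ordered list of operations can reduce a raw table to a minimal, clean table for one purpose.

```python
from typing import Callable

# A table is a list of rows; each row maps column name -> cell value.
Table = list[dict[str, str]]
Operation = Callable[[Table], Table]


def keep_columns(columns: list[str]) -> Operation:
    """Hypothetical op: keep only the columns the purpose needs."""
    def apply(table: Table) -> Table:
        return [{c: row[c] for c in columns} for row in table]
    return apply


def trim(column: str) -> Operation:
    """Hypothetical op: strip leading/trailing whitespace in one column."""
    def apply(table: Table) -> Table:
        return [{**row, column: row[column].strip()} for row in table]
    return apply


def upper(column: str) -> Operation:
    """Hypothetical op: normalize letter case in one column."""
    def apply(table: Table) -> Table:
        return [{**row, column: row[column].upper()} for row in table]
    return apply


def drop_duplicates() -> Operation:
    """Hypothetical op: remove rows that repeat an earlier row exactly."""
    def apply(table: Table) -> Table:
        seen: set[tuple] = set()
        unique: Table = []
        for row in table:
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                unique.append(row)
        return unique
    return apply


def run_workflow(table: Table, workflow: list[Operation]) -> Table:
    """Apply the operations in order; the input table is not modified."""
    for operation in workflow:
        table = operation(table)
    return table


# Assumed purpose: "Which cities appear in which years?"
raw = [
    {"city": " chicago ", "year": "2019", "note": "a"},
    {"city": "Chicago", "year": "2019", "note": "b"},
    {"city": "urbana", "year": "2020", "note": "c"},
]
workflow = [keep_columns(["city", "year"]), trim("city"),
            upper("city"), drop_duplicates()]
cleaned = run_workflow(raw, workflow)
```

In this sketch the order of operations matters: the two Chicago rows only become exact duplicates after whitespace and case are normalized, so `drop_duplicates` has to come last.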

Anthology ID: 2025.findings-emnlp.410
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 7766–7780
URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.410/
DOI: 10.18653/v1/2025.findings-emnlp.410
Cite (ACL): Lan Li, Liri Fang, Bertram Ludäscher, and Vetle I Torvik. 2025. AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 7766–7780, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark (Li et al., Findings 2025)
PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.410.pdf
Checklist: 2025.findings-emnlp.410.checklist.pdf
