Bertram Ludäscher




2025

AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark
Lan Li | Liri Fang | Bertram Ludäscher | Vetle I Torvik
Findings of the Association for Computational Linguistics: EMNLP 2025

Data cleaning is a time-consuming and error-prone manual process even with modern workflow tools like OpenRefine. Here, we present AutoDCWorkflow, an LLM-based pipeline for automatically generating data-cleaning workflows. The pipeline takes a raw table coupled with a data analysis purpose, and generates a sequence of OpenRefine operations designed to produce a minimal, clean table sufficient to address the purpose. Six operations address common data quality issues including format inconsistencies, type errors, and duplicates. To evaluate AutoDCWorkflow, we create a benchmark with metrics assessing answers, data, and workflow quality for 142 purposes using 96 tables across six topics. The evaluation covers three key dimensions: (1) **Purpose Answer**: can the cleaned table produce a correct answer? (2) **Column (Value)**: how closely does it match the ground truth table? (3) **Workflow (Operations)**: to what extent does the generated workflow resemble the human-curated ground truth? Experiments show that Llama 3.1, Mistral, and Gemma 2 significantly enhance data quality, outperforming the baseline across all metrics. Gemma 2-27B consistently generates high-quality tables and answers, while Gemma 2-9B excels in producing workflows that resemble human annotations.
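
To make the Column (Value) and Workflow (Operations) dimensions concrete, here is a minimal sketch of how such comparisons could be computed. The abstract does not define the benchmark's metrics, so both functions are assumptions chosen for illustration: a cell-by-cell match ratio for a column, and a Jaccard overlap between the sets of operation names in two workflows. The function names and the operation labels in the usage lines are hypothetical and are not taken from the paper or from OpenRefine.

```python
from typing import Sequence


def column_match_ratio(cleaned: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of cells in a cleaned column that equal the ground truth.

    Assumed stand-in for a column-level metric, not the paper's definition.
    Cells are compared position by position.
    """
    if len(cleaned) != len(truth):
        raise ValueError("columns must have the same number of rows")
    if not truth:
        return 1.0  # two empty columns are trivially identical
    matches = sum(c == t for c, t in zip(cleaned, truth))
    return matches / len(truth)


def operation_overlap(generated: Sequence[str], reference: Sequence[str]) -> float:
    """Jaccard overlap between the operation names used by two workflows.

    Assumed stand-in for a workflow-level metric; it ignores operation
    order and arguments.
    """
    gen, ref = set(generated), set(reference)
    if not gen and not ref:
        return 1.0  # two empty workflows are trivially identical
    return len(gen & ref) / len(gen | ref)


# Hypothetical data: one of four cells differs from the ground truth.
value_score = column_match_ratio(["a", "b", "c", "d"], ["a", "b", "x", "d"])

# Hypothetical operation labels: one shared operation out of three distinct ones.
workflow_score = operation_overlap(["trim", "upper"], ["trim", "mass_edit"])
```

A set-based overlap is deliberately coarse. A metric that also compares operation order or arguments would separate workflows that this sketch scores identically.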