AIRepr: An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science

Qiuhai Zeng, Claire Jin, Xinyue Wang, Yuhan Zheng, Qunhua Li


Abstract
Large language models (LLMs) are increasingly used to automate data analysis through executable code generation. Yet, data science tasks often admit multiple statistically valid solutions—for example, different modeling strategies—making it critical to understand the reasoning behind analyses, not just their outcomes. While manual review of LLM-generated code can help ensure statistical soundness, it is labor-intensive and requires expertise. A more scalable approach is to evaluate the underlying workflows—the logical plans guiding code generation. However, it remains unclear how to assess whether an LLM-generated workflow supports reproducible implementations. To address this, we present **AIRepr**, an **A**nalyst–**I**nspector framework for automatically evaluating and improving the **repr**oducibility of LLM-generated data analysis workflows. Our framework is grounded in statistical principles and supports scalable, automated assessment. We introduce two novel reproducibility-enhancing prompting strategies and benchmark them against standard prompting across 15 analyst–inspector LLM pairs and 1,032 tasks from three public benchmarks. Our findings show that workflows with higher reproducibility also yield more accurate analyses, and that reproducibility-enhancing prompts substantially improve both metrics. This work provides a foundation for transparent, reliable, and efficient human–AI collaboration in data science. Our code is publicly available: https://github.com/Anonymous-2025-Repr/LLM-DS-Reproducibility
Anthology ID:
2025.findings-emnlp.539
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10170–10201
URL:
https://preview.aclanthology.org/ingest-luhme/2025.findings-emnlp.539/
DOI:
10.18653/v1/2025.findings-emnlp.539
Cite (ACL):
Qiuhai Zeng, Claire Jin, Xinyue Wang, Yuhan Zheng, and Qunhua Li. 2025. AIRepr: An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 10170–10201, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
AIRepr: An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science (Zeng et al., Findings 2025)
PDF:
https://preview.aclanthology.org/ingest-luhme/2025.findings-emnlp.539.pdf
Checklist:
2025.findings-emnlp.539.checklist.pdf