Reliable and Cost-Effective Exploratory Data Analysis via Graph-Guided RAG

Mossad Helali, Yutai Luo, Tae Jun Ham, Jim Plotts, Ashwin Chaugule, Jichuan Chang, Parthasarathy Ranganathan, Essam Mansour


Abstract
Automating Exploratory Data Analysis (EDA) is critical for accelerating the workflow of data scientists. While Large Language Models (LLMs) offer a promising solution, current LLM-only approaches often exhibit limited accuracy and code reliability on less-studied or private datasets. Moreover, their effectiveness significantly diminishes with open-source LLMs compared to proprietary ones, limiting their usability in enterprises that prefer local models for privacy and cost. To address these limitations, we introduce RAGvis: a novel two-stage graph-guided Retrieval-Augmented Generation (RAG) framework. First, RAGvis builds a base knowledge graph (KG) of EDA notebooks and enriches it with structured EDA operation semantics. These semantics are extracted by an LLM guided by our empirically developed EDA operations taxonomy. Second, in the online generation stage for new datasets, RAGvis retrieves relevant operations from the KG, aligns them to the dataset’s structure, refines them with LLM reasoning, and then employs a self-correcting agent to generate executable Python code. Experiments on two benchmarks demonstrate that RAGvis significantly improves code executability (pass rate), semantic accuracy, and visual quality in generated operations. This enhanced performance is achieved with substantially lower token usage compared to LLM-only baselines. Notably, our approach enables smaller, open-source LLMs to match the performance of proprietary models, presenting a reliable and cost-effective pathway for automated EDA code generation.
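The abstract mentions a self-correcting agent that produces executable Python code but does not say how it works. The following is only a minimal sketch of one common pattern for such a loop (execute, capture the error, ask for a repair, retry), not the RAGvis implementation. The names `run_snippet`, `self_correct`, and `toy_repair` are hypothetical, and `toy_repair` is a hard-coded stand-in for what would presumably be an LLM call.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_snippet(code: str) -> Optional[str]:
    """Run code in a fresh namespace; return None on success, else the error text."""
    try:
        exec(compile(code, "<generated>", "exec"), {})
        return None
    except Exception:
        return traceback.format_exc(limit=1)


def self_correct(
    code: str,
    repair: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Tuple[str, bool]:
    """Execute code up to max_attempts times, repairing it after each failure.

    Returns the last version that was executed and whether it ran cleanly.
    """
    for attempt in range(max_attempts):
        error = run_snippet(code)
        if error is None:
            return code, True
        if attempt < max_attempts - 1:
            # Assumption: `repair` wraps a model call that sees the code and the error.
            code = repair(code, error)
    return code, False


def toy_repair(code: str, error: str) -> str:
    """Hypothetical stand-in for an LLM: fixes one known typo on a NameError."""
    if "NameError" in error:
        return code.replace("totl", "total")
    return code


if __name__ == "__main__":
    broken = "total = sum([1, 2, 3])\nassert totl == 6\n"
    fixed, ok = self_correct(broken, toy_repair)
    print(ok)
    print(fixed)
```

Bounding the number of attempts keeps the cost of a failing operation predictable, which matters when token usage is one of the metrics being reported. Running untrusted generated code with `exec` is unsafe outside a sandbox; a real system would isolate execution.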
Anthology ID:
2025.emnlp-main.836
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
16547–16564
URL:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.emnlp-main.836/
DOI:
10.18653/v1/2025.emnlp-main.836
Cite (ACL):
Mossad Helali, Yutai Luo, Tae Jun Ham, Jim Plotts, Ashwin Chaugule, Jichuan Chang, Parthasarathy Ranganathan, and Essam Mansour. 2025. Reliable and Cost-Effective Exploratory Data Analysis via Graph-Guided RAG. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 16547–16564, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Reliable and Cost-Effective Exploratory Data Analysis via Graph-Guided RAG (Helali et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.emnlp-main.836.pdf
Checklist:
 2025.emnlp-main.836.checklist.pdf