Krishnasuri Narayanam


2025

Quality Assessment of Tabular Data using Large Language Models and Code Generation
Ashlesha Akella | Akshar Kaul | Krishnasuri Narayanam | Sameep Mehta
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Reliable data quality is crucial for downstream analysis of tabular datasets, yet rule-based validation often struggles with inefficiency, human intervention, and high computational costs. We present a three-stage framework that combines statistical inlier detection with LLM-driven rule and code generation. After filtering data samples through traditional clustering, we iteratively prompt LLMs to produce semantically valid quality rules and synthesize their executable validators through code-generating LLMs. To generate reliable quality rules, we aid LLMs with retrieval-augmented generation (RAG) by leveraging external knowledge sources and domain-specific few-shot examples. Robust guardrails ensure the accuracy and consistency of both rules and code snippets. Extensive evaluations on benchmark datasets confirm the effectiveness of our approach.

CodeGenWrangler: Data Wrangling task automation using Code-Generating Models
Ashlesha Akella | Abhijit Manatkar | Krishnasuri Narayanam | Sameep Mehta
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)

Assuring the data quality of tabular datasets is essential for the efficiency of diverse downstream tabular tasks (such as summarization and fact-checking). Data-wrangling tasks effectively address the challenges associated with structured data processing to improve the quality of tabular data. Traditional statistical methods handle numeric data efficiently but often fail to understand the semantic context of the textual data in tables. Deep learning approaches are resource-intensive, requiring task- and dataset-specific training. Addressing these shortcomings, we present an automated system that leverages LLMs to generate executable code for data-wrangling tasks like missing value imputation, error detection, and error correction. Our system aims to identify inherent patterns in the data while leveraging external knowledge, effectively addressing both memory-independent and memory-dependent tasks.

2024

QUIS: Question-guided Insights Generation for Automated Exploratory Data Analysis
Abhijit Manatkar | Ashlesha Akella | Parthivi Gupta | Krishnasuri Narayanam
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

Discovering meaningful insights from a large dataset, known as Exploratory Data Analysis (EDA), is a challenging task that requires thorough exploration and analysis of the data. Automated Data Exploration (ADE) systems use goal-oriented methods with Large Language Models and Reinforcement Learning towards full automation. However, these methods require human involvement to anticipate goals, which may limit insight extraction, while fully automated systems demand significant computational resources and retraining for new datasets. We introduce QUIS, a fully automated EDA system that operates in two stages: insight generation (ISGen) driven by question generation (QUGen). The QUGen module generates questions in iterations, refining them from previous iterations to enhance coverage without human intervention or manually curated examples. The ISGen module analyzes data to produce multiple relevant insights in response to each question, requiring no prior training and enabling QUIS to adapt to new datasets.