Multi-level Diagnosis and Evaluation for Robust Tabular Feature Engineering with Large Language Models

Yebin Lim, Susik Yoon


Abstract
Recent advancements in large language models (LLMs) have shown promise in feature engineering for tabular data, but concerns about their reliability persist, especially due to variability in generated outputs. We introduce a multi-level diagnosis and evaluation framework to assess the robustness of LLMs in feature engineering across diverse domains, focusing on three main factors: key variables, relationships, and decision boundary values for predicting target classes. We demonstrate that the robustness of LLMs varies significantly across datasets, and that high-quality LLM-generated features can improve few-shot prediction performance by up to 10.52%. This work opens a new direction for assessing and enhancing the reliability of LLM-driven feature engineering in various domains.
Anthology ID:
2025.findings-emnlp.249
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4630–4655
URL:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.249/
DOI:
10.18653/v1/2025.findings-emnlp.249
Cite (ACL):
Yebin Lim and Susik Yoon. 2025. Multi-level Diagnosis and Evaluation for Robust Tabular Feature Engineering with Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 4630–4655, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Multi-level Diagnosis and Evaluation for Robust Tabular Feature Engineering with Large Language Models (Lim & Yoon, Findings 2025)
PDF:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.249.pdf
Checklist:
2025.findings-emnlp.249.checklist.pdf