Reza Mousavi
2026
Generalization or Memorization? Multi-Agent vs. Baseline LLMs and AutoML Models for Tabular Classification
Aida Sanatizadeh | Sorouralsadat Fatemi | Reza Mousavi | Ahmed Abbasi
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) are increasingly used for structured tabular data, yet it remains unclear whether their performance reflects genuine reasoning or memorization of pre-training corpora. We investigate this question through a rigorous, contamination-aware evaluation of a representative modular Multi-Agent LLM (MALLM) framework against state-of-the-art AutoML systems and established baselines (TABLET, TABLLM). We evaluate eleven binary classification tasks: five pre-cutoff benchmarks likely seen during LLM pre-training and six post-cutoff datasets released after the LLM knowledge cutoff. Results show a sharp performance dichotomy: MALLM achieves competitive or superior performance on pre-cutoff datasets but substantially underperforms AutoML on post-cutoff data, exhibiting poor calibration and high variance, especially on hard-to-classify instances. By contrast, AutoML models generalize consistently and align confidence more closely with instance hardness. These findings suggest that, despite agentic scaffolding, current LLMs cannot yet replace production-grade discriminative models for tabular classification, underscoring the need for contamination-free benchmarks to accurately assess tabular reasoning capabilities.
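The pre-cutoff versus post-cutoff partition described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the `Dataset` class, the `split_by_cutoff` helper, the dataset names, the release dates, and the cutoff date are all hypothetical placeholders, and the sketch assumes that a dataset counts as post-cutoff when its release date falls strictly after the model's knowledge cutoff.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Dataset:
    """Hypothetical record: a dataset name and its public release date."""
    name: str
    released: date


def split_by_cutoff(datasets, cutoff):
    """Partition datasets into (pre_cutoff, post_cutoff) by release date.

    Assumption: anything released on or before the cutoff may have been
    seen during pre-training; anything released later could not have been.
    """
    pre = [d for d in datasets if d.released <= cutoff]
    post = [d for d in datasets if d.released > cutoff]
    return pre, post


# Placeholder values for illustration only; none come from the paper.
CUTOFF = date(2023, 12, 31)
catalog = [
    Dataset("old_benchmark_a", date(2015, 6, 1)),
    Dataset("old_benchmark_b", date(2019, 3, 15)),
    Dataset("new_dataset_c", date(2024, 8, 20)),
]

pre, post = split_by_cutoff(catalog, CUTOFF)
print("pre-cutoff: ", [d.name for d in pre])
print("post-cutoff:", [d.name for d in post])
```

Reporting metrics separately for the two groups is what lets a gap between them be read as a possible contamination signal rather than a difference in task difficulty alone.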