Generalization or Memorization? Multi-Agent vs. Baseline LLMs and AutoML Models for Tabular Classification

Aida Sanatizadeh, Sorouralsadat Fatemi, Reza Mousavi, Ahmed Abbasi


Abstract
Large Language Models (LLMs) are increasingly used for structured tabular data, yet it remains unclear whether their performance reflects genuine reasoning or memorization of pre-training corpora. We investigate this question through a rigorous, contamination-aware evaluation of a representative modular Multi-Agent LLM (MALLM) framework against state-of-the-art AutoML systems and established baselines (TABLET, TABLLM). We evaluate eleven binary classification tasks: five pre-cutoff benchmarks likely seen during LLM pre-training and six post-cutoff datasets released after the LLM knowledge cutoff. Results show a sharp performance dichotomy: MALLM achieves competitive or superior performance on pre-cutoff datasets but substantially underperforms AutoML on post-cutoff data, exhibiting poor calibration and high variance, especially on hard-to-classify instances. By contrast, AutoML models generalize consistently and align confidence more closely with instance hardness. These findings suggest that, despite agentic scaffolding, current LLMs cannot yet replace production-grade discriminative models for tabular classification, underscoring the need for contamination-free benchmarks to accurately assess tabular reasoning capabilities.
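As a rough illustration of what "poor calibration" means for a binary classifier, the sketch below computes expected calibration error (ECE): the sample-weighted gap between mean confidence and accuracy across confidence bins. The abstract does not say which calibration metric the paper uses, so the choice of ECE, the function name `expected_calibration_error`, the 0.5 decision threshold, and the ten-bin default are all assumptions made here for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Illustrative ECE for binary predictions (not the paper's stated metric).

    probs  -- predicted probability of the positive class, each in [0, 1]
    labels -- ground-truth labels, each 0 or 1
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    n = len(probs)
    if n == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * n_bins     # summed confidence per bin
    correct_sum = [0.0] * n_bins  # number of correct predictions per bin
    count = [0] * n_bins          # number of predictions per bin

    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0          # assumed decision threshold
        conf = p if pred == 1 else 1.0 - p   # confidence in the predicted class
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        correct_sum[idx] += 1.0 if pred == y else 0.0
        count[idx] += 1

    ece = 0.0
    for c, a, k in zip(conf_sum, correct_sum, count):
        if k:
            # Weight each bin's |accuracy - mean confidence| by its share of samples.
            ece += (k / n) * abs(a / k - c / k)
    return ece
```

A value near 0 means stated confidence tracks observed accuracy; a model that is confidently wrong on every instance scores close to 1.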
Anthology ID:
2026.findings-acl.1994
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
40099–40132
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1994/
Cite (ACL):
Aida Sanatizadeh, Sorouralsadat Fatemi, Reza Mousavi, and Ahmed Abbasi. 2026. Generalization or Memorization? Multi-Agent vs. Baseline LLMs and AutoML Models for Tabular Classification. In Findings of the Association for Computational Linguistics: ACL 2026, pages 40099–40132, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Generalization or Memorization? Multi-Agent vs. Baseline LLMs and AutoML Models for Tabular Classification (Sanatizadeh et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1994.pdf
Checklist:
 2026.findings-acl.1994.checklist.pdf