Farseen Shaikh

2026

Can LLMs Self-Correct Table Reasoning Errors?
Farseen Shaikh
Proceedings of the First Workshop on Structured Understanding, Retrieval, and Generation in the LLM Era (SURGeLLM 2026)

Self-correction, the ability of LLMs to detect and fix their own errors, has been studied extensively for mathematical and code reasoning, but prior work on table reasoning is limited and consists primarily of multi-agent pipelines such as Table-Critic (ACL 2025) rather than single-model structured prompting. Tables present unique challenges: errors arise from wrong cell retrieval, incorrect computation, flawed logic, and hallucination of values not present in the data. We conduct the first cross-provider single-model self-correction analysis for table reasoning, testing five models from five providers (Gemini 3.1 Pro from Google, Kimi K2.5 from Moonshot AI, GLM 5 from Zhipu, Qwen 3.5+ from Alibaba, and MiniMax M2.5 from MiniMax) on WikiTableQuestions and TabFact with a multi-seed paired protocol. We propose Structured Self-Correction (SSC), a table-specific verification chain that guides models through cell verification, computation checking, logic validation, and completeness assessment. We confirm that the Accuracy-Correction Paradox (terminology from Li 2025), previously observed in math, extends to tables: models with base accuracy in the mid-60s to mid-70s percent range benefit modestly from self-correction (multi-seed mean SCG up to +1.3%, with within-seed point estimates as high as +3.4%), while stronger models above this range are systematically harmed by over-correction (multi-seed mean SCG down to -1.3%, with 95% bootstrap CIs lying below zero). SSC reduces over-correction rates in 9 of 10 conditions, with reductions of 38–69% on TabFact. An inference-mode-controlled probe shows that SSC’s qualitative direction is robust for Qwen 3.5+ across reasoning-ON and reasoning-OFF settings, while GLM 5 exhibits a substantial mode-dependent shift, indicating that mode robustness is itself model-dependent. Stronger baselines (self-consistency, self-critic, tool-augmented arithmetic verification, majority voting, and a same-family scaling probe) further characterize where SSC helps. Ablation studies reveal that answer-aware review is essential, that reasoning traces aid error detection, and that iterative correction shows diminishing returns. A FinQA domain transfer probe confirms a capability floor: self-correction fails when base task competence is very low (21.5% accuracy). Our primary contribution is empirical: we characterize the conditions under which self-correction helps or harms table reasoning, providing actionable guidance for practitioners.
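As a rough illustration of the paired quantities the abstract reports, the Python sketch below computes a correction gain, an over-correction rate, and a percentile bootstrap interval from per-example correctness before and after a correction pass. It rests on assumptions that the abstract does not spell out: that SCG is final accuracy minus base accuracy, and that the over-correction rate is the share of initially correct answers that the revision made wrong. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import random


def correction_metrics(before, after):
    """Paired metrics from per-example correctness flags (lists of bools)."""
    n = len(before)
    base_acc = sum(before) / n
    final_acc = sum(after) / n
    # Over-correction (assumed definition): initially correct, wrong after revision.
    broken = sum(1 for b, a in zip(before, after) if b and not a)
    # Successful correction: initially wrong, correct after revision.
    fixed = sum(1 for b, a in zip(before, after) if not b and a)
    n_correct = sum(before)
    return {
        "gain": final_acc - base_acc,  # assumed meaning of SCG
        "over_correction_rate": broken / n_correct if n_correct else 0.0,
        "fixed": fixed,
        "broken": broken,
    }


def bootstrap_gain_ci(before, after, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the gain, resampling examples as pairs."""
    rng = random.Random(seed)
    n = len(before)
    gains = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        # True/False behave as 1/0, so this is the mean paired difference.
        gains.append(sum(after[i] - before[i] for i in idx) / n)
    gains.sort()
    lo = gains[int((alpha / 2) * n_resamples)]
    hi = gains[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy data (made up): 8 of 10 answers start correct; revision breaks 2 and fixes 1.
before = [True] * 8 + [False] * 2
after = [True] * 6 + [False] * 2 + [True, False]

metrics = correction_metrics(before, after)
ci_low, ci_high = bootstrap_gain_ci(before, after)
print(metrics, (ci_low, ci_high))
```

On this toy input the gain is negative (accuracy falls from 0.8 to 0.7) and the over-correction rate is 0.25, mirroring the pattern the abstract describes for stronger models. Resampling examples as pairs keeps each example's before and after outcomes together, which is what a paired protocol needs.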

