Kaustubh S. Bukkapatnam

2026

TabFaith: Benchmarking and Improving Structural Faithfulness in LLM Table Summarization
Kaustubh S. Bukkapatnam | Sohum Mehta
Proceedings of the First Workshop on Structured Understanding, Retrieval, and Generation in the LLM Era (SURGeLLM 2026)

When large language models (LLMs) summarize tabular data, they produce fluent but systematically unfaithful text: they hallucinate numerical values, misattribute entities to rows or columns, fabricate comparative rankings, and conflate temporal references. Existing faithfulness metrics (BLEU, PARENT, BERTScore) are poorly correlated with human judgments of structural faithfulness (r ≤ 0.60) because they are agnostic to the table’s schema and cell structure. We introduce TabFaith, a benchmark of 2,400 (table, summary, error annotation) triples across five structural error types, built from ToTTo and a new enterprise table summarization dataset (TabSum-Ent) covering financial reports, clinical notes, and operational dashboards. We further propose STAF (Structural Table-Aware Faithfulness), a reference-free metric that decomposes faithfulness verification into cell-level claim alignment using natural language inference over table cells. STAF achieves r = 0.94 with human faithfulness judgments, a +0.34 improvement over PARENT (r = 0.60) and +0.70 over BLEU (r = 0.24). Guided by STAF’s fine-grained signal, we design CAVE (Cell-Anchored Verification and Editing), a training-free post-processing method that identifies unfaithful claims, traces them to specific table cells, and re-generates the offending spans. CAVE improves STAF scores by +0.14 on average across five LLMs on both ToTTo and TabSum-Ent, with the largest gains for numerical errors (+0.17), the dominant error type for smaller models.

SchemaScope: How Join-Hop Depth Breaks Text-to-SQL in Large Language Models, and a Decomposition-Based Remedy
Kaustubh S. Bukkapatnam | Rayan Malik
Proceedings of the First Workshop on Structured Understanding, Retrieval, and Generation in the LLM Era (SURGeLLM 2026)

Large language models (LLMs) achieve impressive accuracy on standard Text-to-SQL benchmarks such as Spider and BIRD, yet enterprise databases, with hundreds of tables and complex foreign key graphs, remain a practical bottleneck. We hypothesize that a single, measurable property drives most of this gap: the join-hop depth (h) of the query, defined as the number of foreign key edges that must be traversed to gather all required columns. We introduce the Join-Hop Depth (JHD) benchmark, 410 human-annotated questions stratified by h ∈ {1, …, 6} over 12 enterprise-scale schemas. Experiments on five frontier LLMs confirm a sharp accuracy cliff: all models exceed 80% at h = 1 but fall below 40% at h = 4 and below 25% at h = 6, the typical depth of real enterprise analytics queries. To address this, we propose SchemaScope, a decomposition framework that partitions deep queries into a sequence of sub-queries with h ≤ 2, executes them independently, and merges the results. SchemaScope raises execution accuracy from 46.8% to 67.3% on JHD (GPT-4o, h ≥ 3) and improves execution accuracy by +9.3 percentage points on the BIRD development set. Error analysis shows that decomposition eliminates wrong join path errors, the dominant failure mode at high h, and shifts the residual error budget toward condition and aggregation mistakes that are amenable to existing post-processing methods.
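
The join-hop depth defined above can be illustrated with a minimal sketch. It covers only the simplified two-table case, taking h as the fewest foreign key edges linking the two tables. The schema, the table names, and the `join_hops` helper are all hypothetical and are not taken from the JHD benchmark; a query that needs columns from more than two tables would require a more general connecting-subgraph computation.

```python
from collections import deque


def join_hops(fk_edges, start, goal):
    """Return the fewest foreign-key edges linking two tables, or None.

    Foreign keys are treated as undirected edges, since a join can
    follow a key in either direction.
    """
    graph = {}
    for a, b in fk_edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        table, dist = queue.popleft()
        if table == goal:
            return dist
        for nxt in graph.get(table, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two tables are not connected by foreign keys


# Hypothetical toy schema: each pair is one foreign-key relationship.
FK_EDGES = [
    ("orders", "customers"),
    ("orders", "order_items"),
    ("order_items", "products"),
    ("products", "suppliers"),
]

# A question relating customers to suppliers must traverse four edges.
print(join_hops(FK_EDGES, "customers", "suppliers"))  # 4
```

Under this simplification, a question about customers and their suppliers sits at h = 4, in the range where the abstract reports the sharp drop in accuracy.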

HalluTrace: Causal Attribution and Source-Targeted Decoding for Hallucination in Large Vision-Language Models
Kaustubh S. Bukkapatnam
Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)

Object hallucination in large vision-language models (LVLMs) is well documented, but the mechanisms that produce it remain poorly understood. We introduce HalluTrace, a causal attribution framework that decomposes hallucination into three distinct sources: (VGF) visual grounding failure, where the visual encoder produces a representation insufficient to identify the target object; (LPD) language prior dominance, where the language model overrides a correct visual signal with a statistically driven prediction; and (CMC) cross-modal conflict, where visual and linguistic signals are irreconcilably inconsistent and the model resolves the conflict incorrectly. We operationalise these sources via causal component ablations: intervening on f_vis, f_proj, and f_LM independently and measuring the change in CHAIR score. Experiments on five LVLMs show that attribution patterns are object-category-specific and model-consistent: person/vehicle hallucinations are predominantly LPD (≥52%), food/furniture hallucinations are predominantly VGF (≥44%), and animal hallucinations split between VGF and CMC. Guided by these attributions, we design HAD (Hallucination-Aware Decoding), a unified decoding strategy that applies source-targeted interventions: visual signal amplification for VGF, language prior suppression for LPD, and contrastive re-weighting for CMC. HAD reduces CHAIR_I by 3.7–5.6 points and improves POPE F1 by 1.9–3.1 points over LLaVA-1.5, outperforming VCD and ICD on all three benchmarks (CHAIR, POPE, MME) without any additional training. We further prove that the attribution-decoding correspondence is tight: the CHAIR improvement from HAD is linearly predictable from the VGF attribution share (r = 0.86, p < 10^-6), validating the causal framework.

The Compositional Grounding Gap: Why Vision-Language Models Fail at Relational Reasoning and How to Fix It
Kaustubh S. Bukkapatnam
Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)

Large vision-language models (LVLMs) achieve strong performance on many multimodal tasks, yet consistently fail at compositional relational reasoning, such as distinguishing "the cat on the mat" from "the mat on the cat." We provide a formal explanation for this failure. We prove that any vision-language alignment operating on pooled (order-invariant) visual features contains compositional blind spots: semantically distinct scenes that map to identical representations. We show that the number of blind spots grows factorially with scene complexity, establishing a fundamental limit on pooled-feature architectures. Motivated by this analysis, we propose REGROUND, a training-free, test-time method that re-introduces spatial structure into alignment by performing relation-guided cross-attention over spatial visual tokens, directed by a lightweight parse of the text query. Without any fine-tuning, REGROUND improves compositional accuracy by +8.6 points on Winoground, +8.4 on ARO-Relation, +6.4 on SugarCrepe, and +8.4 on VSR when applied to LLaVA-1.5, and provides consistent gains across other LVLMs. Ablation studies confirm that each component (parse guidance, token-level attention, and relation masking) contributes significantly.
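
The blind-spot idea can be made concrete with a toy sketch. It is an analogy under stated assumptions, not the paper's construction: it uses hand-picked two-dimensional token vectors (the `TOY_EMBEDDINGS` table and the `mean_pool` helper are hypothetical) in place of visual features, and shows that mean pooling, being order-invariant, assigns the two example phrases the same representation even though their ordered sequences differ.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors, discarding their order."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical toy embeddings, chosen to be exactly representable as floats.
TOY_EMBEDDINGS = {
    "the": [0.5, 0.0],
    "cat": [1.0, 0.0],
    "on": [0.25, 0.25],
    "mat": [0.0, 1.0],
}

scene_a = "the cat on the mat".split()
scene_b = "the mat on the cat".split()

seq_a = [TOY_EMBEDDINGS[token] for token in scene_a]
seq_b = [TOY_EMBEDDINGS[token] for token in scene_b]

# The ordered sequences differ, so an order-aware model could separate them.
print(seq_a != seq_b)  # True

# After pooling, the two distinct scenes collapse to a single point.
print(mean_pool(seq_a) == mean_pool(seq_b))  # True
```

Any permutation of the same elements pools to the same vector, which gives some intuition for why the number of indistinguishable arrangements can grow quickly with scene complexity.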

Co-authors

Rayan Malik 1
Sohum Mehta 1
