Workshop on Structured Understanding, Retrieval, and Generation in the LLM Era (2026)
Proceedings of the First Workshop on Structured Understanding, Retrieval, and Generation in the LLM Era (SURGeLLM 2026)
Vivek Gupta | Kaize Ding | Harsha Kokel | Yue Zhao | Amit Agarwal | Yu Wang | Michael Glass | Yu Zhang | Kavitha Srinivas | Xiusi Chen | Oktie Hassanzadeh | Qi Zhu | Shuaichen Chang | Yuan Luo
UNJOIN: Enhancing Multi-Table Text-to-SQL Generation via Schema Simplification
Poojah Ganesan | Rajat Aayush Jha | Dan Roth | Vivek Gupta
Recent advances in large language models (LLMs) have greatly improved Text-to-SQL performance for single-table queries. However, the task remains challenging in multi-table databases due to complex schemas and relational operations. Existing methods often struggle with retrieving the right tables and columns, generating accurate JOINs and UNIONs, and generalizing across diverse schemas. To address these issues, we introduce UNJOIN, a two-stage framework that decouples the retrieval of schema elements from SQL logic generation. In the first stage, we merge the column names of all tables in the database into a single-table representation by prefixing each column with its table name. This allows the model to focus purely on accurate retrieval without being distracted by the need to write complex SQL logic. In the second stage, the SQL query is generated on this simplified schema and mapped back to the original schema by reconstructing JOINs, UNIONs, and relational logic. Evaluations on the SPIDER and BIRD datasets show that UNJOIN matches or exceeds state-of-the-art baselines. UNJOIN uses only schema information and requires neither data access nor fine-tuning, making it scalable and adaptable across databases. Our code is available at: https://github.com/coral-lab-asu/unjoin
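The first-stage merge described in this abstract can be illustrated with a minimal sketch. It is not taken from the UNJOIN repository: the function names `flatten_schema` and `tables_used`, the `.` separator, and the toy schema are all assumptions made for illustration.

```python
from typing import Dict, List


def flatten_schema(schema: Dict[str, List[str]], sep: str = ".") -> List[str]:
    """Merge all tables into one flat column list, prefixing each column
    with its table name (hypothetical helper, not the authors' code)."""
    return [
        f"{table}{sep}{column}"
        for table, columns in schema.items()
        for column in columns
    ]


def tables_used(flat_columns: List[str], sep: str = ".") -> List[str]:
    """Recover the source tables from prefixed column names, in order of
    first appearance. A second stage would need this to rebuild JOINs."""
    seen: List[str] = []
    for name in flat_columns:
        table = name.split(sep, 1)[0]
        if table not in seen:
            seen.append(table)
    return seen


# Toy schema, invented for this example.
schema = {"singer": ["id", "name"], "concert": ["id", "singer_id", "year"]}
flat = flatten_schema(schema)
print(flat)
print(tables_used(["concert.year", "singer.name"]))
```

Because every flattened name keeps its table prefix, the set of tables a simplified query touches can be read directly off the columns it selects. How the original JOIN conditions are then reconstructed is not specified in the abstract and is not attempted here.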
The Mighty ToRR: A Benchmark for Table Reasoning and Robustness in LLMs
Shir Ashury-Tahan | Yifan Mai | Rajmohan C | Ariel Gera | Yotam Perlitz | Asaf Yehudai | Elron Bandel | Leshem Choshen | Eyal Shnarch | Percy Liang | Michal Shmueli-Scheuer
Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of table reasoning capabilities across varied domains. ToRR goes beyond model performance rankings, and is designed to reflect whether models can handle tabular data consistently and robustly, across a variety of common table representation formats. We present a leaderboard as well as comprehensive analyses of the results of leading models over ToRR. Our results reveal a striking pattern of brittle model behavior, where even strong models are unable to perform robustly on tabular data tasks. We further find that no single table format consistently yields superior performance. However, evaluating models across multiple formats is essential for a reliable assessment of their capabilities. Moreover, we show that the reliability boost from testing multiple prompts can be equivalent to adding more test examples. Overall, our findings show that reasoning over table tasks remains a significant challenge. The leaderboard, data and code are publicly available.
Ontology-Free General-Domain Knowledge Graph-to-Text Generation Dataset Synthesis using Large Language Model
Daehui Kim | Deokhyung Kang | Sangwon Ryu | Gary Lee
Knowledge Graph-to-Text (G2T) generation involves verbalizing structured knowledge graphs into natural language text. Recent advancements in Pretrained Language Models (PLMs) have improved G2T performance, but their effectiveness relies on datasets with precise graph-text alignment. However, the scarcity of high-quality, general-domain G2T generation datasets restricts progress in general-domain G2T generation research. To address this issue, we introduce the Wikipedia Ontology-Free Graph-text dataset (WikiOFGraph), a new large-scale G2T dataset generated using a novel method that leverages Large Language Models (LLMs) and Data-QuestEval. Our dataset, which contains 5.85M general-domain graph-text pairs, offers high graph-text consistency without reliance on external ontologies. Experimental results demonstrate that PLMs fine-tuned on WikiOFGraph outperform those trained on other datasets across various evaluation metrics. Our method proves to be a scalable and effective solution for generating high-quality G2T data, significantly advancing the field of G2T generation.
Output-Space Search: Targeting LLM Generations in a Frozen Encoder-Defined Output Space
Tobias Materzok
We introduce Output-Space Search (OS-Search), which turns LLM generation into endpoint search. An outer loop selects a target z* in a frozen encoder-defined 3D output space Z, and a retrieval-grounded policy trained with sequence-level RL generates outputs whose coordinates land near z* under standard autoregressive decoding. This enables parallel sweeps and black-box optimization in Z without path-dependent token/program search. On stories, sweeping Z (text) yields 3.1x higher LLM-scored diversity than prompt-chaining. On code, Bayesian optimization over Z (code) improves an objective withheld from the controller under matched inference budgets while preserving validity.
TreeDiff: AST-Guided Code Generation with Diffusion LLMs
Yiming Zeng | Jinghan Cao | Zexin Li | Yiming Chen | Tao Ren | Zhuochun Li | Dawei Xiang | Xidong Wu | Shangqian Gao | Tingting Yu
Code generation is increasingly critical for real-world applications. Still, diffusion-based large language models continue to struggle with this demand. Unlike free-form text, code requires syntactic precision; even minor structural inconsistencies can render a program non-executable. Existing diffusion-based large language models rely on random token masking for corruption, leading to two key failures: they lack awareness of syntactic boundaries during the iterative denoising process, and they fail to capture the long-range hierarchical dependencies essential for program correctness. We propose TreeDiff to address both issues. Specifically, TreeDiff is a syntax-aware diffusion framework that incorporates structural priors from Abstract Syntax Trees (ASTs) into the corruption process. Instead of masking individual tokens at random, we selectively mask tokens belonging to key AST nodes. By aligning the corruption process with the underlying structure of code, our method encourages the model to internalize the compositional nature of programming languages, enabling it to reconstruct programs that respect grammatical boundaries and capture long-range dependencies. Our method achieves a 13.3% relative improvement over the random masking training method, demonstrating its effectiveness in code generation tasks by leveraging underlying structures.
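To make the contrast with random token masking concrete, the sketch below picks mask positions from AST node spans using Python's standard `ast` module. It is only an illustration of the general idea: the function `mask_ast_nodes`, the `<mask>` placeholder, and the choice of node types are assumptions, and the abstract does not say which nodes TreeDiff treats as "key" or how masking interacts with the model's tokenizer.

```python
import ast
from typing import List, Tuple, Type


def mask_ast_nodes(
    source: str,
    node_types: Tuple[Type[ast.AST], ...],
    mask: str = "<mask>",
) -> str:
    """Replace the source span of every AST node of the given types.

    Assumptions of this sketch: ASCII source (column offsets are byte
    offsets), single-line nodes only, and node types that do not nest
    inside one another (e.g. ast.Constant or ast.Name).
    """
    lines = source.splitlines()
    spans: List[Tuple[int, int, int]] = []  # (line index, start col, end col)
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, node_types)
            and hasattr(node, "end_col_offset")
            and node.lineno == node.end_lineno
        ):
            spans.append((node.lineno - 1, node.col_offset, node.end_col_offset))
    # Replace right-to-left so earlier column offsets stay valid.
    for line_no, start, end in sorted(set(spans), reverse=True):
        line = lines[line_no]
        lines[line_no] = line[:start] + mask + line[end:]
    return "\n".join(lines)


program = "x = 1 + 2\ny = x * 10"
print(mask_ast_nodes(program, (ast.Constant,)))
print(mask_ast_nodes(program, (ast.Name,)))
```

Each masked span here lines up with a syntactic unit (a literal or an identifier), which is the property random masking does not guarantee.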
Chart-RL: Generalized Chart Comprehension via Reinforcement Learning with Verifiable Rewards
Xin Zhang | Xingyu Li | Rongguang Wang | Ruizhong Miao | Zheng Wang | Yuying Wang | Dan Roth | Chenyang Li
Accurate chart comprehension represents a critical challenge in advancing multimodal learning systems, as extensive information is compressed into structured visual representations. However, existing vision-language models (VLMs) frequently struggle to generalize to unseen charts because doing so requires abstract, symbolic, and quantitative reasoning over structured visual representations. In this work, we introduce Chart-RL, an effective reinforcement learning (RL) method that employs mathematically verifiable rewards to enhance chart question answering in VLMs. Our experiments demonstrate that Chart-RL consistently outperforms supervised fine-tuning (SFT) across different chart understanding benchmarks, achieving relative improvements of 16.7% on MultiChartQA and 11.5% on ChartInsights. We conduct a robustness analysis, in which Chart-RL achieves enhanced performance in 18 of 25 perturbed chart categories, demonstrating strong consistency and reasoning capability across visual variations. Furthermore, we demonstrate that task difficulty and inherent complexity are more critical than data quantity in RL training. For instance, Chart-RL trained on merely 10 complex chart-query examples significantly outperforms models trained on over 6,000 simple examples. Additionally, training on challenging reasoning tasks not only improves in-domain generalization relative to simpler tasks, but also facilitates strong transfer to out-of-domain visual mathematical problems.
RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners
Jugal Gajjar | Kamalasankari Subramaniakuppusamy
When a language model answers a table question, users have no way to verify which cells informed which reasoning steps. We introduce RSAT, a method that trains small language models (SLMs, 1–8B) to produce step-by-step reasoning with cell-level citations grounded in table evidence. Phase 1 (SFT) teaches a structured JSON output format from verified reasoning traces. Phase 2 (GRPO) optimizes a composite reward centered on NLI-based faithfulness, alongside citation validity and parsimony. Across six models from two families—Qwen2.5 (1.5B/3B/7B) and Llama3 (1B/3B/8B)—RSAT improves faithfulness 3.7× over SFT alone (0.224→0.826), with near-perfect citation validity (0.992). Post-hoc attribution collapses below 13% format success, confirming that attribution must be integrated into reasoning, not retrofitted. Ablations show the faithfulness reward is essential: removing it drops faithfulness from 0.97 to 0.03.
Framework of Thoughts: A Foundation Framework for Dynamic and Optimized Reasoning based on Chains, Trees, and Graphs
Felix Fricke | Simon Malberg | Georg Groh
Prompting schemes such as Chain of Thought, Tree of Thoughts, and Graph of Thoughts can significantly enhance the reasoning capabilities of large language models. However, most existing schemes require users to define static, problem-specific reasoning structures that lack adaptability to dynamic or unseen problem types. Additionally, these schemes are often under-optimized in terms of hyperparameters, prompts, runtime, and prompting cost. To address these limitations, we introduce Framework of Thoughts (FoT) – a general-purpose foundation framework for implementing and optimizing dynamic reasoning schemes. FoT comes with built-in features for hyperparameter tuning, prompt optimization, parallel execution, and intelligent caching, unlocking the latent performance potential of reasoning schemes. We demonstrate FoT’s capabilities by implementing three popular schemes – Tree of Thoughts, Graph of Thoughts, and ProbTree – within FoT. We empirically show that FoT enables significantly faster execution, reduces costs, and achieves better task scores through optimization. We release our codebase to facilitate the development of future dynamic and efficient reasoning schemes.
TabGuard: Agentic LLM Orchestration for Adaptive Tabular Anomaly Detection via Dynamic Validator Selection and Generation
Srihari Unnikrishnan | Minghua Ma
Tabular anomaly detection is challenging because real-world tables contain heterogeneous columns, ranging from structured identifiers to free-form text. Existing methods face a fundamental trilemma: rule-based systems require extensive manual configuration and fail on novel schemas; statistical methods scale efficiently but miss semantic errors; and LLM-based approaches understand semantics but incur prohibitive per-cell inference costs. No prior method simultaneously addresses semantic heterogeneity, domain-specific validation rules, and enterprise-scale processing. We introduce TabGuard, an agentic framework that resolves this trilemma through semantic routing. Using LLM function calling, the system analyzes a small sample of each column and dynamically selects the most effective validation strategy, routing to a regex-based validator for syntactic patterns, a code-generation validator for domain-specific rules (such as Luhn checksums for credit cards), or an embedding-based validator for distributional outliers. This architecture decouples expensive cognitive reasoning (O(m) LLM calls for m columns) from scalable programmatic execution, enabling deployment on enterprise datasets without per-cell inference.
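The Luhn checksum mentioned above is a good example of a domain-specific rule that is cheap to run per cell once it exists as code. The sketch below implements the standard Luhn algorithm. The function name `luhn_valid` and the decision to skip non-digit characters such as spaces are assumptions of this example, not details of TabGuard's generated validators.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Non-digit characters (spaces, hyphens) are skipped; that is a choice
    made for this sketch rather than part of the algorithm itself.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


# Applying the validator to a column is plain programmatic execution,
# with no per-cell model call.
column = ["4111 1111 1111 1111", "79927398713", "79927398710"]
print([luhn_valid(cell) for cell in column])
```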
StructSurvey: Structured Agentic Retrieval for Automated Survey Paper Generation
Paolo Pedinotti | Enrico Santus
The rapid growth of scientific publications makes it increasingly difficult to track and synthesize research progress. While Large Language Models (LLMs) can support automated survey generation, existing methods retrieve unstructured data and require models to infer conceptual, methodological, and taxonomic relations from raw text at generation time. We introduce STRUCTSURVEY, a hierarchical multi-agent framework that shifts structural reasoning from generation to retrieval by dynamically constructing graph-based representations of entities, relations, and topical taxonomies. We evaluate STRUCTSURVEY on a new reference-grounded benchmark of ACL survey papers for reproducible long-form scientific summarization. Compared with embedding-only retrieval baselines, STRUCTSURVEY improves ROUGE-1 recall by +2.9 and ROUGE-2 recall by +1.0 on average, without reducing precision. It also improves LLM-as-a-Judge ratings for logical structure, depth, and synthesis, showing that explicit structural retrieval yields surveys closer to human-written organization and reasoning.
Modeling LLM Agent Reviewer Dynamics in Elo-Ranked Review System
Hsiang-Wei Huang | Junbin Lu | Kuang-Ming Chen | Jianxu Shangguan | Jenq-Neng Hwang
In this work, we explore the dynamics of Large Language Model (LLM) agent reviewers in an Elo-ranked review system using real-world conference paper submissions. Multiple LLM agent reviewers with different personas engage in multi-round review interactions moderated by an Area Chair. We compare a baseline setting with conditions that incorporate Elo ratings and reviewer memory. Our simulation results reveal several findings, including that incorporating Elo improves Area Chair decision accuracy and that reviewers adopt adaptive review strategies that exploit our Elo system without improving review effort. These findings show how the Elo system affects peer review and offer insights for improving AI conference evaluation. Our code is available at https://github.com/hsiangwei0903/EloReview.
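For readers unfamiliar with Elo ratings, the sketch below shows the standard pairwise Elo update. The abstract does not say how reviewer outcomes are scored or which constants the system uses, so the K-factor of 32, the 400-point scale, and the function names are assumptions borrowed from the conventional chess formulation.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated ratings after one comparison.

    score_a is 1.0 if A is judged better, 0.0 if B is, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two equally rated reviewers; the first is judged to have written the
# better review.
new_a, new_b = update_elo(1500.0, 1500.0, 1.0)
print(new_a, new_b)
```

The update is zero-sum: whatever one side gains the other loses, so a reviewer can climb only by being judged better than peers in comparisons.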
DSMentor: Curriculum-Guided Inference with Online Memory for Data-Science LLM Agents
He Wang | Alexander Hanbo Li | Yiqun Hu | Sheng Zhang | Hideo Kobayashi | Jiani Zhang | Henghui Zhu | Chung-Wei Hang | Patrick Ng
Large language model (LLM) agents have shown strong capabilities in generating code to solve complex data science problems, yet they often overlook the impact of task order during inference. We present DSMentor, an inference-time optimization framework that applies curriculum learning—progressing from easier to harder tasks—to enhance LLM performance on challenging data science tasks. Guided by a mentor and supported by a growing long-term memory, DSMentor organizes problems by difficulty, retains prior experiences, and leverages them to guide subsequent reasoning. Extensive experiments on DSEval and QRData benchmarks show that DSMentor with Claude-3.5-Sonnet improves pass rates by up to 5.2% over baseline agents and achieves an 8.8% gain over GPT-4 with Program-of-Thoughts prompting. These results highlight the effectiveness of curriculum-based inference strategies in advancing LLM agents.
Asking language models how to represent data for fine-tuning
Usneek Singh | Ananya Singha | Abhijeet Awasthi | Sumit Gulwani | Aditya Kanade | Vu Le | Mukul Singh | Gust Verbruggen
Language models are often used for tasks involving structured data like tables and graphs, but there is no principled approach for choosing the best format to represent such data for fine-tuning. We address this in three steps. First, we show that format choice remains important even after fine-tuning; models learn more efficiently with specific formats rather than adapting to any format. Second, we show that a pre-trained model can suggest its own candidate formats by auto-completing partial prompts, reducing reliance on developer intuition. Third, and most importantly, we demonstrate that base model performance across formats reliably predicts post-fine-tuning performance: the format that performs best before fine-tuning remains among the top candidates after fine-tuning in 16 out of 18 settings across three data structure types, three models, and six tasks. This finding allows format selection to be done via inference alone, avoiding costly trial-and-error fine-tuning runs.
TabBridge: Bridging Structure and Context for Accurate Table Reasoning
Jeongwoo Lee | Eunsoo Lee | Jihie Kim
Table reasoning remains challenging for Large Language Models (LLMs) as it requires integrating structured tabular information with natural language questions. Previous SQL-based approaches rely on surface-level alignment between question keywords and column headers, often generating queries with spurious or missing column mappings. We introduce TabBridge, a framework that incorporates both structural and contextual information for accurate table reasoning. TabBridge first generates a unified textual representation called Table Specification (TabSpec), preserving the structural information through row and column analysis. In order to ensure accuracy and consistency, we also employ a reconstruction-based evaluation mechanism to verify and refine the generated TabSpec. TabSpec is subsequently used to generate SQL aligned with the contextual intent of the question, enabling accurate interpretation of column semantics that are often overlooked by previous approaches. Across three public benchmarks, TabBridge shows consistent improvements over previous SQL-based methods, achieving 73.94% accuracy on WikiTableQuestions (+5.3 pp over the previous state of the art). TabBridge also demonstrates robust performance across diverse LLM backbones, confirming its generalizability across model architectures. Our code is available at https://github.com/raylee0519/TabBridge.
Multi-step reasoning in large language models (LLMs) is typically expressed as unstructured text, making intermediate states difficult to organize, verify, and revise explicitly. This limitation often leads to redundant reasoning paths, error accumulation, and limited controllability in complex tasks. We propose Map-of-Actions (MoA), a neuro-symbolic reasoning framework that treats reasoning as operations over an explicit structured state space. MoA represents intermediate states as a multi-labeled graph, in which each node corresponds to a semantically labeled reasoning unit. This representation provides LLMs with structured memory, explicit state transitions, and flexible interfaces to external tools. Experiments on multiple complex question answering (QA) benchmarks show that MoA consistently outperforms strong baselines, improving accuracy by up to 17.9 percentage points.
Routing End User Queries to Enterprise Databases
Saikrishna Sudarshan | Tanay Kulkarni | Manasi Patwardhan | Lovekesh Vig | Ashwin Srinivasan | Tanmay Tulsidas Verlekar
We address the task of routing natural language queries in multi-database enterprise environments. We construct realistic benchmarks by extending existing NL-to-SQL datasets. Our study shows that routing becomes increasingly challenging with larger, domain-overlapping DB repositories and ambiguous queries, motivating the need for more structured and robust reasoning-based solutions. By explicitly modelling schema coverage, structural connectivity, and fine-grained semantic alignment, the proposed modular, reasoning-driven re-ranking strategy consistently outperforms embedding-only and direct LLM-prompting baselines across all metrics.
SchemaScope: How Join-Hop Depth Breaks Text-to-SQL in Large Language Models, and a Decomposition-Based Remedy
Kaustubh S. Bukkapatnam | Rayan Malik
Large language models (LLMs) achieve impressive accuracy on standard Text-to-SQL benchmarks such as Spider and BIRD, yet enterprise databases, with hundreds of tables and complex foreign key graphs, remain a practical bottleneck. We hypothesize that a single, measurable property drives most of this gap: the join-hop depth (h) of the query, defined as the number of foreign key edges that must be traversed to gather all required columns. We introduce the Join-Hop Depth (JHD) benchmark, 410 human-annotated questions stratified by h ∈ {1, …, 6} over 12 enterprise-scale schemas. Experiments on five frontier LLMs confirm a sharp accuracy cliff: all models exceed 80% at h = 1 but fall below 40% at h = 4 and below 25% at h = 6, the typical depth of real enterprise analytics queries. To address this, we propose SchemaScope, a decomposition framework that partitions deep queries into a sequence of sub-queries with h ≤ 2, executes them independently, and merges the results. SchemaScope raises execution accuracy from 46.8% to 67.3% on JHD (GPT-4o, h ≥ 3) and improves execution accuracy by +9.3 percentage points on the BIRD development set. Error analysis shows that decomposition eliminates wrong join path errors, the dominant failure mode at high h, and shifts the residual error budget toward condition and aggregation mistakes that are amenable to existing post-processing methods.
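The join-hop depth defined above can be pictured as a distance in the foreign-key graph. The sketch below covers only the simplest case, a query whose columns live in two tables, where the depth is the shortest-path length between them. The helper names, the undirected treatment of foreign keys, and the toy schema are assumptions. For three or more tables the minimal connecting subgraph is a Steiner-tree problem, which this sketch does not attempt, and the benchmark's own annotation procedure may differ.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def build_fk_graph(foreign_keys: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Undirected adjacency map over tables, one edge per foreign key."""
    graph: Dict[str, Set[str]] = {}
    for child, parent in foreign_keys:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def hop_distance(graph: Dict[str, Set[str]], start: str, goal: str) -> int:
    """Fewest foreign-key edges between two tables (BFS); -1 if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        table, dist = queue.popleft()
        for neighbour in graph.get(table, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return -1


# Toy schema, invented for this example.
fk_graph = build_fk_graph([
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
])
print(hop_distance(fk_graph, "customers", "products"))
```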
Generalization in Graph Reasoning: A Systematic Comparison of LLM Training Approaches
Sola Shirai | Kavitha Srinivas | Julian Dolby | Michael Katz | Shirin Sohrabi | Horst Samulowitz
For large language models (LLMs), reasoning over graphs can help solve many problems. Prior work has tried to improve LLM graph reasoning through different training methods, but the merits of such approaches remain unclear and the limitations of each approach with respect to generalizability of reasoning are often not thoroughly explored. In this paper we systematically compare the ability of LLMs to learn fundamental graph tasks across a variety of training methods and their ability to generalize out of distribution across various dimensions. We highlight key tradeoffs between training methods, e.g., training specialized graph encoders and fusing their embeddings with LLMs consistently collapses in terms of generalizability; however, no single method shows clear superiority across all dimensions of generalizability, regardless of the size of the model.
Self-correction—the ability of LLMs to detect and fix their own errors—has been studied extensively for mathematical and code reasoning, with limited prior work on table reasoning (primarily multi-agent pipelines such as Table-Critic, ACL 2025, rather than single-model structured prompting). Tables present unique challenges: errors arise from wrong cell retrieval, incorrect computation, flawed logic, and hallucination of values not present in the data. We conduct the first cross-provider single-model self-correction analysis for table reasoning across five providers (Google, Moonshot AI, Zhipu, Alibaba, MiniMax), testing five models (Gemini 3.1 Pro, Kimi K2.5, GLM 5, Qwen 3.5+, MiniMax M2.5) on WikiTableQuestions and TabFact with a multi-seed paired protocol. We propose Structured Self-Correction (SSC), a table-specific verification chain that guides models through cell verification, computation checking, logic validation, and completeness assessment. We confirm that the Accuracy-Correction Paradox (terminology from Li 2025) previously observed in math extends to tables: models with base accuracy in the mid-60s–mid-70s region benefit modestly from self-correction (multi-seed mean SCG up to +1.3% with within-seed point estimates as high as +3.4%), while stronger models above this region are systematically harmed by over-correction (multi-seed mean SCG down to -1.3%, with 95% bootstrap CIs significantly below zero). SSC reduces over-correction rates in 9 of 10 conditions, with reductions of 38–69% on TabFact. An inference-mode-controlled probe shows that SSC’s qualitative direction is robust for Qwen 3.5+ across reasoning-ON and reasoning-OFF settings, while GLM 5 exhibits a substantial mode-dependent shift, indicating that mode robustness itself is model-dependent. Stronger baselines (self-consistency, self-critic, tool-augmented arithmetic verification, majority voting, and a same-family scaling probe) further characterize where SSC helps. 
Ablation studies reveal that answer-aware review is essential, reasoning traces aid error detection, and iterative correction shows diminishing returns. A FinQA domain transfer probe confirms a capability floor: self-correction fails when base task competence is very low (21.5% accuracy). Our primary contribution is empirical: we characterize the conditions under which self-correction helps or harms table reasoning, providing actionable guidance for practitioners.
Mixed-Policy GRPO for Text-to-SQL with Off-Policy Data Generation
Marko Sterbentz | Michael Glass | Nhan H Pham | Dharmashankar Subramanian | Kristian J Hammond
Recent advances in text-to-SQL have shown that methods such as Group Relative Policy Optimization (GRPO) can substantially improve reasoning performance, but these approaches remain inherently on-policy, limiting their ability to incorporate novel reasoning patterns. In this work, we address this limitation by leveraging existing datasets to generate high-quality off-policy rollouts, enabling mixed-policy training that exposes models to diverse and informative reasoning trajectories. We present the first application of mixed-policy GRPO to the text-to-SQL domain and introduce a systematic study of off-policy data generation strategies for this setting, including a novel method, Iterative Error Correction (IEC), which iteratively refines model outputs through targeted feedback. Our experiments show that mixed-policy GRPO outperforms both base models and on-policy GRPO, yielding average improvements of +4.7% over base models and +4.1% over on-policy GRPO across the Spider and BIRD benchmarks. Gains are particularly strong on BIRD, reaching up to +7.3% over base models and +4.5% over on-policy GRPO.
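As background for the GRPO discussion above, the sketch below computes group-relative advantages: each rollout's reward is normalized by the mean and standard deviation of its own group. Placing off-policy rollouts in the same group as on-policy ones is one plausible reading of "mixed-policy" and is an assumption here, as are the function name, the use of the population standard deviation, and the toy rewards. The paper's actual objective, including any importance weighting for off-policy samples, is not described in the abstract.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by its group's mean and standard deviation.

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group for one question: three sampled (on-policy) SQL queries, of
# which one executes correctly, plus one externally generated
# (off-policy) correct query.
on_policy_rewards = [0.0, 0.0, 1.0]
off_policy_rewards = [1.0]
advantages = group_relative_advantages(on_policy_rewards + off_policy_rewards)
print(advantages)
```

If every rollout in a group receives the same reward, all advantages are zero and the group contributes no learning signal. Adding a rollout with a different outcome, as in the toy group, restores a non-zero signal.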
TabFaith: Benchmarking and Improving Structural Faithfulness in LLM Table Summarization
Kaustubh S. Bukkapatnam | Sohum Mehta
When large language models (LLMs) summarize tabular data, they produce fluent but systematically unfaithful text: hallucinating numerical values, misattributing entities to rows or columns, fabricating comparative rankings, and conflating temporal references. Existing faithfulness metrics (BLEU, PARENT, BERTScore) are poorly correlated with human judgments of structural faithfulness (r ≤ 0.60) because they are agnostic to the table’s schema and cell structure. We introduce TABFAITH, a benchmark of 2,400 (table, summary, error annotation) triples across five structural error types, built from ToTTo and a new enterprise table summarization dataset (TabSum-Ent) covering financial reports, clinical notes, and operational dashboards. We further propose STAF (Structural Table-Aware Faithfulness), a reference-free metric that decomposes faithfulness verification into cell-level claim alignment using natural language inference over table cells. STAF achieves r = 0.94 with human faithfulness judgments, a +0.34 improvement over PARENT (r = 0.60) and +0.70 over BLEU (r = 0.24). Guided by STAF’s fine-grained signal, we design CAVE (Cell-Anchored Verification and Editing), a training-free post-processing method that identifies unfaithful claims, traces them to specific table cells, and re-generates the offending spans. CAVE improves STAF scores by +0.14 on average across five LLMs on both ToTTo and TabSum-Ent, with the largest gains for numerical errors (+0.17), the dominant error type for smaller models.
StructHallu-Drift: Benchmarking Structured Hallucinations Under Schema Evolution in LLMs
Mujtaba Hasan
Large Language Models (LLMs) are increasingly used to generate structured outputs (JSON objects, SQL queries, and structured records) from formal schemas. While recent advances in constrained decoding and schema-aware prompting have improved syntactic compliance, the semantic reliability of these outputs remains poorly characterized. We investigate this gap through the lens of schema drift: the inevitable evolution of database schemas in production environments through column renamings, type changes, and constraint modifications. We introduce StructHallu-Drift, a benchmark and evaluation framework for studying structured hallucinations under schema evolution. We contribute: (1) a six-category hallucination taxonomy that disentangles syntactic validity from semantic fidelity; (2) a controlled evaluation suite applying realistic schema mutations at three severity levels to established NL-to-structure datasets; and (3) a systematic evaluation of four LLMs spanning 7B to 70B parameters across three structured output tasks. Experiments on 1,200 schema–model evaluation instances reveal four key findings: (i) 39–54% of structured outputs contain at least one semantic hallucination; (ii) schema drift severity has surprisingly minimal effect on hallucination rates (∼44% across all levels, p = 0.59), suggesting imperfect schema conditioning under our prompting setup; (iii) output format is the dominant factor in generation reliability, with SQL achieving ∼85% semantic validity while schema-grounded record generation drops to 7–24%; (iv) each model exhibits a distinct hallucination fingerprint, implying that mitigation strategies must be model-specific rather than universal. We publicly release our benchmark and evaluation toolkit.
Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness on Tax Law
Parisa Kordjamshidi | Samer Aslan | Madhavan Seshadri | Leslie Barrett | Enrico Santus
Recent advances in large language models (LLMs) have significantly enhanced automated legal reasoning. Yet, it remains unclear whether their performance reflects genuine legal reasoning ability or artifacts of data contamination. We present a comprehensive empirical study of tax law reasoning approaches and implement a contamination detection protocol to rigorously assess LLM reliability. We show that performance can be inflated by contamination. Building on this analysis, we conduct a systematic evaluation, comparing monolithic LLMs with hybrid systems that translate statutory text into formal representations and delegate inference to symbolic solvers. We build a novel test suite designed to probe generalization to unseen documents via case and rule variations. Our findings indicate that legal reasoning is inherently compositional and that neuro-symbolic frameworks offer a more reliable and robust foundation for legal AI, as well as improved generalization to unobserved situations.
More Than Efficiency: Embedding Compression Improves Domain Adaptation in Dense Retrieval
Chunsheng Zuo | Daniel Khashabi
Dense retrievers powered by pretrained embeddings are widely used for document retrieval but struggle in specialized domains due to the mismatches between the training and target domain distributions. Domain adaptation typically requires costly annotation and retraining of query-document pairs. In this work, we revisit an overlooked alternative: applying PCA to domain embeddings to derive lower-dimensional representations that preserve domain-relevant features while discarding non-discriminative components. Though traditionally used for efficiency, we demonstrate that this simple embedding compression can effectively improve retrieval performance. Evaluated across 9 retrievers and 14 MTEB datasets, PCA applied solely to query embeddings improves NDCG@10 in 75.4% of model-dataset pairs, offering a simple and lightweight method for domain adaptation.