Rayan Malik


2026

Large language models (LLMs) achieve impressive accuracy on standard Text-to-SQL benchmarks such as Spider and BIRD, yet enterprise databases, with hundreds of tables and complex foreign-key graphs, remain a practical bottleneck. We hypothesize that a single, measurable property drives most of this gap: the join-hop depth (h) of a query, defined as the number of foreign-key edges that must be traversed to gather all required columns. We introduce the Join-Hop Depth (JHD) benchmark, comprising 410 human-annotated questions stratified by h ∈ {1, …, 6} over 12 enterprise-scale schemas. Experiments on five frontier LLMs confirm a sharp accuracy cliff: all models exceed 80% at h = 1 but fall below 40% at h = 4 and below 25% at h = 6, the typical depth of real enterprise analytics queries. To address this, we propose SchemaScope, a decomposition framework that partitions a deep query into a sequence of sub-queries with h ≤ 2, executes them independently, and merges the results. SchemaScope raises execution accuracy from 46.8% to 67.3% on JHD (GPT-4o, h ≥ 3) and improves execution accuracy by 9.3 percentage points on the BIRD development set. Error analysis shows that decomposition eliminates wrong-join-path errors, the dominant failure mode at high h, and shifts the residual error budget toward condition and aggregation mistakes, which are amenable to existing post-processing methods.
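To make the definition of join-hop depth concrete, the following is a minimal sketch of how h could be computed over a schema's foreign-key graph. It is an illustration under stated assumptions, not the benchmark's annotation procedure: the function name `join_hop_depth`, the toy schema, and the greedy nearest-table strategy are all hypothetical. The greedy strategy gives an upper bound on the minimum number of edges and is exact when the required tables lie along a single chain.

```python
from collections import deque


def join_hop_depth(fk_edges, required_tables):
    """Approximate the number of foreign-key edges needed to connect
    all required tables (greedy upper bound; hypothetical helper)."""
    # Build an undirected adjacency map from the foreign-key edges.
    adj = {}
    for a, b in fk_edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    required = list(dict.fromkeys(required_tables))  # de-duplicate, keep order
    if not required:
        return 0

    connected = {required[0]}      # tables already in the join tree
    remaining = set(required[1:])  # required tables not yet reached
    hops = 0
    while remaining:
        # Multi-source BFS from the current join tree to the nearest
        # required table that is still unreached.
        parent = {t: None for t in connected}
        queue = deque(connected)
        found = None
        while queue:
            node = queue.popleft()
            if node in remaining:
                found = node
                break
            for nxt in adj.get(node, ()):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        if found is None:
            raise ValueError("required tables are not connected by foreign keys")
        # Walk back to the join tree, counting one hop per new edge.
        node = found
        while parent[node] is not None:
            connected.add(node)
            hops += 1
            node = parent[node]
        remaining -= connected
    return hops


# Toy schema (assumed for illustration): a chain of five tables.
FK_EDGES = [
    ("customers", "orders"),
    ("orders", "order_items"),
    ("order_items", "products"),
    ("products", "suppliers"),
]

if __name__ == "__main__":
    # Columns from customers and suppliers require traversing the whole chain.
    print(join_hop_depth(FK_EDGES, ["customers", "suppliers"]))  # 4
```

Under this reading, a question that needs columns from `customers` and `suppliers` in the toy schema has h = 4, while one that needs only `customers` and `orders` has h = 1.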