Jingyu Li

2026

VET: Verifiable Execution Tracing for Reliable Text-to-SQL Generation
Dongyu Wang | Jingyu Li | Lan Zhang | Ganggang.yu | Liang Huang
Findings of the Association for Computational Linguistics: ACL 2026

Large language models (LLMs) have shown remarkable capabilities in text-to-SQL generation, yet existing approaches remain prone to hallucinations and lack verification mechanisms. Current methods such as Chain-of-Thought (CoT) and Program-of-Thought (PoT) typically rely on intermediate reasoning that is either purely textual or executed only as a final step, leaving the reasoning process opaque and prone to grounding and logical hallucinations. In this paper, we introduce Verifiable Execution Tracing (VET), a novel reasoning paradigm that transforms text-to-SQL from unverifiable textual rationales into step-wise executable semantics. VET addresses these limitations by constraining the reasoning process within a candidate schema space and formulating it as a sequence of executable Python steps. Crucially, each step is executed against the real database to produce observable intermediate results, which serve as immediate verification feedback and transform the traditionally opaque generation process into a transparent, debuggable interaction with database reality.Experiments show consistent gains under matched, training-free settings, achieving 70.93% execution accuracy on BIRD and 37.04% on Spider 2.0-lite, with particularly strong improvements on complex queries.

2025

pdf bib abs

MiCEval: Unveiling Multimodal Chain of Thought’s Quality via Image Description and Reasoning Steps
Xiongtao Zhou | Jie He | Lanyu Chen | Jingyu Li | Haojing Chen | Victor Gutierrez Basulto | Jeff Z. Pan | Hanjie Chen
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

**Multimodal Chain of Thought (MCoT)** is a popular prompting strategy for improving the performance of multimodal large language models (MLLMs) across a range of complex reasoning tasks. Despite its popularity, there is a notable absence of automated methods for evaluating the quality of reasoning steps in MCoT. To address this gap, we propose **Multimodal Chain-of-Thought Evaluation (MiCEval)**, a framework designed to assess the correctness of reasoning chains by evaluating the quality of both the description and each reasoning step. The evaluation of the description component focuses on the accuracy of the image descriptions, while the reasoning step evaluates the quality of each step as it is conditionally generated based on the preceding steps. MiCEval is built upon a fine-grained dataset with annotations that rate each step according to correctness, relevance, and informativeness. Extensive experiments on four state-of-the-art MLLMs show that step-wise evaluations using MiCEval align more closely with human judgments compared to existing methods based on cosine similarity or fine-tuning approaches. MiCEval datasets and code can be found at: [https://anonymous_github/MicEval](https://anonymous.4open.science/r/MiCEval-847F/README.md).

Co-authors

Jie He 1

Venues

Findings1
NAACL1

Fix author