Jiannan Cao
2026
ToolGate: Contract-Grounded and Verified Tool Execution for LLMs
Yanming Liu | Xinyue Peng | Jiannan Cao | Xinyi Wang | Songhang Deng | Jintao Chen | Jianwei Yin | Xuhong Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) augmented with external tools have demonstrated remarkable capabilities in complex reasoning tasks. However, existing frameworks rely heavily on natural language reasoning to determine when tools can be invoked and whether their results should be committed, lacking formal guarantees for logical safety and verifiability. We present ToolGate, a forward execution framework that provides logical safety guarantees and verifiable state evolution for LLM tool calling. ToolGate maintains an explicit symbolic state space as a typed key-value mapping representing trusted world information throughout the reasoning process. Each tool is formalized as a Hoare-style contract consisting of a precondition and a postcondition, where the precondition gates tool invocation by checking whether the current state satisfies the required conditions, and the postcondition determines whether the tool’s result can be committed to update the state through runtime verification. Our approach guarantees that the symbolic state evolves only through verified tool executions, preventing invalid or hallucinated results from corrupting the world representation. Experimental validation demonstrates that ToolGate significantly improves the reliability and verifiability of tool-augmented LLM systems while maintaining competitive performance on complex multi-step reasoning tasks. This work establishes a foundation for building more trustworthy and debuggable AI systems that integrate language models with external tools.
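The contract mechanism described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Contract`, `execute`, and the `to_fahrenheit` tool are hypothetical, and the state is assumed to be a plain dictionary standing in for the typed key-value mapping.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

State = Dict[str, Any]  # assumed stand-in for the typed key-value state


@dataclass
class Contract:
    """Hypothetical Hoare-style contract wrapped around one tool."""
    name: str
    pre: Callable[[State], bool]        # gates invocation
    run: Callable[[State], Any]         # the tool itself
    post: Callable[[State, Any], bool]  # verifies the result at runtime
    key: str                            # state key a verified result is written to


def execute(state: State, tool: Contract) -> Tuple[State, str]:
    """Run a tool only if its precondition holds; commit only if its postcondition holds."""
    if not tool.pre(state):
        return state, "blocked"    # precondition failed: tool is never called
    result = tool.run(state)
    if not tool.post(state, result):
        return state, "rejected"   # result failed verification: state unchanged
    new_state = dict(state)        # copy so the old state is never mutated
    new_state[tool.key] = result
    return new_state, "committed"


# Illustrative tool: converts a Celsius reading already present in the state.
to_fahrenheit = Contract(
    name="to_fahrenheit",
    pre=lambda s: isinstance(s.get("celsius"), (int, float)),
    run=lambda s: s["celsius"] * 9 / 5 + 32,
    post=lambda s, r: isinstance(r, float) and r >= -459.67,
    key="fahrenheit",
)

# Same contract, but the tool returns an unverifiable (e.g. hallucinated) value.
broken = Contract("broken", to_fahrenheit.pre, lambda s: "hot",
                  to_fahrenheit.post, "fahrenheit")

print(execute({"celsius": 100}, to_fahrenheit))
print(execute({}, to_fahrenheit))
print(execute({"celsius": 100}, broken))
```

In this sketch the state changes only on the "committed" path, which mirrors the abstract's claim that the symbolic state evolves only through verified tool executions.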
2025
EquiBench: Benchmarking Large Language Models’ Reasoning about Program Semantics via Equivalence Checking
Anjiang Wei | Jiannan Cao | Ran Li | Hongyu Chen | Yuhui Zhang | Ziheng Wang | Yuan Liu | Thiago S. F. X. Teixeira | Diyi Yang | Ke Wang | Alex Aiken
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs. Unlike prior code generation benchmarks, this task directly tests a model’s ability to reason about program semantics. EquiBench consists of 2400 program pairs across four languages and six categories. These pairs are generated through program analysis, compiler scheduling, and superoptimization, ensuring high-confidence labels, nontrivial difficulty, and full automation. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline. Further analysis reveals that models often rely on syntactic similarity rather than exhibiting robust reasoning about program semantics, highlighting current limitations. Our code and dataset are publicly available at https://github.com/Anjiang-Wei/equibench
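To make the equivalence-checking task concrete, here is a small illustrative sketch. The three functions and the `find_counterexample` helper are hypothetical and are not taken from the benchmark. Note that random testing of this kind can only refute equivalence; it cannot establish it for all inputs, which is why the task calls for reasoning about semantics.

```python
import random
from typing import Callable, Optional


def sum_loop(n: int) -> int:
    """Sum 0..n with an explicit loop (n assumed non-negative)."""
    total = 0
    for i in range(n + 1):
        total += i
    return total


def sum_closed(n: int) -> int:
    """Syntactically different, semantically equivalent for n >= 0."""
    return n * (n + 1) // 2


def sum_off_by_one(n: int) -> int:
    """Looks similar to sum_closed but is not equivalent."""
    return n * (n - 1) // 2


def find_counterexample(f: Callable[[int], int], g: Callable[[int], int],
                        trials: int = 1000, seed: int = 0) -> Optional[int]:
    """Return an input on which f and g disagree, or None if none was found."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(0, 1000)
        if f(x) != g(x):
            return x
    return None  # no disagreement observed; this is not a proof of equivalence


print(find_counterexample(sum_loop, sum_closed))
print(find_counterexample(sum_loop, sum_off_by_one) is not None)
```

The pair `sum_closed` and `sum_off_by_one` differ by one character yet compute different functions, while `sum_loop` and `sum_closed` look unrelated yet agree on every non-negative input. That contrast is the kind of gap between syntactic similarity and semantic equivalence the abstract refers to.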
2024
Unveiling the Spectrum of Data Contamination in Language Model: A Survey from Detection to Remediation
Chunyuan Deng | Yilun Zhao | Yuzhao Heng | Yitong Li | Jiannan Cao | Xiangru Tang | Arman Cohan
Findings of the Association for Computational Linguistics: ACL 2024
Data contamination has garnered increased attention in the era of large language models (LLMs) due to the reliance on extensive internet-derived training corpora. The overlap of training corpora with evaluation benchmarks, referred to as contamination, has been the focus of significant recent research. This body of work aims to identify contamination, understand its impacts, and explore mitigation strategies from diverse perspectives. However, comprehensive studies that provide a clear pathway from foundational concepts to advanced insights are lacking in this nascent field. Therefore, we present the first survey in the field of data contamination. We begin by examining the effects of data contamination across various stages and forms. We then provide a detailed analysis of current contamination detection methods, categorizing them to highlight their focus, assumptions, strengths, and limitations. We also discuss mitigation strategies, offering a clear guide for future research. This survey serves as a succinct overview of the most recent advancements in data contamination research, providing a straightforward guide for future research.
RA-ISF: Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback
Yanming Liu | Xinyue Peng | Xuhong Zhang | Weihao Liu | Jianwei Yin | Jiannan Cao | Tianyu Du
Findings of the Association for Computational Linguistics: ACL 2024
Large language models (LLMs) demonstrate exceptional performance in numerous tasks but still heavily rely on knowledge stored in their parameters. Moreover, updating this knowledge incurs high training costs. Retrieval-augmented generation (RAG) methods address this issue by integrating external knowledge. The model can answer questions it couldn’t previously by retrieving knowledge relevant to the query. This approach improves performance in certain scenarios for specific tasks. However, if irrelevant texts are retrieved, it may impair model performance. In this paper, we propose Retrieval Augmented Iterative Self-Feedback (RA-ISF), a framework that iteratively decomposes tasks and processes them in three submodules to enhance the model’s problem-solving capabilities. Experiments show that our method outperforms existing benchmarks, performing well on models like GPT3.5, Llama2, significantly enhancing factual reasoning capabilities and reducing hallucinations.
MIMIR: A Customizable Agent Tuning Platform for Enhanced Scientific Applications
Xiangru Tang | Chunyuan Deng | Hanmin Wang | Haoran Wang | Yilun Zhao | Wenqi Shi | May Fung | Wangchunshu Zhou | Jiannan Cao | Heng Ji | Arman Cohan | Mark Gerstein
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Recently, large language models (LLMs) have evolved into interactive agents, proficient in planning, tool use, and task execution across various tasks. However, without agent-tuning, open-source models like LLaMA2 currently struggle to match the efficiency of larger models such as GPT-4 in scientific applications due to a lack of agent tuning datasets. In response, we introduce MIMIR, a streamlined platform that leverages large LLMs to generate agent-tuning data for fine-tuning smaller, specialized models. By employing a role-playing methodology, MIMIR enables larger models to simulate various roles and create interaction data, which can then be used to fine-tune open-source models like LLaMA2. This approach ensures that even smaller models can effectively serve as agents in scientific tasks. Integrating these features into an end-to-end platform, MIMIR facilitates everything from the uploading of scientific data to one-click agent fine-tuning. MIMIR is publicly released and actively maintained at https://github.com/gersteinlab/MIMIR, along with a demo video for quick-start, calling for broader development.
Co-authors
- Arman Cohan 2
- Chunyuan Deng 2
- Yanming Liu 2
- Xinyue Peng 2
- Xiangru Tang 2
- Jianwei Yin 2
- Xuhong Zhang 2
- Yilun Zhao 2
- Alex Aiken 1
- Hongyu Chen 1
- Jintao Chen 1
- Songhang Deng 1
- Tianyu Du 1
- May Fung 1
- Mark Gerstein 1
- Yuzhao Heng 1
- Heng Ji 1
- Ran Li 1
- Yitong Li 1
- Yuan Liu 1
- Weihao Liu 1
- Wenqi Shi 1
- Thiago S. F. X. Teixeira 1
- Ziheng Wang 1
- Ke Wang 1
- Xinyi Wang 1
- Hanmin Wang 1
- Haoran Wang 1
- Anjiang Wei 1
- Diyi Yang 1
- Yuhui Zhang 1
- Wangchunshu Zhou 1