Ziyang Li
2026
Unifying Inference-Time Planning Language Generation
Prabhu Prakash Kagitha | Bo Sun | Ishan Desai | Andrew Zhu | Cassie Huang | Manling Li | Ziyang Li | Li Zhang
Findings of the Association for Computational Linguistics: ACL 2026
A line of work in planning uses LLMs not to generate a plan, but to generate a formal representation in some planning language, which can be input into a symbolic solver to deterministically find a plan. While this approach has shown improved trust and promising performance, dozens of recent publications have proposed scattered methods on a variety of benchmarks under different experimental settings. We attempt to unify the inference-time LLM-as-formalizer methodology for classical planning by proposing an organizational framework based on intermediate representations. We then systematically evaluate more than a dozen pipelines that subsume most existing work, while proposing novel ones that involve syntactically similar but high-resource intermediate languages (such as a Python wrapper of PDDL). We provide recipes for planning language generation pipelines, draw a series of conclusions showing the efficacy of their various components, and evidence their robustness against problem complexity.
2025
TurnaboutLLM: A Deductive Reasoning Benchmark from Detective Games
Yuan Yuan | Muyu He | Muhammad Adil Shahid | Ziyang Li | Jiani Huang | Li Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
This paper introduces TurnaboutLLM, a novel framework and dataset for evaluating the deductive reasoning abilities of Large Language Models (LLMs) by leveraging the interactive gameplay of the detective games Ace Attorney and Danganronpa. The framework tasks LLMs with identifying contradictions between testimonies and evidence within long narrative contexts, a challenging task due to the large answer space and diverse reasoning types presented by its questions. We evaluate twelve state-of-the-art LLMs on the dataset, and the results hint at limitations of popular strategies for enhancing deductive reasoning such as extensive thinking and Chain-of-Thought prompting. The results also suggest varying effects of context size, reasoning steps, and answer space size on model performance. Overall, TurnaboutLLM presents a substantial challenge for LLMs’ deductive reasoning abilities in complex, narrative-rich environments.
2023
Improved Logical Reasoning of Language Models via Differentiable Symbolic Programming
Hanlin Zhang | Jiani Huang | Ziyang Li | Mayur Naik | Eric Xing
Findings of the Association for Computational Linguistics: ACL 2023
Pre-trained large language models (LMs) struggle to perform logical reasoning reliably despite advances in scale and compositionality. In this work, we tackle this challenge through the lens of symbolic programming. We propose DSR-LM, a Differentiable Symbolic Reasoning framework where pre-trained LMs govern the perception of factual knowledge, and a symbolic module performs deductive reasoning. In contrast to works that rely on hand-crafted logic rules, our differentiable symbolic reasoning framework efficiently learns weighted rules and applies semantic loss to further improve LMs. DSR-LM is scalable, interpretable, and allows easy integration of prior knowledge, thereby supporting extensive symbolic programming to robustly derive a logical conclusion. The results of our experiments suggest that DSR-LM improves the logical reasoning abilities of pre-trained language models, resulting in a significant increase in accuracy of over 20% on deductive reasoning benchmarks. Furthermore, DSR-LM outperforms a variety of competitive baselines when faced with systematic changes in sequence length.