Zhenyu Wu

XJTU

Other people with similar names: Zhenyu Wu (May refer to several people)

2025

pdf bib abs
Enhancing Mathematical Reasoning in LLMs by Stepwise Correction
Zhenyu Wu | Qingkai Zeng | Zhihan Zhang | Zhaoxuan Tan | Chao Shen | Meng Jiang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Best-of-N decoding methods instruct large language models (LLMs) to generate multiple solutions, score each using a scoring function, and select the highest scored as the final answer to mathematical reasoning problems. However, this repeated independent process often leads to the same mistakes, making the selected solution still incorrect. We propose a novel prompting method named Stepwise Correction (StepCo) that helps LLMs identify and revise incorrect steps in their generated reasoning paths. It iterates verification and revision phases that employ a process-supervised verifier. The verify-then-revise process not only improves answer correctness but also reduces token consumption with fewer paths needed to generate. With StepCo, a series of LLMs demonstrate exceptional performance. Notably, using GPT-4o as the backend LLM, StepCo achieves an average accuracy of 94.1 across eight datasets, significantly outperforming the state-of-the-art Best-of-N method by +2.4, while reducing token consumption by 77.8%. Our implementation is made publicly available at https://wzy6642.github.io/stepco.github.io.

Taxonomies provide structural representations of knowledge and are crucial in various applications. The task of taxonomy expansion involves integrating emerging entities into existing taxonomies by identifying appropriate parent entities for these new query entities. Previous methods rely on self-supervised techniques that generate annotation data from existing taxonomies but are less effective with small taxonomies (fewer than 100 entities). In this work, we introduce CodeTaxo, a novel approach that leverages large language models through code language prompts to capture the taxonomic structure. Extensive experiments on five real-world benchmarks from different domains demonstrate that CodeTaxo consistently achieves superior performance across all evaluation metrics, significantly outperforming previous state-of-the-art methods. The code and data are available at https://github.com/QingkaiZeng/CodeTaxo-official.

2024

Intrinsic self-correct was a method that instructed large language models (LLMs) to verify and correct their responses without external feedback. Unfortunately, the study concluded that the LLMs could not self-correct reasoning yet. We find that a simple yet effective prompting method enhances LLM performance in identifying and correcting inaccurate answers without external feedback.That is to mask a key condition in the question, add the current response to construct a verification question, and predict the condition to verify the response. The condition can be an entity in an open-domain question or a numerical value in an arithmetic question, which requires minimal effort (via prompting) to identify. We propose an iterative verify-then-correct framework to progressively identify and correct (probably) false responses, named ProCo. We conduct experiments on three reasoning tasks. On average, ProCo, with GPT-3.5-Turbo-1106 as the backend LLM, yields +6.8 exact match on four open-domain question answering datasets, +14.1 accuracy on three arithmetic reasoning datasets, and +9.6 accuracy on a commonsense reasoning dataset, compared to Self-Correct.Our implementation is made publicly available at https://wzy6642.github.io/proco.github.io/.

pdf bib abs
Instructing Large Language Models to Identify and Ignore Irrelevant Conditions
Zhenyu Wu | Chao Shen | Meng Jiang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Math word problem (MWP) solving requires generating a reasoning path based on a given problem description that often contains irrelevant conditions.Existing chain-of-thought (CoT) prompting methods elicited multi-step reasoning abilities of large language models (LLMs) to solve MWPs.However, they were seriously confused by the irrelevant conditions, resulting in low accuracy.In this paper, we propose a novel approach named I³C that instructs LLMs to identify and ignore irrelevant conditions.It identifies a set of irrelevant condition candidates that have a weak semantic relevance with the question.Then it prompts LLMs to verify the irrelevant conditions.Lastly it instructs the LLMs with the verification on relevant and irrelevant conditions to avoid confusion and improve reasoning paths.Moreover, we propose to select (problem, reasoning paths) pairs as demonstrations to enhance I³C with few-shot reasoning. We develop I³C-Select that selects the most confusing problems based on the semantic relevance measurement.We conduct extensive experiments on eight MWP datasets.I³C can be combined with any CoT prompting methods to improve the performance of solving MWPs.Notably, with GPT-3.5-Turbo and I³C-Select, we achieve an accuracy of 96.0 and 94.1 on GSM-IC2-1K and GSM-ICM-1K, respectively, significantly outperforming the state-of-the-art few-shot prompting method Complex-CoT by +11.7 and +11.1.Our implementation is made publicly available at https://wzy6642.github.io/I3C.github.io/.

Co-authors

Yuyang Bai 1

Shangbin Feng 1

Venues

Fix author