Ye Yuan
Papers on this page may belong to the following people: Ye Yuan, Ye Yuan
2026
Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4
Chengwu Liu | Yichun Yin | Ye Yuan | Jiaxuan Xie | Botao Li | Siqi Li | Jianhao Shen | Yan Xu | Lifeng Shang | Ming Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Chengwu Liu | Yichun Yin | Ye Yuan | Jiaxuan Xie | Botao Li | Siqi Li | Jianhao Shen | Yan Xu | Lifeng Shang | Ming Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Most ATP benchmarks embed the final answer within the formal statement — a convention we call "Easy Mode" — a design that simplifies the task relative to what human competitors face and may lead to optimistic estimates of model capability.We call the stricter, more realistic setting "Hard Mode": the system must independently discover the answer before constructing a formal proof.To enable Hard Mode research, we make two contributions.First, we release MiniF2F-Hard and FIMO-Hard, expert-reannotated Hard Mode variants of two widely-used ATP benchmarks.Second, we introduce Discover And Prove (DAP), an agentic framework that uses LLM natural-language reasoning with explicit self-reflection to discover answers, then rewrites Hard Mode statements into Easy Mode ones for existing ATP provers.DAP sets the state of the art: on CombiBench it raises solved problems from 7 (previous SOTA, Pass@16) to 10; on PutnamBench it is the first system to formally prove 36 theorems in Hard Mode — while simultaneously revealing that state-of-the-art LLMs exceed 80% answer accuracy on the same problems where formal provers manage under 10%, exposing a substantial gap that Hard Mode benchmarks are uniquely suited to measure.
SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models
Yiyang Gu | Junwei Yang | Junyu Luo | Ye Yuan | Bin Feng | Yingce Xia | Shufang Xie | Kaili Liu | Bohan Wu | Qi Shi | Haoran Li | Beier Xiao | Zhiping Xiao | Xiao Luo | Weizhi Zhang | Philip S. Yu | Zequn Liu | Ming Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yiyang Gu | Junwei Yang | Junyu Luo | Ye Yuan | Bin Feng | Yingce Xia | Shufang Xie | Kaili Liu | Bohan Wu | Qi Shi | Haoran Li | Beier Xiao | Zhiping Xiao | Xiao Luo | Weizhi Zhang | Philip S. Yu | Zequn Liu | Ming Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs.
2025
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
Jinsheng Huang | Liang Chen | Taian Guo | Fu Zeng | Yusheng Zhao | Bohan Wu | Ye Yuan | Haozhe Zhao | Zhihui Guo | Yichi Zhang | Jingyang Yuan | Wei Ju | Luchen Liu | Tianyu Liu | Baobao Chang | Ming Zhang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Jinsheng Huang | Liang Chen | Taian Guo | Fu Zeng | Yusheng Zhao | Bohan Wu | Ye Yuan | Haozhe Zhao | Zhihui Guo | Yichi Zhang | Jingyang Yuan | Wei Ju | Luchen Liu | Tianyu Liu | Baobao Chang | Ming Zhang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEVALPRO, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEVALPRO comprises 2,138 question triplets, totaling 6,414 distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEVALPRO is **more challenging** (the best LMM lags behind human performance by 31.73%, compared to an average gap of 8.03% in previous benchmarks) and **more trustworthy** (the best LLM trails the best LMM by 23.09%, whereas the gap for previous benchmarks is just 14.64%). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.
Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification
Chengwu Liu | Ye Yuan | Yichun Yin | Yan Xu | Xin Xu | Zaoyu Chen | Yasheng Wang | Lifeng Shang | Qun Liu | Ming Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Chengwu Liu | Ye Yuan | Yichun Yin | Yan Xu | Xin Xu | Zaoyu Chen | Yasheng Wang | Lifeng Shang | Qun Liu | Ming Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Chain-of-Thought (CoT) prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). However, to mitigate hallucinations in CoT that are notoriously difficult to detect, current methods such as process reward models (PRMs) or self-consistency operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness. To address this issue, we draw inspiration from the idea that “the gold standard for supporting a mathematical claim is to provide a proof”. We propose a retrospective, step-aware formal verification framework Safe. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations. We evaluate our framework Safe across multiple language models and various mathematical datasets, demonstrating a significant performance improvement while offering interpretable and verifiable evidence. We also propose FormalStep as a benchmark for step correctness theorem proving with 30,809 formal statements. To the best of our knowledge, our work represents the first endeavor to utilize formal mathematical language Lean 4 for verifying content generated by LLMs, aligning with the reason why formal mathematical languages were created in the first place: to provide a robust foundation for hallucination-prone human-written proofs.
2024
Measuring Social Norms of Large Language Models
Ye Yuan | Kexin Tang | Jianhao Shen | Ming Zhang | Chenguang Wang
Findings of the Association for Computational Linguistics: NAACL 2024
Ye Yuan | Kexin Tang | Jianhao Shen | Ming Zhang | Chenguang Wang
Findings of the Association for Computational Linguistics: NAACL 2024
We present a new challenge to examine whether large language models understand social norms. In contrast to existing datasets, our dataset requires a fundamental understanding of social norms to solve. Our dataset features the largest set of social norm skills, consisting of 402 skills and 12,383 questions covering a wide set of social norms ranging from opinions and arguments to culture and laws. We design our dataset according to the K-12 curriculum. This enables the direct comparison of the social understanding of large language models to humans, more specifically, elementary students. While prior work generates nearly random accuracy on our benchmark, recent large language models such as GPT3.5-Turbo and LLaMA2-Chat are able to improve the performance significantly, only slightly below human performance. We then propose a multi-agent framework based on large language models to improve the models’ ability to understand social norms. This method further improves large language models to be on par with humans. Given the increasing adoption of large language models in real-world applications, our finding is particularly important and presents a unique direction for future improvements.
2023
TRIGO: Benchmarking Formal Mathematical Proof Reduction for Generative Language Models
Jing Xiong | Jianhao Shen | Ye Yuan | Haiming Wang | Yichun Yin | Zhengying Liu | Lin Li | Zhijiang Guo | Qingxing Cao | Yinya Huang | Chuanyang Zheng | Xiaodan Liang | Ming Zhang | Qun Liu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Jing Xiong | Jianhao Shen | Ye Yuan | Haiming Wang | Yichun Yin | Zhengying Liu | Lin Li | Zhijiang Guo | Qingxing Cao | Yinya Huang | Chuanyang Zheng | Xiaodan Liang | Ming Zhang | Qun Liu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Automated theorem proving (ATP) has become an appealing domain for exploring the reasoning ability of the recent successful generative language models. However, current ATP benchmarks are mainly focus on symbolic inference, but rarely involve the understanding of complex number combination reasoning. In this work, we propose TRIGO, an ATP benchmark that not only requires a model to reduce a trigonometric expression with step-by-step proof but also evaluates a generative LM’s reasoning ability on formulas and capability to manipulate, group, and factor number terms. We gather trigonometric expressions and their reduced forms from web, annotate the simplification process manually, and translate it into the “Lean” formal language system. We then automatically generate additional examples from the annotated samples to expand the dataset. Furthermore, we also create three automatically generated training and testing datasets of varying difficulty and distributions. Our extensive experiments show our proposed TRIGO poses a new challenge for advanced generative LM’s including GPT-4 which is pre-trained on a considerable amount of open-source formal theorem-proving language data, and provide a new tool to study the generative LM’s ability on both formal and mathematical reasoning.
DT-Solver: Automated Theorem Proving with Dynamic-Tree Sampling Guided by Proof-level Value Function
Haiming Wang | Ye Yuan | Zhengying Liu | Jianhao Shen | Yichun Yin | Jing Xiong | Enze Xie | Han Shi | Yujun Li | Lin Li | Jian Yin | Zhenguo Li | Xiaodan Liang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Haiming Wang | Ye Yuan | Zhengying Liu | Jianhao Shen | Yichun Yin | Jing Xiong | Enze Xie | Han Shi | Yujun Li | Lin Li | Jian Yin | Zhenguo Li | Xiaodan Liang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advances in neural theorem-proving resort to large language models and tree searches. When proving a theorem, a language model advises single-step actions based on the current proving state and the tree search finds a sequence of correct steps using actions given by the language model. However, prior works often conduct constant computation efforts for each proving state while ignoring that the hard states often need more exploration than easy states. Moreover, they evaluate and guide the proof search solely depending on the current proof state instead of considering the whole proof trajectory as human reasoning does. Here, to accommodate general theorems, we propose a novel Dynamic-Tree Driven Theorem Solver (DT-Solver) by guiding the search procedure with state confidence and proof-level values. Specifically, DT-Solver introduces a dynamic-tree Monte-Carlo search algorithm, which dynamically allocates computing budgets for different state confidences, guided by a new proof-level value function to discover proof states that require substantial exploration. Experiments on two popular theorem-proving datasets, PISA and Mathlib, show significant performance gains by our DT-Solver over the state-of-the-art approaches, with a 6.65% improvement on average in terms of success rate. And especially under low computing resource settings (11.03% improvement on average).
TREA: Tree-Structure Reasoning Schema for Conversational Recommendation
Wendi Li | Wei Wei | Xiaoye Qu | Xian-Ling Mao | Ye Yuan | Wenfeng Xie | Dangyang Chen
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Wendi Li | Wei Wei | Xiaoye Qu | Xian-Ling Mao | Ye Yuan | Wenfeng Xie | Dangyang Chen
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Conversational recommender systems (CRS) aim to timely trace the dynamic interests of users through dialogues and generate relevant responses for item recommendations. Recently, various external knowledge bases (especially knowledge graphs) are incorporated into CRS to enhance the understanding of conversation contexts. However, recent reasoning-based models heavily rely on simplified structures such as linear structures or fixed-hierarchical structures for causality reasoning, hence they cannot fully figure out sophisticated relationships among utterances with external knowledge. To address this, we propose a novel Tree structure Reasoning schEmA named TREA. TREA constructs a multi-hierarchical scalable tree as the reasoning structure to clarify the causal relationships between mentioned entities, and fully utilizes historical conversations to generate more reasonable and suitable responses for recommended results. Extensive experiments on two public CRS datasets have demonstrated the effectiveness of our approach.
2022
PALT: Parameter-Lite Transfer of Language Models for Knowledge Graph Completion
Jianhao Shen | Chenguang Wang | Ye Yuan | Jiawei Han | Heng Ji | Koushik Sen | Ming Zhang | Dawn Song
Findings of the Association for Computational Linguistics: EMNLP 2022
Jianhao Shen | Chenguang Wang | Ye Yuan | Jiawei Han | Heng Ji | Koushik Sen | Ming Zhang | Dawn Song
Findings of the Association for Computational Linguistics: EMNLP 2022
This paper presents a parameter-lite transfer learning approach of pretrained language models (LM) for knowledge graph (KG) completion. Instead of finetuning, which modifies all LM parameters, we only tune a few new parameters while keeping the original LM parameters fixed. We establish this via reformulating KG completion as a “fill-in-the-blank” task, and introducing a parameter-lite encoder on top of the original LMs. We show that, by tuning far fewer parameters than finetuning, LMs transfer non-trivially to most tasks and reach competitiveness with prior state-of-the-art approaches. For instance, we outperform the fully finetuning approaches on a KG completion benchmark by tuning only 1% of the parameters.
Search
Fix author
Co-authors
- Jianhao Shen 5
- Yichun Yin 4
- Ming Zhang 4
- Ming Zhang 3
- Lin Li 2
- Xiaodan Liang 2
- Chengwu Liu 2
- Qun Liu 2
- Zhengying Liu 2
- Lifeng Shang 2
- Chenguang Wang 2
- Haiming Wang 2
- Bohan Wu 2
- Jing Xiong 2
- Yan Xu 2
- Qingxing Cao 1
- Baobao Chang (常宝宝) 1
- Dangyang Chen 1
- Liang Chen 1
- Zaoyu Chen 1
- Bin Feng 1
- Yiyang Gu 1
- Taian Guo 1
- Zhihui Guo 1
- Zhijiang Guo 1
- Jiawei Han 1
- Jinsheng Huang 1
- Yinya Huang 1
- Heng Ji 1
- Wei Ju 1
- Botao Li 1
- Haoran Li 1
- Siqi Li 1
- Wendi Li 1
- Yujun Li 1
- Zhenguo Li 1
- Kaili Liu 1
- Luchen Liu 1
- Tianyu Liu 1
- Zequn Liu 1
- Junyu Luo 1
- Xiao Luo 1
- Xian-Ling Mao 1
- Xiaoye Qu 1
- Koushik Sen 1
- Han Shi 1
- Qi Shi 1
- Dawn Song 1
- Kexin Tang 1
- Yasheng Wang 1
- Wei Wei 1
- Yingce Xia 1
- Beier Xiao 1
- Zhiping Xiao 1
- Enze Xie 1
- Jiaxuan Xie 1
- Shufang Xie 1
- Wenfeng Xie 1
- Xin Xu 1
- Junwei Yang 1
- Jian Yin 1
- Philip S. Yu 1
- Jingyang Yuan 1
- Fu Zeng 1
- Weizhi Zhang 1
- Yichi Zhang 1
- Haozhe Zhao 1
- Yusheng Zhao 1
- Chuanyang Zheng 1