Yu-Feng Li
2026
TabularMath: Understanding Math Reasoning over Tables with Large Language Models
Shi-Yu Tian | Zhi Zhou | Wei Dong | Kun-Yang Yu | Ming Yang | Zi-Jian Cheng | Lan-Zhe Guo | Yu-Feng Li
Findings of the Association for Computational Linguistics: ACL 2026
Mathematical reasoning has long been a key benchmark for evaluating large language models. Although substantial progress has been made on math word problems, the need for reasoning over tabular data in real-world applications has been overlooked. For instance, applications such as business intelligence demand not only multi-step numerical reasoning with tables but also robustness to incomplete or inconsistent information. However, comprehensive evaluation in this area is severely limited, constrained by the reliance on manually collected tables that are difficult to scale and the lack of coverage for potential traps encountered in real-world scenarios. To address this problem, we propose AutoT2T, a neuro-symbolic framework that controllably transforms math word problems into scalable and verified tabular reasoning tasks. Building on this pipeline, we develop TabularMath, a benchmark comprising four subsets that include both text-based and image-based tables, covering table complexity, table quality, and table representation dimensions. Our study reveals three key observations: (1) Table complexity and reasoning difficulty impact reasoning performance jointly; (2) Low-quality tables pose severe risks to reliable reasoning in current LLMs; (3) Different table modalities show similar trends, with text-based tables typically being easier for models to reason over. In-depth analyses are conducted for each observation to guide future research.
2025
AutoEvolve: Automatically Evolving Queries for Applicable and Scalable Retrieval-Augmented Generation Benchmarking
Ding-Chu Zhang | Xiaowen Zhang | Yue Fei | Renjun Hu | Xiao-Wen Yang | Zhi Zhou | Baixuan Li | Yu-Feng Li | Xing Shi | Wei Lin
Findings of the Association for Computational Linguistics: EMNLP 2025
Retrieval-augmented generation (RAG) enables large language models (LLMs) to address queries beyond their internal knowledge by integrating domain knowledge from specialized corpora, which makes it necessary to generate benchmarks on specific corpora to evaluate RAG systems. However, existing automated generation methods exhibit Weak Applicability and Weak Scalability. Weak Applicability refers to the reliance on metadata from specific corpora for query generation, which constrains applicability to other corpora. Weak Scalability refers to query content being fixed after generation, so that difficulty cannot be increased dynamically, which limits the scalability of the queries. To overcome these issues, we propose AutoEvolve, an applicable approach for dynamically evolving queries to construct scalable RAG benchmarks. Our approach is grounded in three key innovations: (i) a corpus-agnostic method for constructing the universal entity-document graph; (ii) a suite of evolution operations designed to dynamically update queries; and (iii) a difficulty-guided metric that directs the query evolution process. Through experiments on three generated benchmarks, we demonstrate that AutoEvolve evolves queries that are significantly more challenging, paving the way for more applicable and scalable RAG evaluations.
EvolveSearch: An Iterative Self-Evolving Search Agent
Ding-Chu Zhang | Yida Zhao | Jialong Wu | Liwen Zhang | Baixuan Li | Wenbiao Yin | Yong Jiang | Yu-Feng Li | Kewei Tu | Pengjun Xie | Fei Huang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
The rapid advancement of large language models (LLMs) has transformed the landscape of agentic information-seeking capabilities through the integration of tools such as search engines and web browsers. However, current mainstream approaches for enabling LLM web search proficiency face significant challenges: supervised fine-tuning (SFT) struggles with data production in open-search domains, while reinforcement learning (RL) converges quickly, limiting its data utilization efficiency. To address these issues, we propose EvolveSearch, a novel iterative self-evolution framework that combines SFT and RL to enhance agentic web search capabilities without any external human-annotated reasoning data. Extensive experiments on seven multi-hop question-answering (MHQA) benchmarks demonstrate that EvolveSearch consistently improves performance across iterations, ultimately achieving an average improvement of 4.7% over the current state-of-the-art across these benchmarks, opening the door to self-evolving agentic capabilities in open web search domains.
VCSearch: Bridging the Gap Between Well-Defined and Ill-Defined Problems in Mathematical Reasoning
Shi-Yu Tian | Zhi Zhou | Kun-Yang Yu | Ming Yang | Lin-Han Jia | Lan-Zhe Guo | Yu-Feng Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have demonstrated impressive performance on reasoning tasks, including mathematical reasoning. However, current evaluation mostly focuses on carefully constructed benchmarks and neglects real-world reasoning problems that present missing or contradictory conditions, known as ill-defined problems. To further study this problem, we develop a large-scale benchmark called Problems with Missing and Contradictory conditions (PMC) containing over 5,000 validated ill-defined mathematical problems. Our preliminary experiments on PMC reveal two challenges for existing methods: (1) traditional methods exhibit a trade-off between solving accuracy and rejection capabilities, and (2) formal methods struggle with modeling complex problems. To address these challenges, we develop Variable-Constraint Search (VCSearch), a training-free framework that leverages formal language to detect ill-defined problems, where a variable-constraint pair search strategy is incorporated to improve the modeling capability of formal language. Extensive experiments demonstrate that VCSearch improves the accuracy of identifying unsolvable problems by at least 12% across different LLMs, thus achieving more robust mathematical reasoning ability.
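As a rough illustration of what "missing" and "contradictory" conditions mean once a problem has been formalized, the sketch below classifies a small system of linear equations by comparing matrix ranks. This is not VCSearch and not the PMC construction procedure; it is a toy, assuming the problem has already been translated into linear constraints, and the names `rank` and `classify` are hypothetical.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix (list of lists), via elimination over exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    r = 0  # number of pivots found so far
    for c in range(n_cols):
        # Find a row at or below r with a non-zero entry in column c.
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        # Clear column c in every other row.
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


def classify(coeffs, consts):
    """Label a linear system A x = b (illustrative assumption: constraints are linear).

    "contradictory": no assignment satisfies all constraints.
    "missing":       consistent, but the variables are not uniquely determined.
    "well-defined":  exactly one solution.
    """
    n_vars = len(coeffs[0])
    augmented = [row + [b] for row, b in zip(coeffs, consts)]
    rank_a, rank_ab = rank(coeffs), rank(augmented)
    if rank_a < rank_ab:
        return "contradictory"
    if rank_a < n_vars:
        return "missing"
    return "well-defined"


# x + y = 10 and x - y = 2 determine both unknowns.
print(classify([[1, 1], [1, -1]], [10, 2]))
# x + y = 10 alone leaves the unknowns underdetermined.
print(classify([[1, 1]], [10]))
# x + y = 10 and x + y = 12 cannot both hold.
print(classify([[1, 1], [1, 1]], [10, 12]))
```

Real word problems are rarely purely linear, which is one reason the abstract notes that formal methods struggle with modeling complex problems; the rank test here only covers the simplest case.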