Yifei Li
2025
AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists
Yifei Li | Hanane Nour Moussa | Ziru Chen | Shijie Chen | Botao Yu | Mingyi Xue | Benjamin Burns | Tzu-Yao Chiu | Vishal Dey | Zitong Lu | Chen Wei | Qianheng Zhang | Tianyu Zhang | Song Gao | Xuhui Huang | Xia Ning | Nesreen K. Ahmed | Ali Payani | Huan Sun
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Despite long-standing efforts in accelerating scientific discovery with AI, building AI co-scientists remains challenging due to limited high-quality data for training and evaluation. To tackle this data scarcity issue, we present AutoSDT, an automatic pipeline that collects high-quality coding tasks in real-world data-driven discovery workflows. AutoSDT leverages the coding capabilities and parametric knowledge of LLMs to search for diverse sources, select ecologically valid tasks, and synthesize accurate task instructions and code solutions. Using our pipeline, we construct AutoSDT-5K, a dataset of 5,404 coding tasks for data-driven discovery that covers four scientific disciplines and 756 unique Python packages. To the best of our knowledge, AutoSDT-5K is the only automatically collected and the largest open dataset for data-driven scientific discovery. Expert feedback on a subset of 256 tasks shows the effectiveness of AutoSDT: 93% of the collected tasks are ecologically valid, and 92.2% of the synthesized programs are functionally correct. Trained on AutoSDT-5K, the Qwen2.5-Coder-Instruct LLM series, dubbed AutoSDT-Coder, show substantial improvement on two challenging data-driven discovery benchmarks, ScienceAgentBench and DiscoveryBench. Most notably, AutoSDT-Coder-32B reaches the same level of performance as GPT-4o on ScienceAgentBench with a success rate of 7.8%, doubling the performance of its base model. On DiscoveryBench, it lifts the hypothesis matching score to 8.1, bringing a 17.4% relative improvement and closing the gap between open-weight models and GPT-4o.
2024
AttributionBench: How Hard is Automatic Attribution Evaluation?
Yifei Li | Xiang Yue | Zeyi Liao | Huan Sun
Findings of the Association for Computational Linguistics: ACL 2024
Modern generative search engines enhance the reliability of large language model (LLM) responses by providing cited evidence. However, evaluating the answer’s attribution, i.e., whether every claim within the generated responses is fully supported by its cited evidence, remains an open problem. This verification, traditionally dependent on costly human evaluation, underscores the urgent need for automatic attribution evaluation methods. To bridge the gap in the absence of standardized benchmarks for these methods, we present AttributionBench, a comprehensive benchmark compiled from various existing attribution datasets. Our extensive experiments on AttributionBench reveal the challenges of automatic attribution evaluation, even for state-of-the-art LLMs. Specifically, our findings show that even a fine-tuned GPT-3.5 only achieves around 80% macro-F1 under a binary classification formulation. A detailed analysis of more than 300 error cases indicates that a majority of failures stem from the model’s inability to process nuanced information and from the discrepancy between the information available to the model and the information available to human annotators.
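For readers unfamiliar with the metric named in the abstract above, the following is a minimal sketch of macro-F1 under a binary formulation. It is not taken from the paper: the label encoding (1 = attributable, 0 = not attributable), the function names, and the toy labels are all assumptions made for illustration.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1, so both classes count equally."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical toy labels: 1 = attributable, 0 = not attributable (assumed encoding).
gold = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 0, 0, 1]
score = macro_f1(gold, pred)
print(f"macro-F1 = {score:.3f}")
```

Because macro-F1 averages the per-class scores without weighting by class frequency, a classifier cannot score well simply by favoring the majority label.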
2023
Making Language Models Better Reasoners with Step-Aware Verifier
Yifei Li | Zeqi Lin | Shizhuo Zhang | Qiang Fu | Bei Chen | Jian-Guang Lou | Weizhu Chen
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Few-shot learning is a challenging task that requires language models to generalize from limited examples. Large language models like GPT-3 and PaLM have made impressive progress in this area, but they still face difficulties in reasoning tasks such as GSM8K, a benchmark for arithmetic problems. To improve their reasoning skills, previous work has proposed to guide the language model with prompts that elicit a series of reasoning steps before giving the final answer, achieving a significant improvement on GSM8K from 17.9% to 58.1% in problem-solving rate. In this paper, we present DiVeRSe (Diverse Verifier on Reasoning Step), a novel approach that further enhances the reasoning capability of language models. DiVeRSe has three main components: first, it generates diverse prompts to explore different reasoning paths for the same question; second, it uses a verifier to filter out incorrect answers based on a weighted voting scheme; and third, it verifies each reasoning step individually instead of the whole chain. We evaluate DiVeRSe on the latest language model code-davinci-002 and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks (e.g., GSM8K 74.4% to 83.2%).
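The verifier-weighted voting idea mentioned in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: it assumes each sampled reasoning path has already been reduced to a final answer plus a verifier score in [0, 1], and the function names and toy scores are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(candidates):
    """Baseline: pick the answer that appears most often, ignoring scores."""
    counts = Counter(answer for answer, _ in candidates)
    return counts.most_common(1)[0][0]


def verifier_weighted_vote(candidates):
    """Pick the answer whose reasoning paths have the largest total verifier score."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)


# Hypothetical (answer, verifier_score) pairs for five sampled reasoning paths.
sampled = [("18", 0.9), ("20", 0.4), ("20", 0.3), ("18", 0.2), ("20", 0.35)]

print("majority vote:", majority_vote(sampled))
print("verifier-weighted vote:", verifier_weighted_vote(sampled))
```

In this toy case the two schemes disagree: "20" appears more often, but the paths leading to "18" carry a higher total verifier score, so the weighted vote selects "18".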