Weiyuan Chen
2025
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
Yilun Zhao | Weiyuan Chen | Zhijian Xu | Manasi Patwardhan | Chengye Wang | Yixin Liu | Lovekesh Vig | Arman Cohan
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 2,000 expert-annotated examples derived from 677 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as GPT-4o and Llama-3.1, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are unreliable for this task, as their judgments diverge significantly from human assessment. To investigate this further, we develop AbGen-Eval, a meta-evaluation benchmark for assessing the reliability of commonly used automated evaluation systems in measuring LLM performance on our task. We investigate various LLM-based evaluation methods on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.
2024
FinDVer: Explainable Claim Verification over Long and Hybrid-content Financial Documents
Yilun Zhao | Yitao Long | Tintin Jiang | Chengye Wang | Weiyuan Chen | Hongjun Liu | Xiangru Tang | Yiming Zhang | Chen Zhao | Arman Cohan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
We introduce FinDVer, a comprehensive benchmark specifically designed to evaluate the explainable claim verification capabilities of LLMs in understanding and analyzing long, hybrid-content financial documents. FinDVer contains 4,000 expert-annotated examples across four subsets, each focusing on a type of scenario that frequently arises in real-world financial domains. We assess a broad spectrum of 25 LLMs under both long-context and RAG settings. Our results show that even the current best-performing system (i.e., GPT-4o) significantly lags behind human experts. Our detailed findings and insights highlight the strengths and limitations of existing LLMs on this new task. We believe FinDVer can serve as a valuable benchmark for evaluating LLM capabilities in claim verification over complex, expert-domain documents.
Co-authors
- Arman Cohan 2
- Chengye Wang 2
- Yilun Zhao 2
- Tintin Jiang 1
- Hongjun Liu 1
- Zhijian Xu 1
- Manasi Patwardhan 1
- Yixin Liu 1
- Lovekesh Vig 1
- Yitao Long 1
- Xiangru Tang 1
- Yiming Zhang 1
- Chen Zhao 1