William Walden


2025

CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers?
Jiefu Ou | William Walden | Kate Sanders | Zhengping Jiang | Kaiser Sun | Jeffrey Cheng | William Jurayj | Miriam Wanner | Shaobo Liang | Candice Morgan | Seunghoon Han | Weiqi Wang | Chandler May | Hannah Recknor | Daniel Khashabi | Benjamin Van Durme
Findings of the Association for Computational Linguistics: EMNLP 2025

A core part of scientific peer review involves providing expert critiques that directly assess the scientific claims a paper makes. While it is now possible to automatically generate plausible (if generic) reviews, ensuring that these reviews are sound and grounded in the papers’ claims remains challenging. To facilitate LLM benchmarking on these challenges, we introduce CLAIMCHECK, an annotated dataset of NeurIPS 2023 and 2024 submissions and reviews mined from OpenReview. CLAIMCHECK is richly annotated by ML experts for weakness statements in the reviews and the paper claims that they dispute, as well as fine-grained labels of the validity, objectivity, and type of the identified weaknesses. We benchmark several LLMs on three claim-centric tasks supported by CLAIMCHECK, requiring models to (1) associate weaknesses with the claims they dispute, (2) predict fine-grained labels for weaknesses and rewrite the weaknesses to enhance their specificity, and (3) verify a paper’s claims with grounded reasoning. Our experiments reveal that cutting-edge LLMs, while capable of predicting weakness labels in (2), continue to underperform relative to human experts on all other tasks.

Cross-Document Event-Keyed Summarization
William Walden | Pavlo Kuchmiichuk | Alexander Martin | Chihsheng Jin | Angela Cao | Claire Sun | Curisia Allen | Aaron Steven White
Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)

Event-keyed summarization (EKS) requires summarizing a specific event described in a document, given the document text and an event representation extracted from it. In this work, we extend EKS to the cross-document setting (CDEKS), in which summaries must synthesize information from accounts of the same event given by multiple sources. We introduce **SEAMuS** (**S**ummaries of **E**vents **A**cross **Mu**ltiple **S**ources), a high-quality dataset for CDEKS based on an expert reannotation of the FAMuS dataset for cross-document argument extraction. We present a suite of baselines on SEAMuS, covering both smaller fine-tuned models and zero- and few-shot prompted LLMs, along with detailed ablations and a human evaluation study, showing SEAMuS to be a valuable benchmark for this new task.