Alan Li
2026
A Survey on Evaluation of LLM-based Agents
Asaf Yehudai | Lilach Eden | Alan Li | Guy Uziel | Yilun Zhao | Roy Bar-Haim | Arman Cohan | Michal Shmueli-Scheuer
Findings of the Association for Computational Linguistics: ACL 2026
LLM-based agents represent a paradigm shift in AI, enabling autonomous systems to plan, reason, and use tools while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methods for these increasingly capable agents. We analyze the field of agent evaluation across five perspectives: (1) Core LLM capabilities needed for agentic workflows, such as planning and tool use; (2) Application-specific benchmarks such as web and SWE agents; (3) Evaluation of generalist agents; (4) Analysis of agent benchmarks’ core dimensions; and (5) Evaluation frameworks and tools for agent developers. Our analysis reveals current trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained, scalable evaluation methods.
2025
SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature
David Wadden | Kejian Shi | Jacob Morrison | Alan Li | Aakanksha Naik | Shruti Singh | Nitzan Barzilay | Kyle Lo | Tom Hope | Luca Soldaini | Shannon Zejiang Shen | Doug Downey | Hannaneh Hajishirzi | Arman Cohan
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following instances for training and evaluation, covering 54 tasks. These tasks span five core scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. SciRIFF is unique in being the only entirely expert-written, high-quality instruction-following dataset designed for extracting and synthesizing information from research literature across diverse scientific fields. It features complex instructions with long input contexts, detailed task descriptions, and structured outputs. To demonstrate its utility, we finetune a series of large language models (LLMs) using a mix of general-domain and SciRIFF instructions. On nine out-of-distribution held-out tasks (referred to as SciRIFF-Eval), LLMs finetuned on SciRIFF achieve 70.6% average improvement over our baselines trained only on general-domain instructions. SciRIFF facilitates the development and evaluation of LLMs to help researchers navigate the rapidly growing body of scientific literature.
2024
Summarization-Based Document IDs for Generative Retrieval with Language Models
Alan Li | Daniel Cheng | Phillip Keung | Jungo Kasai | Noah A. Smith
Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia
Generative retrieval (Wang et al., 2022; Tay et al., 2022) is a popular approach for end-to-end document retrieval that directly generates document identifiers given an input query. We introduce summarization-based document IDs, in which each document’s ID is composed of an extractive summary or abstractive keyphrases generated by a language model, rather than an integer ID sequence or bags of n-grams as proposed in past work. We find that abstractive, content-based IDs (ACID) and an ID based on the first 30 tokens are very effective in direct comparisons with previous approaches to ID creation. We show that using ACID improves top-10 and top-20 recall by 15.6% and 14.4% (relative) respectively versus the cluster-based integer ID baseline on the MSMARCO 100k retrieval task, and 9.8% and 9.9% respectively on the Wikipedia-based NQ 100k retrieval task. Our results demonstrate the effectiveness of human-readable, natural-language IDs created through summarization for generative retrieval. We also observed that extractive IDs outperformed abstractive IDs on Wikipedia articles in NQ but not the snippets in MSMARCO, which suggests that document characteristics affect generative retrieval performance.