Kunlun Zhu


2025

RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework
Kunlun Zhu | Yifan Luo | Dingling Xu | Yukun Yan | Zhenghao Liu | Shi Yu | Ruobing Wang | Shuo Wang | Yishan Li | Nan Zhang | Xu Han | Zhiyuan Liu | Maosong Sun
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Retrieval-Augmented Generation (RAG) is a powerful approach that enables large language models (LLMs) to incorporate external knowledge. However, evaluating the effectiveness of RAG systems in specialized scenarios remains challenging due to the high costs of data construction and the lack of suitable evaluation metrics. This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. With a focus on factual accuracy, we propose three novel metrics—Completeness, Hallucination, and Irrelevance—to evaluate LLM-generated responses rigorously. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples. Furthermore, the use of LLMs for scoring the proposed metrics demonstrates a high level of consistency with human evaluations. RAGEval establishes a new paradigm for evaluating RAG systems in real-world applications. The code and dataset are released at https://github.com/OpenBMB/RAGEval.
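A minimal sketch of how keypoint-style factual metrics of this kind can be computed. The per-keypoint labels ("covered", "contradicted", "missing") and the ratio-based scoring below are illustrative assumptions, not RAGEval's exact definitions; in practice the judgments would come from an LLM judge comparing the generated response against the reference answer.

```python
from dataclasses import dataclass

# Hypothetical per-keypoint judgment, e.g. produced by an LLM judge comparing
# the generated answer against the ground-truth key points (assumed labels).
@dataclass
class KeyPointJudgment:
    key_point: str
    label: str  # one of: "covered", "contradicted", "missing"

def score_response(judgments: list[KeyPointJudgment]) -> dict[str, float]:
    """Turn per-keypoint judgments into Completeness / Hallucination / Irrelevance
    style ratios (an illustrative formulation, not the paper's exact metric)."""
    n = len(judgments)
    if n == 0:
        return {"completeness": 0.0, "hallucination": 0.0, "irrelevance": 0.0}
    covered = sum(j.label == "covered" for j in judgments)
    contradicted = sum(j.label == "contradicted" for j in judgments)
    missing = n - covered - contradicted
    return {
        "completeness": covered / n,        # share of key points stated correctly
        "hallucination": contradicted / n,  # share of key points contradicted
        "irrelevance": missing / n,         # share of key points omitted
    }

if __name__ == "__main__":
    judgments = [
        KeyPointJudgment("The contract was signed in 2021.", "covered"),
        KeyPointJudgment("The penalty clause is 5% per month.", "contradicted"),
        KeyPointJudgment("Either party may terminate with 30 days notice.", "missing"),
    ]
    print(score_response(judgments))
```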

MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents
Kunlun Zhu | Hongyi Du | Zhaochen Hong | Xiaocheng Yang | Shuyi Guo | Zhe Wang | Zhenhailong Wang | Cheng Qian | Robert Tang | Heng Ji | Jiaxuan You
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/ulab-uiuc/MARBLE.
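A minimal sketch of how the named coordination topologies can be represented as directed communication graphs over agents. The builder functions and agent role names are illustrative assumptions, not MARBLE's actual protocol definitions or API.

```python
import itertools

# Each protocol is modeled as a mapping: agent id -> list of agents it may message.

def star(agents: list[str]) -> dict[str, list[str]]:
    """First agent acts as coordinator; all others talk only to it."""
    hub, spokes = agents[0], agents[1:]
    graph = {hub: list(spokes)}
    graph.update({a: [hub] for a in spokes})
    return graph

def chain(agents: list[str]) -> dict[str, list[str]]:
    """Each agent passes its output to the next agent in sequence."""
    return {a: ([b] if b else []) for a, b in itertools.zip_longest(agents, agents[1:])}

def tree(agents: list[str]) -> dict[str, list[str]]:
    """Binary tree: agent i delegates to agents 2i+1 and 2i+2."""
    n = len(agents)
    return {agents[i]: [agents[j] for j in (2 * i + 1, 2 * i + 2) if j < n]
            for i in range(n)}

def graph(agents: list[str]) -> dict[str, list[str]]:
    """Fully connected: every agent may message every other agent."""
    return {a: [b for b in agents if b != a] for a in agents}

if __name__ == "__main__":
    team = ["planner", "coder", "tester", "reviewer"]  # hypothetical roles
    for build in (star, chain, tree, graph):
        print(build.__name__, build(team))
```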

SafeScientist: Enhancing AI Scientist Safety for Risk-Aware Scientific Discovery
Kunlun Zhu | Jiaxun Zhang | Ziheng Qi | Nuoxing Shang | Zijia Liu | Peixuan Han | Yue Su | Haofei Yu | Jiaxuan You
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Recent advancements in large language model (LLM) agents have significantly accelerated scientific discovery automation, yet concurrently raised critical ethical and safety concerns. To systematically address these challenges, we introduce **SafeScientist**, an innovative AI scientist framework explicitly designed to enhance safety and ethical responsibility in AI-driven scientific exploration. SafeScientist proactively refuses ethically inappropriate or high-risk tasks and rigorously emphasizes safety throughout the research process. To achieve comprehensive safety oversight, we integrate multiple defensive mechanisms, including prompt monitoring, agent-collaboration monitoring, tool-use monitoring, and an ethical reviewer component. Complementing SafeScientist, we propose **SciSafetyBench**, a novel benchmark specifically designed to evaluate AI safety in scientific contexts, comprising 240 high-risk scientific tasks across 6 domains, alongside 30 specially designed scientific tools and 120 tool-related risk tasks. Extensive experiments demonstrate that SafeScientist significantly improves safety performance by 35% compared to traditional AI scientist frameworks, without compromising scientific output quality. Additionally, we rigorously validate the robustness of our safety pipeline against diverse adversarial attack methods, further confirming the effectiveness of our integrated approach. The code and data will be available at https://github.com/ulab-uiuc/SafeScientist. **Warning**: this paper contains example data that may be offensive or harmful.
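A minimal sketch of a layered safety gate in the spirit of the monitors described above. The monitor names mirror the components listed in the abstract (prompt monitoring, tool-use monitoring, ethical review), but the checks themselves are placeholder heuristics and the action schema is assumed; this is not SafeScientist's actual implementation.

```python
from typing import Callable

Monitor = Callable[[dict], str | None]  # returns a refusal reason, or None to allow

def prompt_monitor(action: dict) -> str | None:
    # Placeholder keyword screen standing in for a learned prompt monitor.
    banned = ("synthesize pathogen", "bypass safety")
    text = action.get("prompt", "").lower()
    return next((f"blocked prompt keyword: {kw}" for kw in banned if kw in text), None)

def tool_use_monitor(action: dict) -> str | None:
    # Placeholder allowlist standing in for tool-use monitoring.
    allowed_tools = {"literature_search", "simulation", "data_analysis"}
    tool = action.get("tool")
    return None if tool in allowed_tools else f"tool not on allowlist: {tool}"

def ethical_reviewer(action: dict) -> str | None:
    # Placeholder rule standing in for an ethical reviewer component.
    return "requires human sign-off" if action.get("risk_level") == "high" else None

def run_with_safeguards(action: dict, monitors: list[Monitor]) -> str:
    """Every monitor gets a chance to veto the proposed action before execution."""
    for monitor in monitors:
        reason = monitor(action)
        if reason is not None:
            return f"REFUSED: {reason}"
    return f"EXECUTING: {action['tool']}"

if __name__ == "__main__":
    action = {"prompt": "Plan an experiment on reagent stability",
              "tool": "simulation", "risk_level": "low"}
    print(run_with_safeguards(action, [prompt_monitor, tool_use_monitor, ethical_reviewer]))
```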

TinyScientist: An Interactive, Extensible, and Controllable Framework for Building Research Agents
Haofei Yu | Keyang Xuan | Fenghai Li | Kunlun Zhu | Zijie Lei | Jiaxun Zhang | Ziheng Qi | Kyle Richardson | Jiaxuan You
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Automatic research with Large Language Models (LLMs) is rapidly gaining importance, driving the development of increasingly complex workflows involving multi-agent systems, planning, tool usage, code execution, and human-agent interaction to accelerate research processes. However, as more researchers and developers begin to use and build upon these tools and platforms, the complexity and difficulty of extending and maintaining such agentic workflows have become a significant challenge, particularly as algorithms and architectures continue to advance. To address this growing complexity, TinyScientist identifies the essential components of the automatic research workflow and proposes an interactive, extensible, and controllable framework that adapts easily to new tools and supports iterative growth. We provide an open-source codebase, an interactive web demonstration, and a PyPI Python package to make state-of-the-art auto-research pipelines broadly accessible to every researcher and developer.

2023

WebCPM: Interactive Web Search for Chinese Long-form Question Answering
Yujia Qin | Zihan Cai | Dian Jin | Lan Yan | Shihao Liang | Kunlun Zhu | Yankai Lin | Xu Han | Ning Ding | Huadong Wang | Ruobing Xie | Fanchao Qi | Zhiyuan Liu | Maosong Sun | Jie Zhou
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Long-form question answering (LFQA) aims at answering complex, open-ended questions with detailed, paragraph-length responses. The de facto paradigm of LFQA necessitates two procedures: information retrieval, which searches for relevant supporting facts, and information synthesis, which integrates these facts into a coherent answer. In this paper, we introduce WebCPM, the first Chinese LFQA dataset. One unique feature of WebCPM is that its information retrieval is based on interactive web search, which engages with a search engine in real time. Following WebGPT, we develop a web search interface. We recruit annotators to search for relevant information using our interface and then answer questions, and their web search behaviors are recorded. In total, we collect 5,500 high-quality question-answer pairs, together with 15,372 supporting facts and 125,954 web search actions. We fine-tune pre-trained language models to imitate human behaviors for web search and to generate answers based on the collected facts. Our LFQA pipeline, built on these fine-tuned models, generates answers that are no worse than human-written ones in 32.5% and 47.5% of the cases on our dataset and DuReader, respectively. The interface, dataset, and code are publicly available at https://github.com/thunlp/WebCPM.
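A minimal sketch of the two-stage retrieve-then-synthesize loop described above: a search policy issues search actions until it decides to stop, and a synthesis step composes the long-form answer from the collected facts. All three model calls are stubs for illustration; the real WebCPM pipeline uses fine-tuned language models and a live search engine for these roles.

```python
def search_policy(question: str, facts: list[str]) -> str:
    """Stub for the search model: propose the next search action, or stop."""
    return "STOP" if len(facts) >= 3 else f"SEARCH: {question} (step {len(facts) + 1})"

def execute_search(action: str) -> str:
    """Stub for the interactive search-engine call: return a supporting fact."""
    return f"fact retrieved for [{action}]"

def synthesize(question: str, facts: list[str]) -> str:
    """Stub for the synthesis model: compose a long-form answer from facts."""
    return f"Answer to '{question}' based on {len(facts)} supporting facts."

def answer(question: str) -> str:
    facts: list[str] = []
    while True:
        action = search_policy(question, facts)
        if action == "STOP":
            break
        facts.append(execute_search(action))
    return synthesize(question, facts)

if __name__ == "__main__":
    print(answer("Why is the sky blue?"))
```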