EvoBench: Towards Real-world LLM-Generated Text Detection Benchmarking for Evolving Large Language Models
Xiao Yu, Yi Yu, Dongrui Liu, Kejiang Chen, Weiming Zhang, Nenghai Yu, Jing Shao
Abstract
With the widespread use of Large Language Models (LLMs), there has been an increasing need to detect LLM-generated text, prompting extensive research in this area. However, existing detection methods are mainly evaluated on static benchmarks, which neglect the evolving nature of LLMs. Relying on existing static benchmarks can create a misleading sense of security, overestimating the real-world effectiveness of detection methods. To bridge this gap, we introduce EvoBench, a dynamic benchmark that considers a new dimension of generalization across continuously evolving LLMs. EvoBench categorizes evolving LLMs into (1) updates over time and (2) developments such as fine-tuning and pruning, covering 7 LLM families and their 29 evolving versions. To measure generalization across evolving LLMs, we introduce a new EMG (Evolving Model Generalization) metric. Our evaluation of 14 detection methods on EvoBench reveals that they all struggle to maintain generalization when confronted with evolving LLMs. To mitigate the generalization problems, we further propose improvement strategies. For zero-shot detectors, we propose pruning the scoring model to extract shared features. For supervised detectors, we also propose a practical training strategy. Our research sheds light on critical challenges in real-world LLM-generated text detection and represents a significant step toward practical applications.
- Anthology ID: 2025.findings-acl.754
- Volume: Findings of the Association for Computational Linguistics: ACL 2025
- Month: July
- Year: 2025
- Address: Vienna, Austria
- Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 14605–14620
- URL: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.754/
- Cite (ACL): Xiao Yu, Yi Yu, Dongrui Liu, Kejiang Chen, Weiming Zhang, Nenghai Yu, and Jing Shao. 2025. EvoBench: Towards Real-world LLM-Generated Text Detection Benchmarking for Evolving Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 14605–14620, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal): EvoBench: Towards Real-world LLM-Generated Text Detection Benchmarking for Evolving Large Language Models (Yu et al., Findings 2025)
- PDF: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.754.pdf
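For readers unfamiliar with the zero-shot detectors the abstract refers to, the sketch below illustrates the general family of likelihood-based zero-shot detection: a scoring model assigns probabilities to the candidate text, and the average per-token log-likelihood is compared against a threshold. This is a minimal illustration of the general technique only, not EvoBench's EMG metric or its proposed pruning strategy; the scoring model `gpt2`, the `threshold` value, and the helper names are hypothetical placeholders.

```python
# Minimal sketch of a likelihood-based zero-shot detector (illustrative only;
# not the method proposed in EvoBench). Model name and threshold are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder scoring model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def avg_log_likelihood(text: str) -> float:
    """Mean per-token log-probability of `text` under the scoring model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # `out.loss` is the mean negative log-likelihood over the tokens.
    return -out.loss.item()

def is_llm_generated(text: str, threshold: float = -3.0) -> bool:
    """Flag text as likely machine-generated when its average log-likelihood
    exceeds a threshold (the value here is hypothetical, for illustration)."""
    return avg_log_likelihood(text) > threshold

if __name__ == "__main__":
    sample = "Large language models are trained on vast corpora of text."
    print(avg_log_likelihood(sample), is_llm_generated(sample))
```

In practice, such detectors are calibrated per domain and per scoring model; EvoBench's finding is that this kind of calibration degrades as the generating LLMs evolve, which motivates the generalization-oriented evaluation the paper proposes.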