SimulBench: Evaluating Language Models with Creative Simulation Tasks

Qi Jia; Xiang Yue; Tuney Zheng; Jie Huang; Bill Yuchen Lin

Abstract

We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation tasks, such as acting as a Linux terminal or playing text games with users. While these simulation tasks serve as effective measures of an LLM’s general intelligence, they are seldom incorporated into existing benchmarks. A major challenge is to develop an evaluation framework that tests different LLMs fairly while preserving the multi-round interactive nature of simulation tasks between users and AI. To tackle this issue, we propose first using a fixed LLM as a user agent that engages with an LLM to collect dialogues under different tasks. Challenging dialogue scripts are then extracted for evaluating different target LLMs. To facilitate automatic assessment on SimulBench, GPT-4 is employed as the evaluator, tasked with reviewing the quality of the final response generated by each target LLM given the multi-turn dialogue scripts. Our comprehensive experiments indicate that these creative simulation tasks, with their unique nature, continue to pose a significant challenge, and they reveal a gap between proprietary models and the most advanced open LLMs. For example, GPT-4-turbo outperforms LLaMA-3-70b-Chat on 18.55% more cases.

Anthology ID: 2025.findings-naacl.453
Volume: Findings of the Association for Computational Linguistics: NAACL 2025
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 8118–8131
URL: https://preview.aclanthology.org/landing_page/2025.findings-naacl.453/
Cite (ACL): Qi Jia, Xiang Yue, Tuney Zheng, Jie Huang, and Bill Yuchen Lin. 2025. SimulBench: Evaluating Language Models with Creative Simulation Tasks. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 8118–8131, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): SimulBench: Evaluating Language Models with Creative Simulation Tasks (Jia et al., Findings 2025)
PDF: https://preview.aclanthology.org/landing_page/2025.findings-naacl.453.pdf
