Siming Liu
2026
EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios
Bin Xu | Yu Bai | Huashan Sun | Yiguan Lin | Siming Liu | Xinyue Liang | Yaolin Li | Zhuangzhi Dong | Jingren Zhang | Yufan Deng | Xinyu Zou | Yang Gao | Heyan Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, built from synthetic data covering 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a set of multi-dimensional evaluation metrics that cover 12 critical aspects relevant to both teachers and students. We further apply human annotation to ensure the effectiveness of the model-generated evaluation responses. Additionally, we train a relatively small-scale model on our constructed dataset and demonstrate that it achieves performance comparable to state-of-the-art large models (e.g., DeepSeek V3, Qwen Max) on the test set. Overall, this work provides a practical foundation for the development and evaluation of education-oriented language models.