Xinyue Liang
2026
EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios
Bin Xu | Yu Bai | Huashan Sun | Yiguan Lin | Siming Liu | Xinyue Liang | Yaolin Li | Zhuangzhi Dong | Jingren Zhang | Yufan Deng | Xinyu Zou | Yang Gao | Heyan Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, incorporating synthetic data that covers 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a set of multi-dimensional evaluation metrics that cover 12 critical aspects relevant to both teachers and students. We further apply human annotation to ensure the effectiveness of the model-generated evaluation responses. Additionally, we successfully train a relatively small-scale model on our constructed dataset and demonstrate that it can achieve performance comparable to state-of-the-art large models (e.g., DeepSeek V3, Qwen Max) on the test set. Overall, this work provides a practical foundation for the development and evaluation of education-oriented language models.
2024
Bit_numeval at SemEval-2024 Task 7: Enhance Numerical Sensitivity and Reasoning Completeness for Quantitative Understanding
Xinyue Liang | Jiawei Li | Yizhe Yang | Yang Gao
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
In this paper, we describe the methods used for Quantitative Natural Language Inference (QNLI) and Quantitative Question Answering (QQA) in Task 1 of SemEval-2024 NumEval. The challenge focuses on enhancing the model’s quantitative understanding, thereby improving its performance on certain tasks. We approach this task from two perspectives: (1) By integrating real-world numerical comparison data during the supervised fine-tuning (SFT) phase, we enhance the model’s numerical sensitivity. (2) We develop an innovative reward model scoring mechanism, leveraging reinforcement learning from human feedback (RLHF) techniques to improve the model’s reasoning completeness.