Yifan Zhu
2026
Beyond Pedagogical Principles: Multi-Horizon Preference Optimization for Efficient Socratic Tutoring
Xin Shi | Chao Zhang | Yifan Zhu | Xueqiao Zhang | Yawei Luo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The development of LLM-based tutor agents faces the challenge of simultaneously ensuring adherence to pedagogical principles and achieving optimal pedagogical effectiveness, particularly in dynamic, multi-turn interactions. Existing methods are often constrained by static data or by sparse reward signals in online settings. To address this gap, we propose Multi-Horizon Preference Optimization (MHPO), a novel framework that iteratively refines tutor agents using a multi-horizon reward function within a dynamic teacher-student simulation environment. Specifically, this reward function is designed to capture both turn-level pedagogical quality and trajectory-level pedagogical effectiveness, with the latter estimated via Monte Carlo rollouts. We further investigate two distinct strategies for aggregating these rewards for policy optimization. Our experiments demonstrate that MHPO significantly enhances base model performance, achieving a superior balance between principles and effectiveness compared to various baselines.
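The abstract does not give the reward formula or the two aggregation strategies, so the following is only a minimal sketch of the general idea under stated assumptions: a turn-level score and a Monte Carlo estimate of a trajectory-level outcome are combined by a weighted sum (an assumed aggregation, not necessarily either of the paper's), and the best- and worst-scoring candidate tutor turns form a preference pair. All names here (`turn_quality`, `rollout`, `weight`, `build_preference_pair`) are hypothetical and do not come from the paper.

```python
import random
from statistics import mean
from typing import Callable, List, Tuple


def trajectory_reward(
    rollout: Callable[[str, random.Random], float],
    candidate: str,
    n_rollouts: int,
    rng: random.Random,
) -> float:
    """Monte Carlo estimate of trajectory-level effectiveness.

    `rollout` is a hypothetical simulator: it continues the dialogue after
    `candidate` and returns a scalar outcome (e.g. whether the simulated
    student ends up solving the problem). We average over `n_rollouts` samples.
    """
    return mean(rollout(candidate, rng) for _ in range(n_rollouts))


def multi_horizon_reward(turn_score: float, traj_score: float, weight: float = 0.5) -> float:
    """Assumed aggregation: a convex combination of the two horizons."""
    return (1.0 - weight) * turn_score + weight * traj_score


def build_preference_pair(
    candidates: List[str],
    turn_quality: Callable[[str], float],
    rollout: Callable[[str, random.Random], float],
    n_rollouts: int = 8,
    weight: float = 0.5,
    seed: int = 0,
) -> Tuple[str, str]:
    """Score each candidate tutor turn and return (chosen, rejected)."""
    rng = random.Random(seed)
    scored = []
    for cand in candidates:
        reward = multi_horizon_reward(
            turn_quality(cand),
            trajectory_reward(rollout, cand, n_rollouts, rng),
            weight,
        )
        scored.append((reward, cand))
    scored.sort(key=lambda pair: pair[0])  # ascending by combined reward
    return scored[-1][1], scored[0][1]
```

In a setup like this, the resulting (chosen, rejected) pairs would feed a preference-optimization update of the tutor policy, and the loop would repeat with the refined policy generating new candidates. The `weight` parameter is where the trade-off between principle adherence and effectiveness would be tuned under this assumed aggregation.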