Yi Zhan
2026
Eval-RAR: Evaluation-Driven Retrieval-Augmented Reasoning via Reinforcement Learning
Heng Yu | Rui Li | Qi Liu | Wenjun Feng | Junfeng Kang | Yi Zhan
Findings of the Association for Computational Linguistics: ACL 2026
Retrieval-augmented generation (RAG) effectively extends the knowledge boundaries of large language models (LLMs) for complex tasks, yet current paradigms typically optimize for an interleaving of reasoning and retrieval, where models fail to critically evaluate retrieved information against the target question. Most existing methods rely on sparse outcome-based rewards, failing to provide explicit supervision for the internal reasoning process or to diagnose information inadequacy. To address this, we propose Eval-RAR, an Evaluation-driven Retrieval-Augmented Reasoning framework. Eval-RAR introduces a "Search-then-Evaluate" paradigm where the model performs explicit self-evaluation after each search step, generating a rationale to either identify sufficient evidence or specify missing information to guide subsequent queries. To optimize this process, we employ reinforcement learning with a fine-grained evaluation reward, providing intermediate feedback that encourages the model to track core entities and maintain logical consistency. Experiments on seven single-hop and multi-hop QA benchmarks demonstrate that Eval-RAR outperforms existing methods.
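The "Search-then-Evaluate" loop described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Evaluation`, `search_then_evaluate`, `toy_search`, `toy_evaluate`, and the two-entry corpus are all hypothetical, and the reinforcement-learning stage with its evaluation reward is not modeled. The sketch only shows the control flow in which each search step is followed by an explicit evaluation that either declares the evidence sufficient or names the missing information used as the next query.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Evaluation:
    """Hypothetical record of one self-evaluation step."""
    sufficient: bool   # True if the evidence is judged enough to answer
    rationale: str     # free-text justification for the verdict
    missing: str = ""  # what to look for next when evidence is insufficient


def search_then_evaluate(
    question: str,
    search: Callable[[str], List[str]],
    evaluate: Callable[[str, List[str]], Evaluation],
    max_steps: int = 4,
) -> Tuple[List[str], List[Evaluation]]:
    """Alternate search and evaluation until evidence is judged sufficient."""
    evidence: List[str] = []
    trace: List[Evaluation] = []
    query = question
    for _ in range(max_steps):
        evidence.extend(search(query))
        verdict = evaluate(question, evidence)
        trace.append(verdict)
        if verdict.sufficient:
            break
        # The evaluation's "missing information" guides the next query.
        query = verdict.missing or question
    return evidence, trace


# Toy stand-ins (assumptions for illustration only).
CORPUS = {
    "author of book x": "Book X was written by Ada Doe.",
    "ada doe birthplace": "Ada Doe was born in Lyon.",
}


def toy_search(query: str) -> List[str]:
    """Return passages whose key phrase occurs in the lowercased query."""
    q = query.lower()
    return [text for key, text in CORPUS.items() if key in q]


def toy_evaluate(question: str, evidence: List[str]) -> Evaluation:
    """Rule-based stand-in for the model's self-evaluation."""
    if any("born in" in passage for passage in evidence):
        return Evaluation(True, "Birthplace found in evidence.")
    if any("Ada Doe" in passage for passage in evidence):
        return Evaluation(False, "Author known, birthplace missing.",
                          missing="ada doe birthplace")
    return Evaluation(False, "No relevant evidence yet.", missing=question)


evidence, trace = search_then_evaluate(
    "Where was the author of Book X born?", toy_search, toy_evaluate
)
```

In this toy run the first evaluation reports the birthplace as missing, and the second search retrieves it. In a trained system the search and evaluation callables would be a retriever and the language model itself.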
LongTutor: Benchmarking Large Language Models for Long-term Personalized Tutoring
Ning Li | Zheng Zhang | Zhenya Huang | Rui Li | Yi Zhan | Yinbo Luo | Qi Liu | Enhong Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rapid advancement of large language models (LLMs) has driven the deployment of LLM-based AI tutors on online learning platforms. This widespread adoption highlights an urgent need for systematic benchmarks to evaluate their tutoring capabilities. However, existing evaluations predominantly focus on isolated, short-term interactions, overlooking the inherently long-term nature of learning. To bridge this gap, we introduce LongTutor, a benchmark for long-term personalized tutoring grounded in formative assessment theory. Built from expert-annotated real-world learning logs, LongTutor evaluates LLMs across three progressive tasks: historical evidence acquisition, knowledge state diagnosis, and adaptive teaching action. Our experiments reveal a critical capability mismatch: while LLMs excel at evidence acquisition, they struggle to effectively leverage long-term history for accurate diagnosis and adaptive teaching. To enable scalable benchmark expansion, we further propose an automated generator–verifier pipeline, paving the way toward truly long-term AI tutoring systems.