LongTutor: Benchmarking Large Language Models for Long-term Personalized Tutoring

Ning Li; Zheng Zhang; Zhenya Huang; Rui Li; Yi Zhan; Yinbo Luo; Qi Liu; Enhong Chen

Abstract

The rapid advancement of large language models (LLMs) has driven the deployment of LLM-based AI tutors on online learning platforms. This widespread adoption highlights an urgent need for systematic benchmarks to evaluate their tutoring capabilities. However, existing evaluations predominantly focus on isolated, short-term interactions, overlooking the inherently long-term nature of learning. To bridge this gap, we introduce LongTutor, a benchmark for long-term personalized tutoring grounded in formative assessment theory. Built from expert-annotated real-world learning logs, LongTutor evaluates LLMs across three progressive tasks: historical evidence acquisition, knowledge state diagnosis, and adaptive teaching action. Our experiments reveal a critical capability mismatch: while LLMs excel at evidence acquisition, they struggle to effectively leverage long-term history for accurate diagnosis and adaptive teaching. To enable scalable benchmark expansion, we further propose an automated generator–verifier pipeline, paving the way toward truly long-term AI tutoring systems.

Anthology ID: 2026.acl-long.1371
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 29712–29737
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1371/
Cite (ACL): Ning Li, Zheng Zhang, Zhenya Huang, Rui Li, Yi Zhan, Yinbo Luo, Qi Liu, and Enhong Chen. 2026. LongTutor: Benchmarking Large Language Models for Long-term Personalized Tutoring. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 29712–29737, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): LongTutor: Benchmarking Large Language Models for Long-term Personalized Tutoring (Li et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1371.pdf
Checklist: 2026.acl-long.1371.checklist.pdf
