Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

Yuming Yang; Mingyoung Lai; Wanxu Zhao; Xiaoran Fan; Zhiheng Xi; Mingqi Wu; Chiyue Huang; Jun Zhao; Haijun Lv; Jian Tong; Yunhua Zhou; Yicheng Zou; Qipeng Guo; Tao Gui; Qi Zhang; Xuan-Jing Huang (黄萱菁)

Abstract

Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that align closely with the student model's current behavior but overlooking more informative ones. To address this, we propose the Rank–Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically balance learning signal strength and behavioral alignment by combining low absolute probability with relatively high-ranked tokens under the student model. Concretely, RSR is defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, and it is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training reasoning performance (average Spearman correlation of 0.86), consistently outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.
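The abstract defines RSR in a single sentence: for a trajectory scored token by token under the student, RSR = (mean token rank) / (mean surprisal), where a token's rank is its position in the student's sorted next-token distribution and its surprisal is the negative log-likelihood -log p. The Python sketch below follows that sentence literally and is not the authors' implementation. Its assumptions, none of which come from the paper: ranks are 1-based with ties resolved in the target token's favor; there is no rank clipping or token weighting; the student's per-position logits come from a teacher-forced forward pass over the trajectory; and a lower RSR is treated as more suitable, following the abstract's motivation (small rank values combined with high surprisal give a small ratio). The function names `token_stats`, `rank_surprisal_ratio`, `select_trajectory`, and `select_teacher` are hypothetical, and the two selection helpers show only one plausible way to use the score.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical representation of a trajectory as seen by the student:
# one logit vector per position (teacher-forced) plus the token ids
# the teacher actually emitted at those positions.
Trajectory = Tuple[Sequence[Sequence[float]], Sequence[int]]


def token_stats(logits: Sequence[float], target: int) -> Tuple[int, float]:
    """Return (rank, surprisal) of `target` under one position's logits.

    rank is 1-based (1 = the student's most probable token); ties are
    resolved in the target's favor (assumption). surprisal = -log p(target).
    """
    # Numerically stable log-sum-exp gives the softmax normalizer log Z.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    surprisal = log_z - logits[target]
    # Rank = 1 + number of tokens the student scores strictly higher.
    rank = 1 + sum(1 for x in logits if x > logits[target])
    return rank, surprisal


def rank_surprisal_ratio(
    step_logits: Sequence[Sequence[float]], targets: Sequence[int]
) -> float:
    """Average token rank divided by average surprisal over one trajectory."""
    if not targets or len(step_logits) != len(targets):
        raise ValueError("need one logit vector per target token")
    stats = [token_stats(lg, t) for lg, t in zip(step_logits, targets)]
    avg_rank = sum(r for r, _ in stats) / len(stats)
    avg_nll = sum(s for _, s in stats) / len(stats)
    if avg_nll <= 0.0:
        # Student already predicts every token with certainty: no signal.
        return math.inf
    return avg_rank / avg_nll


def select_trajectory(candidates: Sequence[Trajectory]) -> Trajectory:
    """Hypothetical: keep the candidate with the smallest RSR (lower = better)."""
    return min(candidates, key=lambda c: rank_surprisal_ratio(*c))


def select_teacher(pools: Dict[str, List[Trajectory]]) -> str:
    """Hypothetical: pick the teacher whose trajectories have the lowest mean RSR."""
    def mean_rsr(name: str) -> float:
        scores = [rank_surprisal_ratio(*t) for t in pools[name]]
        return sum(scores) / len(scores)

    return min(pools, key=mean_rsr)
```

For example, a one-token trajectory whose token the student already assigns about 99% probability (logits `[5.0, 0.0, 0.0]`, target `0`) has rank 1 and surprisal of about 0.013, so its RSR is roughly 75. A trajectory whose top-ranked token receives only about 37% probability (logits `[1.0, 0.9, 0.8]`, target `0`) scores close to 1, so `select_trajectory` prefers it. With real vocabularies of tens of thousands of tokens, a vectorized tensor implementation would replace the pure-Python loops used here for readability.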

Anthology ID: 2026.acl-long.1950
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 42123–42150
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1950/
Cite (ACL): Yuming Yang, Mingyoung Lai, Wanxu Zhao, Xiaoran Fan, Zhiheng Xi, Mingqi Wu, Chiyue Huang, Jun Zhao, Haijun Lv, Jian Tong, Yunhua Zhou, Yicheng Zou, Qipeng Guo, Tao Gui, Qi Zhang, and Xuanjing Huang. 2026. Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 42123–42150, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment (Yang et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1950.pdf
Checklist: 2026.acl-long.1950.checklist.pdf

PDF Cite Search Checklist Fix data