Zhoufutu Wen
2026
CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs
Siyi Li | Jiajun Shi | Shiwen Ni | Ge Zhang | Shuaimin Li | Shijian Wang | Zhoufutu Wen | Yizhi LI | Hamid Alinejad-Rokny | Jiaheng Liu | Min Yang | Wenhao Huang
Findings of the Association for Computational Linguistics: ACL 2026
Large Reasoning Models (LRMs) have demonstrated strong performance by producing extended Chain-of-Thought (CoT) traces before answering. However, this paradigm often induces over-reasoning: redundant calculations and circular self-verification that increase computational cost without improving outcomes. Existing evaluations largely emphasize final accuracy or coarse token counts, and lack automated tools to separate essential logic from structural redundancy. We introduce CoTJudger, a graph-driven framework that quantifies reasoning efficiency by converting free-form CoTs into directed dependency graphs and extracting the Shortest Effective Path (SEP) needed to reach a correct solution. This yields an interpretable efficiency signal – how much of a CoT is necessary versus structurally redundant – that is comparable across models and tasks. Evaluating 21 LRMs, CoTJudger reveals pervasive redundancy and surfaces recurring failure modes, including verification obsession and compensatory redundancy. These results provide a practical metric for disentangling reasoning ability from computational waste, enabling more targeted evaluation and diagnosis of LRM efficiency.
When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors
Chenghao Yang | Yuning Zhang | Zhoufutu Wen | Tao Gong | Jiaheng Liu | Qi Chu | Nenghai Yu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Model distillation is a primary driver behind the rapid progress of LLM agents, yet it often leads to behavioral homogenization. Many emerging agents share nearly identical reasoning steps and failure modes, suggesting they may be distilled echoes of a few dominant teachers. Existing metrics, however, fail to distinguish mandatory behaviors required for task success from non-mandatory patterns that reflect a model’s autonomous preferences. We propose two complementary metrics to isolate non-mandatory behavioral patterns: Response Pattern Similarity (RPS) for verbal alignment and Action Graph Similarity (AGS) for tool-use habits modeled as directed graphs. Evaluating 18 models from 8 providers on 𝜏-Bench and 𝜏2-Bench against Claude Sonnet 4.5 (thinking), we find that within-family model pairs score 5.9 pp higher in AGS than cross-family pairs, and that Kimi-K2 (thinking) reaches 82.6% Snode and 94.7% Sdep, exceeding Anthropic’s own Opus 4.1. A controlled distillation experiment further confirms that AGS distinguishes teacher-specific convergence from general improvement. RPS and AGS capture distinct behavioral dimensions (Pearson r = 0.491), providing complementary diagnostic signals for behavioral convergence in the agent ecosystem. Our code is available at https://github.com/Syuchin/AgentEcho.
2025
MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation
Chenghao Yang | Yinbo Luo | Zhoufutu Wen | Qi Chu | Tao Gong | Longxiang Liu | Kaiyuan Zhang | Jianpeng Jiao | Ge Zhang | Wenhao Huang | Nenghai Yu
Findings of the Association for Computational Linguistics: EMNLP 2025
Large Language Models (LLMs), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, LLMs’ robustness, especially in handling long, complex dialogue sessions that involve frequent motivation transfer and sophisticated cross-turn dependency, has long been criticized. Nevertheless, no existing benchmark fully reflects these weaknesses. We present MARS-Bench, a Multi-turn Athletic Real-world Scenario Dialogue Benchmark, designed to fill this gap. MARS-Bench is constructed from play-by-play text commentary so as to feature realistic dialogues specifically designed to evaluate three critical aspects of multi-turn conversations: ultra multi-turn, interactive multi-turn, and cross-turn tasks. Extensive experiments on MARS-Bench reveal that closed-source LLMs significantly outperform open-source alternatives, that explicit reasoning significantly boosts LLMs’ robustness in handling long, complex dialogue sessions, and that LLMs indeed face significant challenges when handling motivation transfer and sophisticated cross-turn dependency. Moreover, we provide a mechanistic interpretability analysis, based on an attention visualization experiment in Qwen2.5-7B-Instruction, of how attention sinks due to special tokens lead to LLMs’ performance degradation when handling long, complex dialogue sessions.
Quantification of Large Language Model Distillation
Sunbowen Lee | Junting Zhou | Chang Ao | Kaige Li | Xeron Du | Sirui He | Haihong Wu | Tianci Liu | Jiaheng Liu | Hamid Alinejad-Rokny | Min Yang | Yitao Liang | Zhoufutu Wen | Shiwen Ni
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Model distillation is a fundamental technique in building large language models (LLMs), transferring knowledge from a teacher model to a student model. However, distillation can lead to model homogenization, reducing diversity among models and impairing their ability to robustly handle complex or novel tasks. These limitations underscore the need to systematically quantify the distillation process and its impact. In this work, we propose a framework to evaluate and quantify model distillation. Our method addresses two key aspects: (1) Identifying identity cognition contradictions to assess discrepancies in how models perceive and represent identity-related information, and (2) Analyzing multi-granularity response similarities across models to measure the extent of homogenization. Experimental results demonstrate two key insights: (1) Well-known closed-source and open-source LLMs usually exhibit high distillation degrees, except for Claude, Doubao, and Gemini. (2) Base LLMs show higher distillation degrees compared to aligned LLMs. By offering a systematic approach to improve the transparency of LLM data distillation, we call for LLMs with more independent development and more transparent technical reports to improve LLMs’ robustness and safety. The code and data are available at https://github.com/Aegis1863/LLMs-Distillation-Quantification.