Qi Chu
2026
When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors
Chenghao Yang | Yuning Zhang | Zhoufutu Wen | Tao Gong | Jiaheng Liu | Qi Chu | Nenghai Yu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Model distillation is a primary driver behind the rapid progress of LLM agents, yet it often leads to behavioral homogenization. Many emerging agents share nearly identical reasoning steps and failure modes, suggesting they may be distilled echoes of a few dominant teachers. Existing metrics, however, fail to distinguish mandatory behaviors required for task success from non-mandatory patterns that reflect a model’s autonomous preferences. We propose two complementary metrics to isolate non-mandatory behavioral patterns: Response Pattern Similarity (RPS) for verbal alignment and Action Graph Similarity (AGS) for tool-use habits modeled as directed graphs. Evaluating 18 models from 8 providers on 𝜏-Bench and 𝜏2-Bench against Claude Sonnet 4.5 (thinking), we find that within-family model pairs score 5.9 pp higher in AGS than cross-family pairs, and that Kimi-K2 (thinking) reaches 82.6% Snode and 94.7% Sdep, exceeding Anthropic’s own Opus 4.1. A controlled distillation experiment further confirms that AGS distinguishes teacher-specific convergence from general improvement. RPS and AGS capture distinct behavioral dimensions (Pearson r = 0.491), providing complementary diagnostic signals for behavioral convergence in the agent ecosystem. Our code is available at https://github.com/Syuchin/AgentEcho.
2025
MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation
Chenghao Yang | Yinbo Luo | Zhoufutu Wen | Qi Chu | Tao Gong | Longxiang Liu | Kaiyuan Zhang | Jianpeng Jiao | Ge Zhang | Wenhao Huang | Nenghai Yu
Findings of the Association for Computational Linguistics: EMNLP 2025
Large Language Models (LLMs), e.g., ChatGPT, have been widely adopted in real-world dialogue applications. However, LLMs’ robustness, especially in handling long complex dialogue sessions that involve frequent motivation transfer and sophisticated cross-turn dependency, has long been criticized. Nevertheless, no existing benchmark fully reflects these weaknesses. We present MARS-Bench, a Multi-turn Athletic Real-world Scenario Dialogue Benchmark, designed to remedy the gap. MARS-Bench is constructed from play-by-play text commentary so as to feature realistic dialogues specifically designed to evaluate three critical aspects of multi-turn conversations: ultra multi-turn, interactive multi-turn, and cross-turn tasks. Extensive experiments on MARS-Bench also reveal that closed-source LLMs significantly outperform open-source alternatives, that explicit reasoning significantly boosts LLMs’ robustness in handling long complex dialogue sessions, and that LLMs indeed face significant challenges when handling motivation transfer and sophisticated cross-turn dependency. Moreover, based on an attention visualization experiment on Qwen2.5-7B-Instruction, we provide a mechanistic interpretation of how attention sinks caused by special tokens lead to LLMs’ performance degradation on long complex dialogue sessions.
2024
Llama SLayer 8B: Shallow Layers Hold the Key to Knowledge Injection
Tianxiang Chen | Zhentao Tan | Tao Gong | Yue Wu | Qi Chu | Bin Liu | Jieping Ye | Nenghai Yu
Findings of the Association for Computational Linguistics: EMNLP 2024
As a means of augmenting pretrained large language models (LLMs), knowledge injection is critical to developing vertical-domain large models and has been widely studied. Most current approaches, including parameter-efficient fine-tuning (PEFT) and block expansion methods, apply knowledge uniformly across all LLM layers, which raises the question: are all layers equally crucial for knowledge injection? We begin by evaluating the importance of each layer to locate the optimal layer range for knowledge injection. Intuitively, more important layers should play more critical roles in knowledge injection and deserve denser injection. We observe performance dips on question-answering benchmarks after the removal or expansion of the shallow layers, and the degradation shrinks as the layer gets deeper, indicating that the shallow layers hold the key to knowledge injection. This insight leads us to propose the S strategy, a post-pretraining strategy of selectively enhancing shallow layers while pruning the less effective deep ones. Based on this strategy, we introduce Llama SLayer 8B. We experimented on a code and math corpus and demonstrated the effectiveness of our strategy. Further experiments with a different LLM, Mistral-7B, and on a legal corpus confirmed the approach’s general applicability, underscoring its wide-ranging efficacy.