2025
MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation
Chenghao Yang | Yinbo Luo | Zhoufutu Wen | Qi Chu | Tao Gong | Longxiang Liu | Kaiyuan Zhang | Jianpeng Jiao | Ge Zhang | Wenhao Huang | Nenghai Yu
Findings of the Association for Computational Linguistics: EMNLP 2025
Large Language Models (LLMs), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, the robustness of LLMs, especially in handling long, complex dialogue sessions that involve frequent motivation transfer and sophisticated cross-turn dependencies, has long been criticized, yet no existing benchmark fully reflects these weaknesses. We present MARS-Bench, a Multi-turn Athletic Real-world Scenario Dialogue Benchmark, designed to remedy this gap. MARS-Bench is constructed from play-by-play text commentary so as to feature realistic dialogues specifically designed to evaluate three critical aspects of multi-turn conversations: ultra multi-turn, interactive multi-turn, and cross-turn tasks. Extensive experiments on MARS-Bench reveal that closed-source LLMs significantly outperform open-source alternatives, that explicit reasoning significantly boosts LLMs’ robustness in handling long, complex dialogue sessions, and that LLMs indeed face significant challenges in handling motivation transfer and sophisticated cross-turn dependencies. Moreover, based on attention visualization experiments with Qwen2.5-7B-Instruct, we provide a mechanistic interpretation of how attention sinks caused by special tokens lead to performance degradation on long, complex dialogue sessions.
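A minimal sketch of the kind of attention-visualization probe the abstract describes: extract last-layer attention from Qwen2.5-7B-Instruct and measure how much attention each token receives, to see whether special tokens act as sinks. The model name is real; the toy dialogue, layer choice, and aggregation are illustrative assumptions, not the authors' exact protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Eager attention is required so that attention weights are materialized.
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, attn_implementation="eager"
)

# A toy multi-turn dialogue; real MARS-Bench sessions are far longer.
messages = [
    {"role": "user", "content": "Who scored first in the match?"},
    {"role": "assistant", "content": "The home side scored first."},
    {"role": "user", "content": "And who equalized?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

with torch.no_grad():
    out = model(input_ids, output_attentions=True)

# Total attention each token receives, averaged over heads in the last layer.
# Special tokens (e.g. <|im_start|>) absorbing a large share is the
# "attention sink" effect the paper links to degradation.
attn = out.attentions[-1][0]            # (heads, seq, seq)
received = attn.mean(dim=0).sum(dim=0)  # attention mass per key token
for tok_id, score in zip(input_ids[0], received):
    print(f"{tokenizer.decode(tok_id)!r}\t{score.item():.3f}")
```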
When Allies Turn Foes: Exploring Group Characteristics of LLM-Based Multi-Agent Collaborative Systems Under Adversarial Attacks
Jiahao Zhang | Baoshuo Kan | Tao Gong | Fu Lee Wang | Tianyong Hao
Findings of the Association for Computational Linguistics: EMNLP 2025
This paper investigates the group characteristics of multi-agent collaborative systems under adversarial attacks. Adversarial agents are tasked with generating counterfactual answers to a given collaborative problem, while collaborative agents interact normally with other agents to solve it. To simulate real-world collaboration as closely as possible, we evaluate the collaborative system in three different collaboration scenarios and design three different communication strategies as well as different group structures. Furthermore, we explore several methods to mitigate adversarial attacks, all of which our experiments show to be effective. To quantify the robustness of collaborative systems against such attacks, we introduce a novel metric, the System Defense Index (SDI). Finally, we conduct an in-depth analysis, from the perspective of group dynamics, of how adversarial agents affect multi-agent collaborative systems, revealing similarities between the agent collaboration process and human collaboration. The code will be made available after publication.
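The abstract does not give the SDI formula, so the following is only a hypothetical stand-in for that kind of robustness metric: the fraction of task accuracy a collaborative system retains when adversarial agents are present. The function name and definition are assumptions for illustration.

```python
def system_defense_index(acc_clean: float, acc_attacked: float) -> float:
    """Hypothetical SDI proxy: retained-accuracy ratio, clipped to [0, 1].

    This is NOT the paper's definition, only an illustrative stand-in.
    """
    if acc_clean <= 0.0:
        return 0.0
    return max(0.0, min(1.0, acc_attacked / acc_clean))

# Example: a debate-style group drops from 82% to 61% accuracy under attack.
print(system_defense_index(0.82, 0.61))  # ~0.744
```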
2024
Llama SLayer 8B: Shallow Layers Hold the Key to Knowledge Injection
Tianxiang Chen | Zhentao Tan | Tao Gong | Yue Wu | Qi Chu | Bin Liu | Jieping Ye | Nenghai Yu
Findings of the Association for Computational Linguistics: EMNLP 2024
As a means of augmenting pretrained large language models (LLMs), knowledge injection is critical for developing vertical-domain large models and has been widely studied. While most current approaches, including parameter-efficient fine-tuning (PEFT) and block-expansion methods, apply knowledge uniformly across all LLM layers, this raises the question: are all layers equally crucial for knowledge injection? We evaluate the importance of each layer to locate the optimal layer range for knowledge injection. Intuitively, more important layers should play more critical roles in knowledge injection and deserve denser injection. We observe performance dips on question-answering benchmarks after the removal or expansion of shallow layers, and the degradation shrinks as the layer gets deeper, indicating that the shallow layers hold the key to knowledge injection. This insight leads us to propose the S strategy, a post-pretraining strategy of selectively enhancing shallow layers while pruning the less effective deep ones. Based on this strategy, we introduce Llama SLayer 8B. We experiment on a corpus of code and math and demonstrate the effectiveness of our strategy. Further experiments on a different LLM, Mistral-7B, and a legal corpus confirm the approach’s general applicability, underscoring its wide-ranging efficacy.
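A minimal sketch of the layer-ablation probe the abstract describes: remove one decoder layer at a time and compare downstream QA scores, so that larger drops flag more important layers. The model name and the evaluation stub are assumptions, not the authors' setup.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumption: any LLaMA-style decoder-only model with model.model.layers.
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def evaluate_qa(m) -> float:
    # Plug in a QA benchmark harness here (e.g. lm-evaluation-harness);
    # run with use_cache=False, since ablation shifts layer indices.
    return 0.0  # placeholder score

scores = {}
layers = model.model.layers  # nn.ModuleList of decoder layers
for i in range(len(layers)):
    removed = layers[i]
    del layers[i]                 # ablate layer i
    scores[i] = evaluate_qa(model)
    layers.insert(i, removed)     # restore before the next ablation

# Per the paper's finding, ablating shallow layers (small i) should hurt most.
print(sorted(scores.items(), key=lambda kv: kv[1]))
```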
2019
Similar Minds Post Alike: Assessment of Suicide Risk Using a Hybrid Model
Lushi Chen | Abeer Aldayel | Nikolay Bogoychev | Tao Gong
Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology
This paper describes our system submission for the CLPsych 2019 shared task B on suicide risk assessment. We approached the problem with three separate models: a behavioral model, a language model, and a hybrid model. For the behavioral model, we represented each user’s behaviour and thoughts with four groups of features: posting behaviour, sentiment, motivation, and the content of the user’s posts, and used these features as input to a support vector machine (SVM). For the language model approach, we trained one language model per risk level, using all the posts from users at that level as the training corpus; we then computed the perplexity of each user’s posts to determine how likely they were to belong to each risk level. Finally, we built a hybrid model that combines the language model and the behavioral model, which achieved the best performance in detecting suicide risk level.
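A minimal sketch of the perplexity-based component the abstract describes: train one language model per risk level and assign a user to the level whose model finds their posts least surprising. The add-one-smoothed unigram model is an illustrative simplification of whatever LM the system actually used, and the function names are assumptions.

```python
import math
from collections import Counter

def train_unigram(corpus_tokens):
    """Add-one-smoothed unigram LM over a risk level's training posts."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts)
    return lambda tok: (counts[tok] + 1) / (total + vocab + 1)

def perplexity(lm, tokens):
    """Per-token perplexity of a user's posts under one risk-level LM."""
    log_sum = sum(math.log(lm(t)) for t in tokens)
    return math.exp(-log_sum / max(len(tokens), 1))

def predict_risk(risk_corpora, user_tokens):
    """risk_corpora maps risk level -> tokenized training posts.

    Returns the level whose LM assigns the lowest perplexity; a hybrid
    system would combine this with the SVM's behavioral prediction.
    """
    models = {lvl: train_unigram(toks) for lvl, toks in risk_corpora.items()}
    return min(models, key=lambda lvl: perplexity(models[lvl], user_tokens))
```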