Chaofan Tao
2026
Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation
Hengyuan Zhang | Shiping Yang | Xiao Liang | Chenming Shang | Yuxuan Jiang | Chaofan Tao | Jing Xiong | Hayden Kwok-Hay So | Ruobing Xie | Angel X Chang | Ngai Wong
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Training student models on synthetic data generated by strong teacher models is a promising approach to distilling the capabilities of teachers. However, existing studies reveal that stronger models are not always optimal teachers, suggesting a mismatch between the teacher’s output and the student’s learning ability. To address this issue, we propose PerSyn (Personalized data Synthesis), a novel and efficient approach that customizes synthetic data to align with the learning capabilities of the student model. Specifically, our PerSyn method routes each prompt to its optimal teacher via a query-level router that jointly considers the student models’ learnability and teacher models’ response quality. It successfully transfers the synthesis paradigm from the conventional "Generate then Select" to a more efficient manner, i.e., "Route then Generate", eliminating the need for all teacher models to generate parallel responses across the entire prompt set. Extensive experiments across different model families and scales demonstrate that PerSyn consistently outperforms all baselines on six benchmarks, including instruction tuning and math reasoning settings. Further analysis verifies the effectiveness of PerSyn and offers extra insights to propel future research. Our code is available at https://anonymous.4open.science/r/PerSyn-8D85.
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
Hengyuan Zhang | Zhihao Zhang | Ercong Nie | Mingyang Wang | Zunhai Su | Yiwei Wang | Qianli Wang | Shuzhou Yuan | Xufeng Duan | Qibo Xue | Zeping Yu | Chenming Shang | Xiao Liang | Jing Xiong | Hui Shen | Chaofan Tao | Zhengwu Liu | Senjie Jin | Zhiheng Xi | Dongdong Zhang | Sophia Ananiadou | Tao Gui | Ruobing Xie | Hayden Kwok-Hay So | Hinrich Schuetze | Xuanjing Huang | Qi Zhang | Ngai Wong
Findings of the Association for Computational Linguistics: ACL 2026
Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: "Locate, Steer, and Improve." We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as a practical engineering toolkit for model optimization. The curated paper list of this work is available at https://anonymous.4open.science/r/Act-MI-F068.
2025
Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models
Taiqiang Wu | Chaofan Tao | Jiahao Wang | Runming Yang | Zhe Zhao | Ngai Wong
Proceedings of the 31st International Conference on Computational Linguistics
Kullback-Leibler divergence has been widely used in Knowledge Distillation (KD) to compress Large Language Models (LLMs). Contrary to prior assertions that reverse Kullback-Leibler (RKL) divergence is mode-seeking and thus preferable over the mean-seeking forward Kullback-Leibler (FKL) divergence, this study empirically and theoretically demonstrates that neither mode-seeking nor mean-seeking properties manifest in KD for LLMs. Instead, RKL and FKL are found to share the same optimization objective and both converge after a sufficient number of epochs. However, due to practical constraints, LLMs are seldom trained for such an extensive number of epochs. Meanwhile, we further find that RKL focuses on the tail part of the distributions, while FKL focuses on the head part at the beginning epochs. Consequently, we propose a simple yet effective Adaptive Kullback-Leibler (AKL) divergence method, which adaptively allocates weights to combine FKL and RKL. Metric-based and GPT-4-based evaluations demonstrate that the proposed AKL outperforms the baselines across various tasks and improves the diversity and quality of generated responses.
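For readers unfamiliar with the two divergences named in this abstract, the following is a minimal sketch of forward and reverse KL between a teacher and a student distribution over a toy vocabulary. The distributions, the function names, and the fixed-weight `mixed_kl` helper are illustrative assumptions only; the paper's AKL method chooses its weights adaptively, and that rule is not reproduced here.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    # eps guards against log(0); terms with p_i == 0 contribute nothing.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def mixed_kl(p_teacher, q_student, weight):
    """Hypothetical fixed-weight combination of FKL and RKL (not the paper's adaptive rule)."""
    fkl = kl_divergence(p_teacher, q_student)  # forward KL: KL(teacher || student)
    rkl = kl_divergence(q_student, p_teacher)  # reverse KL: KL(student || teacher)
    return weight * fkl + (1.0 - weight) * rkl

# Toy next-token distributions over a three-word vocabulary (made-up numbers).
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]

fkl = kl_divergence(teacher, student)
rkl = kl_divergence(student, teacher)
mixed = mixed_kl(teacher, student, weight=0.5)
```

Both quantities are non-negative and vanish only when the two distributions coincide, but they generally differ in value because KL divergence is not symmetric.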
MEIT: Multimodal Electrocardiogram Instruction Tuning on Large Language Models for Report Generation
Zhongwei Wan | Che Liu | Xin Wang | Chaofan Tao | Hui Shen | Jing Xiong | Rossella Arcucci | Huaxiu Yao | Mi Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Electrocardiogram (ECG) is the primary non-invasive diagnostic tool for monitoring cardiac conditions and is crucial in assisting clinicians. Recent studies have concentrated on classifying cardiac conditions using ECG data but have overlooked ECG report generation, which is time-consuming and requires clinical expertise. To automate ECG report generation and ensure its versatility, we propose the Multimodal ECG Instruction Tuning (MEIT) framework, the first attempt to tackle ECG report generation with LLMs and multimodal instructions. To facilitate future research, we establish a benchmark to evaluate MEIT with various LLM backbones across two large-scale ECG datasets. Our approach uniquely aligns the representations of the ECG signal and the report, and we conduct extensive experiments to benchmark MEIT with nine open-source LLMs using more than 800,000 ECG reports. MEIT’s results underscore the superior performance of instruction-tuned LLMs, showcasing their proficiency in quality report generation, zero-shot capabilities, resilience to signal perturbation, and alignment with human expert evaluation. These findings emphasize the efficacy of our MEIT framework and its potential for real-world clinical application.
UNComp: Can Matrix Entropy Uncover Sparsity? — A Compressor Design from an Uncertainty-Aware Perspective
Jing Xiong | Jianghan Shen | Fanghua Ye | Chaofan Tao | Zhongwei Wan | Jianqiao Lu | Xun Wu | Chuanyang Zheng | Zhijiang Guo | Min Yang | Lingpeng Kong | Ngai Wong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Deploying large language models (LLMs) for long-context inference remains challenging due to their substantial memory and computational demands. While techniques such as Key-Value (KV) cache compression are designed to reduce memory usage, they often neglect the structured sparsity inherent in the relationship between hidden states and their corresponding KV cache. In this work, we explore the role of uncertainty as a potential indicator of sparsity within LLMs. We propose UNComp, an uncertainty-aware framework that leverages truncated matrix entropy to identify areas of low information content, thereby revealing sparsity patterns that can be used for adaptive compression. Unlike traditional methods that apply uniform compression, UNComp dynamically adjusts its approach to compression, guided by uncertainty measures that reflect the importance of various model components. Our analysis shows that sparsity patterns, when derived from uncertainty estimates, can be exploited to reveal special long-range dependencies, such as retrieval heads and retrieval layers. This perspective not only enhances our understanding of how compression can be optimized but also provides new insights into the inherent sparsity of LLMs during long-context inference. By focusing on uncertainty to analyze the sparsity pattern in detail, UNComp reduces the KV cache size to 4.74% of the original, achieves a 6% prefill speedup, and improves throughput by 6.4× — not only delivering strong lossless compression performance, but also validating the effectiveness of the underlying theoretical tool. Our codes are submitted with the paper.
2023
Structured Pruning for Efficient Generative Pre-trained Language Models
Chaofan Tao | Lu Hou | Haoli Bai | Jiansheng Wei | Xin Jiang | Qun Liu | Ping Luo | Ngai Wong
Findings of the Association for Computational Linguistics: ACL 2023
The increasing sizes of large generative Pre-trained Language Models (PLMs) hinder their deployment in real-world applications. To obtain efficient PLMs, previous studies mostly focus on pruning the attention heads and feed-forward networks (FFNs) of the Transformer. Nevertheless, we find that in generative PLMs, the hidden dimension shared by many other modules (e.g., embedding layer and layer normalization) contains persistent outliers regardless of the network input. This study comprehensively investigates the structured pruning of generative PLMs with all the above compressible components. To identify redundant network structures, we assign learnable masks over compressible components followed by sparse training. Various sizes of PLMs can be flexibly extracted via different thresholds, and are then task-specifically fine-tuned for further improvement. Extensive experiments on language modeling, summarization and machine translation validate the effectiveness of the proposed method. For example, the pruned BART brings 1.51×/6.96× inference speedup on GPU/CPU with 67% size reduction, and can be further combined with quantization for more than 25× compression.
2022
LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling
Dongsheng Chen | Chaofan Tao | Lu Hou | Lifeng Shang | Xin Jiang | Qun Liu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Recent large-scale video-language pre-trained models have shown appealing performance on various downstream tasks. However, the pre-training process is computationally expensive due to the requirement of millions of video-text pairs and the redundant data structure of each video. To mitigate these problems, we propose LiteVL, which adapts a pre-trained image-language model BLIP into a video-text model directly on downstream tasks, without heavy pre-training. To enhance the temporal modeling lacking in the image-language model, we propose to add temporal attention modules in the image encoder of BLIP with dynamic temporal scaling. Besides the model-wise adaptation, we also propose a non-parametric pooling mechanism to adaptively reweight the fine-grained video embedding conditioned on the text. Experimental results on text-video retrieval and video question answering show that the proposed LiteVL even outperforms previous video-language pre-trained models by a clear margin, though without any video-language pre-training.
Compression of Generative Pre-trained Language Models via Quantization
Chaofan Tao | Lu Hou | Wei Zhang | Lifeng Shang | Xin Jiang | Qun Liu | Ping Luo | Ngai Wong
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The increasing size of generative Pre-trained Language Models (PLMs) has greatly increased the demand for model compression. Despite various methods to compress BERT or its variants, there are few attempts to compress generative PLMs, and the underlying difficulty remains unclear. In this paper, we compress generative PLMs by quantization. We find that previous quantization methods fail on generative tasks due to the homogeneous word embeddings caused by reduced capacity and the varied distribution of weights. Correspondingly, we propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules. Empirical results on various tasks show that our proposed method outperforms the state-of-the-art compression methods on generative PLMs by a clear margin. With performance comparable to the full-precision models, we achieve compression rates of 14.4x and 13.4x on GPT-2 and BART, respectively.
Co-authors
- Ngai Wong 6
- Jing Xiong 4
- Lu Hou 3
- Xin Jiang 3
- Qun Liu 3
- Xiao Liang (梁霄) 2
- Lifeng Shang 2
- Chenming Shang 2
- Hui Shen 2
- Hayden Kwok-Hay So 2
- Zhongwei Wan 2
- Ruobing Xie 2
- Hengyuan Zhang 2
- Sophia Ananiadou 1
- Rossella Arcucci 1
- Haoli Bai 1
- Angel X Chang 1
- Dongsheng Chen 1
- Xufeng Duan 1
- Tao Gui 1
- Zhijiang Guo 1
- Xuan-Jing Huang (黄萱菁) 1
- Yuxuan Jiang 1
- Senjie Jin 1
- Lingpeng Kong 1
- Che Liu 1
- Zhengwu Liu 1
- Jianqiao Lu 1
- Ping Luo 1
- Ping Luo 1
- Ercong Nie 1
- Hinrich Schuetze 1
- Jianghan Shen 1
- Zunhai Su 1
- Jiahao Wang 1
- Xin Wang 1
- Mingyang Wang 1
- Yiwei Wang 1
- Qianli Wang 1
- Jiansheng Wei 1
- Taiqiang Wu 1
- Xun Wu 1
- Zhiheng Xi 1
- Qibo Xue 1
- Runming Yang 1
- Min Yang 1
- Shiping Yang 1
- Huaxiu Yao 1
- Fanghua Ye 1
- Zeping Yu 1
- Shuzhou Yuan 1
- Mi Zhang 1
- Wei Zhang 1
- Zhihao Zhang 1
- Dongdong Zhang 1
- Qi Zhang 1
- Zhe Zhao 1
- Chuanyang Zheng 1