Jun Zhang
2026
KNN-SSD: Enabling Dynamic Self-Speculative Decoding via Nearest Neighbor Layer Set Optimization
Mingbo Song | Heming Xia | Jun Zhang | Chak Tou Leong | Qiancheng Xu | Wenjie Li | Sujian Li
Findings of the Association for Computational Linguistics: EACL 2026
Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
Jun Zhang | Yicheng Ji | Feiyang Ren | Yihang Li | Bowen Zeng | Zonghao Chen | Ke Chen | Lidan Shou | Gang Chen | Huan Li
Findings of the Association for Computational Linguistics: ACL 2026
Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime interplay between high-resolution feature extraction, quadratic attention scaling, and memory bandwidth constraints. We present a systematic taxonomy of efficiency techniques structured around the inference lifecycle, consisting of encoding, prefilling, and decoding. Unlike prior reviews focused on isolated optimizations, we analyze the end-to-end pipeline to reveal how upstream decisions dictate downstream bottlenecks, covering compute-bound visual encoding, the intensive prefilling of massive contexts, and the “visual memory wall” in bandwidth-bound decoding. By decoupling the efficiency landscape into the axes of shaping information density, managing long-context attention, and overcoming memory limits, this work provides a structured analysis of how isolated optimizations compose to navigate the trade-off between visual fidelity and system efficiency. The survey concludes by outlining four future frontiers supported by pilot empirical insights, including hybrid compression based on functional unit sensitivity, modality-aware decoding with relaxed verification, progressive state management for streaming continuity, and stage-disaggregated serving through hardware-algorithm co-design. The submitted software contains a snapshot of our literature repository, which is designed to be maintained as a living resource for the community.
HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference
Bowen Zeng | Feiyang Ren | Jun Zhang | Xiaoling Gu | Ke Chen | Lidan Shou | Huan Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal Large Language Models (MLLMs) have advanced unified reasoning over text, images, and videos, but their inference is hindered by the rapid growth of key–value (KV) caches. Each visual input expands into thousands of tokens, causing caches to scale linearly with context length and remain resident in GPU memory throughout decoding, which leads to prohibitive memory overhead and latency even on high-end GPUs. A common solution is to compress caches under a fixed allocated budget at different granularities: token-level uniformly discards less important tokens, layer-level varies retention across layers, and head-level redistributes budgets across heads. Yet these approaches stop at allocation and overlook the heterogeneous behaviors of attention heads that require distinct compression strategies. We propose HybridKV, a hybrid KV cache compression framework that integrates complementary strategies in three stages: heads are first classified into static or dynamic types using text-centric attention; then a top-down budget allocation scheme hierarchically assigns KV budgets; finally, static heads are compressed by text-prior pruning and dynamic heads by chunk-wise retrieval. Experiments on 11 multimodal benchmarks with Qwen2.5-VL-7B show that HybridKV reduces KV cache memory by up to 7.9× and achieves 1.52× faster decoding, with almost no performance drop, and in some cases higher performance, relative to the full-cache MLLM.
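The linear growth of the KV cache with context length can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: the function name `kv_cache_bytes` and the model dimensions are hypothetical placeholders, not the configuration of Qwen2.5-VL-7B and not taken from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for one sequence in a standard decoder-only transformer.

    All dimensions passed in below are hypothetical; bytes_per_value=2 assumes
    16-bit storage (e.g. fp16 or bf16).
    """
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128.
per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=8, head_dim=128)
long_video = kv_cache_bytes(32_000, num_layers=32, num_kv_heads=8, head_dim=128)

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"32k-token context: {long_video / 2**30:.2f} GiB")
```

Because the size is a plain product, doubling the number of cached tokens doubles the memory, which is why a fixed retention budget translates directly into a proportional memory saving.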
See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs
Yicheng Ji | Jun Zhang | Jinpeng Chen | Cong Wang | Lidan Shou | Gang Chen | Huan Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Video Large Language Models (Video-LLMs) excel in video understanding but suffer from high inference latency due to autoregressive generation. Speculative Decoding (SD) mitigates this by applying a draft-and-verify paradigm, yet existing methods are constrained by rigid exact-match rules, severely limiting the acceleration potential. To bridge this gap, we propose LVSpec, the first training-free loosely SD framework tailored for Video-LLMs. Grounded in the insight that generation is governed by sparse visual-relevant anchors (mandating strictness) amidst abundant visual-irrelevant fillers (permitting loose verification), LVSpec employs a lightweight visual-relevant token identification scheme to accurately pinpoint the former. To further maximize acceptance, we augment this with a position-shift tolerant mechanism that effectively salvages positionally mismatched but semantically equivalent tokens. Experiments demonstrate that LVSpec is high-fidelity and rapid: it preserves >99.8% of target performance while accelerating Qwen2.5-VL-32B by 2.70× and LLaVA-OneVision-72B by 2.94×. Notably, it boosts the mean accepted length and speedup ratio by 136% and 35% compared to SOTA training-free SD methods for Video-LLMs. Code is provided in the submitted software.
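To illustrate the difference between exact-match and relaxed verification, here is a minimal sketch. `accept_exact` follows the usual greedy draft-and-verify rule. `accept_loose` is a hypothetical stand-in, not LVSpec's actual rule: it assumes anchor positions are already known and that a filler token is acceptable whenever it appears in the target model's top-k candidates. All names and the toy token IDs are assumptions made for illustration.

```python
def accept_exact(draft, target):
    """Number of draft tokens accepted under greedy exact-match verification.

    target[i] is the target model's greedy token at position i, computed in
    one verification pass over the draft prefix.
    """
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break  # first mismatch ends the accepted prefix
        accepted += 1
    return accepted


def accept_loose(draft, target, target_topk, is_anchor):
    """Hypothetical relaxed rule (an assumption, not the paper's method).

    Anchor positions require an exact match; filler positions only require
    the draft token to be among the target's top-k candidates.
    """
    accepted = 0
    for d, t, topk, anchor in zip(draft, target, target_topk, is_anchor):
        ok = (d == t) if anchor else (d in topk)
        if not ok:
            break
        accepted += 1
    return accepted


# Toy example with made-up token IDs.
draft = [5, 9, 3, 7]
target = [5, 8, 3, 7]
target_topk = [{5, 1}, {8, 9}, {3, 2}, {7, 4}]
is_anchor = [True, False, False, True]

n_exact = accept_exact(draft, target)
n_loose = accept_loose(draft, target, target_topk, is_anchor)
print(n_exact, n_loose)
```

In this toy case the second draft token differs from the target's greedy choice, so exact matching stops after one token, while the relaxed rule accepts all four because the mismatch falls on a filler position.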
2025
SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
Yicheng Ji | Jun Zhang | Heming Xia | Jinpeng Chen | Lidan Shou | Gang Chen | Huan Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model’s speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens to enable efficient speculation without sacrificing accuracy. To achieve this, we perform a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68× decoding speedup for LLaVA-OneVision-72B and 2.11× speedup for Qwen2.5-VL-32B. Code is available at https://github.com/zju-jiyicheng/SpecVLM.
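The two-stage idea can be sketched in a few lines. This is not the released SpecVLM code: the function `two_stage_prune`, the `stage1_share` parameter, and the use of evenly spaced 1D indices as a stand-in for spatially uniform selection are all assumptions made for illustration.

```python
def two_stage_prune(attn_scores, keep_ratio=0.1, stage1_share=0.5):
    """Return sorted indices of video tokens to keep.

    attn_scores: one verifier attention score per video token (assumed given).
    keep_ratio: fraction of tokens retained overall (0.1 keeps 10%).
    stage1_share: hypothetical fraction of the budget spent in Stage I.
    """
    n = len(attn_scores)
    if n == 0:
        return []
    budget = max(1, round(n * keep_ratio))
    k1 = min(budget, round(budget * stage1_share))

    # Stage I: keep the k1 tokens with the highest attention scores.
    ranked = sorted(range(n), key=lambda i: attn_scores[i], reverse=True)
    stage1 = set(ranked[:k1])

    # Stage II: fill the rest of the budget with evenly spaced tokens
    # drawn from the positions Stage I did not select.
    rest = [i for i in range(n) if i not in stage1]
    k2 = budget - k1
    stage2 = set()
    if k2 > 0:
        stride = len(rest) / k2
        stage2 = {rest[int(j * stride)] for j in range(k2)}

    return sorted(stage1 | stage2)


# Toy example: 20 tokens, two of which receive high attention.
scores = [0.0] * 20
scores[3], scores[17] = 0.9, 0.8
kept = two_stage_prune(scores, keep_ratio=0.2, stage1_share=0.5)
print(kept)
```

Splitting the budget this way keeps the tokens the verifier attends to most, while the uniform pass preserves coarse coverage of the remaining positions.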