Hao Zhou
Other people with similar names: Hao Zhou, Hao Zhou, Hao Zhou
Unverified author pages with similar names: Hao Zhou
2026
APB-V: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention
Yuxiang Huang | Mingye Li | Xu Han | Chaojun Xiao | Weilin Zhao | Ao Sun | Ziqi Yuan | Hao Zhou | Fandong Meng | Zhiyuan Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yuxiang Huang | Mingye Li | Xu Han | Chaojun Xiao | Weilin Zhao | Ao Sun | Ziqi Yuan | Hao Zhou | Fandong Meng | Zhiyuan Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, yielding limited acceleration or degraded performance and restricting LMMs from handling longer, more complex videos. To overcome these issues, we propose APB-V, a sequence-parallel framework with optimized attention that accelerates long-video inference across multiple GPUs. By distributing approximate attention, APB-V reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression and thereby improving task performance. System-level optimizations, such as load balancing and fused forward passes, further unleash the potential of APB-V, delivering speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss.
Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters
Zhiyu Xu | Lean Wang | Yuanxin Liu | Lei Li | Hao Zhou | Fandong Meng | Jie Zhou | Xu Sun
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Zhiyu Xu | Lean Wang | Yuanxin Liu | Lei Li | Hao Zhou | Fandong Meng | Jie Zhou | Xu Sun
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Vision-Language Models (VLMs) have demonstrated remarkable proficiency in general multi-modal understanding; yet they struggle to efficiently acquire continually evolving domain-specific skills. Conventional approaches to enhancing VLM capabilities, such as Supervised Fine-Tuning (SFT), require extensive dataset curation and substantial computational resources. Model merging has emerged as an efficient alternative that enables the transfer of domain-specific expertise from Large Language Models (LLMs) to VLMs without incurring additional training data requirements or significant computational overhead. Unlike conventional merging of homogeneous LLMs, which mainly aggregates existing capabilities, cross-modal skill injection aims to induce emergent cross-modal capabilities by integrating a domain-expert LLM into a VLM. However, existing research lacks a systematic analysis of the applicability and methodology of cross-modal skill injection. In this study, we investigate cross-modal skill injection across three main aspects: scenarios, methods, and hyperparameters. For scenarios, we find that cross-modal skill injection generally performs well in instruction-following and cross-lingual settings, yet struggles with mathematical reasoning. For methods, we find that classic approaches such as TA and DARE consistently achieve superior performance over alternative merging methods. We also provide a systematic and quantitative analysis of the hyperparameter tuning that these classic methods critically depend on.