Shichen Lu


Fixing paper assignments

  1. Please select all papers that belong to the same person.
  2. Indicate below which author they should be assigned to.
Provide a valid ORCID iD here. This will be used to match future papers to this author.
Provide the name of the school or the university where the author has received or will receive their highest degree (e.g., Ph.D. institution for researchers, or current affiliation for students). This will be used to form the new author page ID, if needed.

TODO: "submit" and "cancel" buttons here


2025

pdf bib
ViPE: Visual Perception in Parameter Space for Efficient Video-Language Understanding
Shichen Lu | Tongtian Yue | Longteng Guo | Handong Li | Xingjian He | Si Liu | Jing Liu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Existing video-language models (Video-LLMs) typically rely on concatenating visual tokens with textual inputs for joint modeling. However, this token-level alignment leads to significant inefficiency, especially when scaling to long videos with dense visual inputs. In this work, we propose a video-to-parameter efficiency paradigm named ViPE that eliminates redundant visual tokens by transforming video content into visual perceptual weights, which are directly injected into the LLM’s parameters. ViPE consists of a visual injection module that compresses video features into a small set of perceptual queries using a hierarchical merge strategy, and a visual perception module that integrates the resulting representations into the LLM through a lightweight LoRA-like mechanism. ViPE achieves performance comparable to token-based baselines such as LLaVA, while reducing FLOPs by 85% and inference time by up to 65%, demonstrating a highly efficient and scalable solution for video understanding.