Jinfa Huang


2024

RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter
Meng Cao | Haoran Tang | Jinfa Huang | Peng Jin | Can Zhang | Ruyang Liu | Long Chen | Xiaodan Liang | Li Yuan | Ge Li
Findings of the Association for Computational Linguistics: ACL 2024

Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods perform image-to-video transfer learning based on large-scale pre-trained vision-language models (e.g., CLIP). However, fully fine-tuning these pre-trained models for TVR incurs a prohibitively expensive computational cost. To this end, we propose to conduct efficient text-video Retrieval with a sparse-and-correlated AdaPter (RAP), i.e., fine-tuning the pre-trained model with a few parameterized layers. To accommodate the text-video scenario, we equip our RAP with two indispensable characteristics: temporal sparsity and temporal correlation. Specifically, we propose a low-rank modulation module to refine the per-image features from the frozen CLIP backbone, which accentuates salient frames within the video features while alleviating temporal redundancy. In addition, we introduce an asynchronous self-attention mechanism that first selects the top responsive visual patches and augments the correlation modeling among them with learnable temporal and patch offsets. Extensive experiments on four TVR datasets demonstrate that our RAP achieves superior or comparable performance to the fully fine-tuned counterpart and other parameter-efficient fine-tuning methods.
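
A minimal sketch of a low-rank modulation adapter over frozen per-frame CLIP features, in the spirit of the abstract; the module name, feature dimension, and rank below are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical low-rank adapter refining per-frame features from a frozen CLIP backbone.
import torch
import torch.nn as nn

class LowRankModulation(nn.Module):
    """Refines per-frame features through a low-rank bottleneck with a residual path."""
    def __init__(self, dim: int = 512, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank)   # project to low-rank space
        self.up = nn.Linear(rank, dim)     # project back to the feature dimension
        self.act = nn.GELU()

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) features from the frozen backbone
        delta = self.up(self.act(self.down(frame_feats)))
        return frame_feats + delta         # residual modulation keeps the frozen features intact

# Usage: adapt 12 frames of 512-d features with only a few thousand trainable parameters.
feats = torch.randn(2, 12, 512)
adapter = LowRankModulation(dim=512, rank=8)
print(adapter(feats).shape)  # torch.Size([2, 12, 512])
```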

LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
Zhongwei Wan | Ziang Wu | Che Liu | Jinfa Huang | Zhihong Zhu | Peng Jin | Longyue Wang | Li Yuan
Findings of the Association for Computational Linguistics: EMNLP 2024

Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference, as the growth of their multimodal Key-Value (KV) cache with increasing input length challenges memory and time efficiency. Unlike single-modality LLMs that manage only textual contexts, the KV cache of long-context MLLMs includes representations from multiple images with temporal and spatial relationships, along with related textual contexts. The predominance of image tokens means traditional optimizations for LLMs' KV caches are unsuitable for multimodal long-context settings, and no prior work has addressed this challenge. In this work, we introduce LOOK-M, a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size while maintaining performance comparable to a full cache. We observe that during prompt prefill the model prioritizes textual attention over image features, and based on this multimodal interaction observation we explore a new text-prior method to compress the KV cache. Furthermore, to mitigate the degradation of image contextual information, we propose several compensatory strategies based on merging KV pairs. LOOK-M demonstrates that, with a significant reduction in KV cache memory usage (by 80% in some cases), it not only achieves approximately 1.3x faster decoding but also maintains or even enhances performance across a variety of long-context multimodal tasks.
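An illustrative sketch of text-prior KV cache compression with KV-pair merging as compensation; the attention-based scoring, keep ratio, and averaging-based merge rule are simplified assumptions for exposition, not LOOK-M's exact algorithm.

```python
# Simplified single-head KV cache compression: keep text tokens, evict low-attention
# image tokens, and merge each evicted KV pair into its most similar kept key.
import torch

def compress_kv(keys, values, attn_to_tokens, is_text, keep_ratio=0.2):
    """keys/values: (num_tokens, head_dim); attn_to_tokens: accumulated attention per token."""
    num_tokens = keys.size(0)
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Text-prior: text tokens are always kept; image tokens are ranked by attention.
    scores = attn_to_tokens.clone()
    scores[is_text] = float("inf")
    keep_idx = scores.topk(num_keep).indices
    drop_mask = torch.ones(num_tokens, dtype=torch.bool)
    drop_mask[keep_idx] = False

    kept_k, kept_v = keys[keep_idx].clone(), values[keep_idx].clone()
    # Compensation: fold each evicted KV pair into its nearest kept entry (value averaging).
    for i in torch.nonzero(drop_mask).flatten():
        sims = torch.nn.functional.cosine_similarity(keys[i].unsqueeze(0), kept_k, dim=-1)
        j = sims.argmax()
        kept_v[j] = 0.5 * (kept_v[j] + values[i])
    return kept_k, kept_v

# Toy demo: 100 cached tokens (first 20 are text), 64-d head, keep ~20% of the cache.
k, v = torch.randn(100, 64), torch.randn(100, 64)
attn = torch.rand(100)
is_text = torch.zeros(100, dtype=torch.bool); is_text[:20] = True
ck, cv = compress_kv(k, v, attn, is_text)
print(ck.shape, cv.shape)  # torch.Size([20, 64]) torch.Size([20, 64])
```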

2020

Guoym at SemEval-2020 Task 8: Ensemble-based Classification of Visuo-Lingual Metaphor in Memes
Yingmei Guo | Jinfa Huang | Yanlong Dong | Mingxing Xu
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper, we describe the ensemble-based system designed by the guoym team for SemEval-2020 Task 8, Memotion Analysis. In our system, we use five types of data representation as input to base classifiers to extract information from different aspects. We train five base classifiers for each type of representation using five-fold cross-validation. The outputs of these base classifiers are then combined through a data-based ensemble method and a feature-based ensemble method to make full use of all data and all representations. Our method achieves performance within the top 2 ranks on the final Memotion Analysis leaderboard among 36 teams.
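
A rough sketch of the described scheme: per-representation base classifiers trained with five-fold cross-validation, whose out-of-fold scores are then combined. The logistic-regression base model, the random placeholder features, and the stacking meta-classifier are assumptions for illustration, not the system's actual components.

```python
# Hypothetical five-fold cross-validation ensemble over multiple data representations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def oof_predictions(X, y, n_splits=5):
    """Train one base classifier per fold; return out-of-fold probability scores."""
    oof = np.zeros(len(y))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in skf.split(X, y):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        oof[val_idx] = clf.predict_proba(X[val_idx])[:, 1]
    return oof

# Combine out-of-fold scores from several representations (placeholder random features
# stand in for the five representation types) via a simple meta-classifier.
representations = [np.random.randn(200, 32) for _ in range(3)]
y = np.random.randint(0, 2, size=200)
stacked = np.column_stack([oof_predictions(X, y) for X in representations])
meta = LogisticRegression().fit(stacked, y)
print(meta.predict(stacked[:5]))
```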