Chuan He

2025

HomoGraphAdapter: A Homogeneous Graph Neural Network as an Effective Adapter for Vision-Language Models
Chuan He | Zhuozhao Li | Song Guo | Xiaocheng Lu | Jinxiang Lai
Findings of the Association for Computational Linguistics: EMNLP 2025

Vision-Language Models (VLMs), such as CLIP, have exhibited significant advancements in recognizing visual concepts through natural language guidance. However, adapting these models to downstream tasks remains challenging. Existing adaptation methods either overlook the structural knowledge between the text and image modalities or create overly complex graphs containing redundant information for alignment, leading to suboptimal classification performance and increased computational overhead. This paper proposes a novel adapter-tuning methodology named Homogeneous Graph Adapter (HomoGraphAdapter), which transforms diverse textual and visual descriptions into a unified set of node representations and establishes edges between nodes for inter-modal and cross-modal semantic alignment. We leverage a straightforward homogeneous Graph Neural Network (GNN) to adapt positive and negative classifiers across text and image modalities. The classifiers improve performance on few-shot classification and out-of-distribution (OOD) generalization. Compared with the state-of-the-art approach HeGraphAdapter, HomoGraphAdapter improves classification accuracy by an average of 1.51% for 1-shot and 0.74% for 16-shot on 11 datasets, while also reducing both precomputation time and training time.

2023

Research interest in task-oriented dialogs has increased as systems such as Google Assistant, Alexa, and Siri have become ubiquitous in everyday life. However, the impact of academic research in this area has been limited by the lack of datasets that realistically capture the wide array of user pain points. To enable research on some of the more challenging aspects of parsing realistic conversations, we introduce PRESTO, a public dataset of over 550K contextual multilingual conversations between humans and virtual assistants. PRESTO contains a diverse array of challenges that occur in real-world NLU tasks, such as disfluencies, code-switching, and revisions. It is the only large-scale, human-generated conversational parsing dataset that provides structured context, such as a user’s contacts and lists, for each example. Our mT5-based baselines demonstrate that the conversational phenomena present in PRESTO are challenging to model, and that the difficulty is even more pronounced in a low-resource setup.