Feng Hu

2026

While Large Vision-Language Models (LVLMs) have demonstrated remarkable proficiency in image captioning, existing research primarily focuses on real-world scenarios, leaving surreal, highly stylized, and semantically hybrid virtual-world scenarios significantly underexplored. In this work, we introduce Game Character Captioning, a novel task designed to evaluate LVLMs’ capability to perceive and describe game characters from virtual worlds. To facilitate evaluation, we establish GC-Bench, a manually annotated benchmark, and propose Graph-F1, a metric for assessing performance on this task. Our evaluation reveals that: (1) current state-of-the-art LVLMs, including closed-source giants, struggle to maintain the high performance seen in real-world scenarios; and (2) a notable gap exists between open-source and closed-source models. To bridge this gap, we construct GC-148K, a large-scale dataset generated via a specialized data pipeline, and develop the G-Cap series. Experiments demonstrate that the G-Cap series rivals the performance of advanced closed-source models at a lower cost, offering an efficient solution for industrial-grade production environments.

2024

Aspect-based sentiment analysis (ABSA) aims to predict the sentiment polarity of a specific aspect within a given sentence. Most existing methods leverage semantic or syntactic information based on attention scores, which are susceptible to interference from irrelevant contexts and often lack sentiment knowledge at a data-specific level. In this paper, we propose a novel Dynamic Multi-granularity Attribution Network (DMAN) from the perspective of attribution. First, we leverage Integrated Gradients to dynamically extract attribution scores for each token, which contain underlying reasoning knowledge for sentiment analysis. Next, we aggregate attribution representations from multiple semantic granularities in natural language, deepening the model's understanding of the semantics. Finally, we integrate attribution scores with syntactic information to capture the relationships between aspects and their relevant contexts more accurately during sentence understanding. Extensive experiments on five benchmark datasets demonstrate the effectiveness of our proposed method.
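The attribution step named in the abstract above, Integrated Gradients, can be illustrated with a minimal sketch. This is not the DMAN implementation: the toy logistic `model`, its weights `w`, the all-zero baseline, and the function names are all assumptions made for illustration, and a real system would attribute over token embeddings of a neural network. The sketch approximates the path integral of the gradient from a baseline to the input with a midpoint Riemann sum.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w):
    # Hypothetical toy model: logistic regression over a feature vector.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def model_grad(x, w):
    # Analytic gradient of the toy model with respect to the input x.
    p = model(x, w)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, steps=200):
    # Average the gradient along the straight path baseline -> x
    # (midpoint rule), then scale by the input difference.
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = model_grad(point, w)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


if __name__ == "__main__":
    w = [2.0, -1.0, 0.5]          # assumed toy weights
    x = [1.0, 1.0, 1.0]           # assumed input features
    baseline = [0.0, 0.0, 0.0]    # assumed all-zero baseline
    print(integrated_gradients(x, baseline, w))
```

A useful sanity check is the completeness property: the attributions should sum approximately to `model(x, w) - model(baseline, w)`.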

The emergence of personalized generation has made it possible to create texts or images that meet the unique needs of users. Recent advances mainly focus on style or scene transfer based on given keywords. However, in e-commerce and recommender systems, exploring users' historical interactions, automatically mining user interests with semantic associations, and creating item representations that closely align with individual user interests remains an almost untouched area. In this paper, we propose a new framework called **I**nterest-**A**ugmented **M**ultimodal **G**enerator (**I-AM-G**). The framework first extracts tags from the multimodal information of items that the user has interacted with, and uses the most frequently occurring tags to rewrite the text description of the item. Then, the framework uses a decoupled text-to-text and image-to-image retriever to search for the top-K similar item text and image embeddings from the item pool. Finally, an attention module for user interests fuses the retrieved information in a cross-modal manner and further guides the personalized generation process in collaboration with the rewritten text. We conducted extensive experiments to demonstrate that our framework can effectively generate results aligned with user preferences, which potentially provides a new paradigm of **Rewrite and Retrieve** for personalized generation.
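The top-K retrieval step mentioned in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the similarity measure, so cosine similarity is assumed here, and `cosine`, `top_k`, and the tiny example pools are hypothetical names and data, not part of I-AM-G. "Decoupled" is modeled simply as running the same search separately over the text pool and the image pool.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors (0.0 if either is zero).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, pool, k):
    # Rank every item in the pool by similarity to the query, keep the best k ids.
    scored = [(cosine(query, emb), item_id) for item_id, emb in pool.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings, one pool per modality.
    text_pool = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
    image_pool = {"a": [0.2, 0.8], "b": [0.7, 0.3], "c": [0.1, 0.9]}
    print(top_k([1.0, 0.0], text_pool, k=2))   # text-to-text retrieval
    print(top_k([0.0, 1.0], image_pool, k=2))  # image-to-image retrieval
```

Keeping the two searches separate means each modality is matched in its own embedding space; the retrieved results would then be fused by a later module.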

Co-authors

Zhi Li 1

Yu Su 1

Venues

EMNLP 2
ACL 1
