2025
Modeling Uncertainty in Composed Image Retrieval via Probabilistic Embeddings
Haomiao Tang | Jinpeng Wang | Yuang Peng | GuangHao Meng | Ruisheng Luo | Bin Chen | Long Chen | Yaowei Wang | Shu-Tao Xia
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Composed Image Retrieval (CIR) enables users to search for images using multimodal queries that combine text and reference images. While metric learning methods have shown promise, they rely on deterministic point embeddings that fail to capture the inherent uncertainty in the input data, where user intentions may be imprecisely specified or open to multiple interpretations. We address this challenge by reformulating CIR through our proposed Composed Probabilistic Embedding (CoPE) framework, which represents both queries and targets as Gaussian distributions in latent space rather than fixed points. Through careful design of probabilistic distance metrics and hierarchical learning objectives, CoPE explicitly captures uncertainty at both the instance and feature levels, enabling more flexible, nuanced, and robust matching that can handle polysemy and ambiguity in search intentions. Extensive experiments across multiple benchmarks demonstrate that CoPE effectively quantifies both quality and semantic uncertainty in composed image retrieval, achieving state-of-the-art recall performance. Code: https://github.com/tanghme0w/ACL25-CoPE.
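As a rough illustration of the probabilistic-embedding idea, the sketch below maps features to diagonal Gaussians and ranks candidates by a squared 2-Wasserstein distance; the head architecture, the choice of distance, and the training objectives are illustrative assumptions rather than CoPE's exact formulation.

```python
# Hypothetical sketch: probabilistic embedding heads and a Gaussian distance.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    """Maps a deterministic feature to a diagonal Gaussian (mean, variance)."""
    def __init__(self, in_dim: int, embed_dim: int):
        super().__init__()
        self.mean = nn.Linear(in_dim, embed_dim)
        self.log_var = nn.Linear(in_dim, embed_dim)

    def forward(self, x: torch.Tensor):
        mu = F.normalize(self.mean(x), dim=-1)    # unit-norm mean direction
        var = F.softplus(self.log_var(x)) + 1e-6  # positive per-dimension variance
        return mu, var

def wasserstein2_sq(mu_q, var_q, mu_t, var_t):
    """Squared 2-Wasserstein distance between diagonal Gaussians."""
    mean_term = ((mu_q - mu_t) ** 2).sum(-1)
    var_term = ((var_q.sqrt() - var_t.sqrt()) ** 2).sum(-1)
    return mean_term + var_term

# Toy usage: a fused (image+text) query feature vs. candidate target features.
query_feat = torch.randn(1, 512)    # e.g., from a frozen multimodal encoder
target_feat = torch.randn(8, 512)   # 8 candidate gallery images
head = GaussianHead(512, 256)
mu_q, var_q = head(query_feat)
mu_t, var_t = head(target_feat)
scores = -wasserstein2_sq(mu_q, var_q, mu_t, var_t)  # higher = better match
print(scores.argsort(descending=True))               # ranked candidates
```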
RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution
Jiahui Li | Lin Li | Tai-Wei Chang | Kun Kuang | Long Chen | Jun Zhou | Cheng Yang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Reinforcement learning from human feedback (RLHF) offers a promising approach to aligning large language models (LLMs) with human preferences. Typically, a reward model is trained or supplied to act as a proxy for humans in evaluating generated responses during the reinforcement training phase. However, current reward models operate as sequence-to-one models, allocating a single, sparse, and delayed reward to an entire output sequence. This approach may overlook the significant contributions of individual tokens toward the desired outcome. To this end, we propose a more fine-grained, token-level guidance approach for RL training. Specifically, we introduce RED, a novel REward reDistribution method that evaluates and assigns specific credit to each token using an off-the-shelf reward model. Utilizing these fine-grained rewards enhances the model’s understanding of language nuances, leading to more precise performance improvements. Notably, our method does not require modifying the reward model or introducing additional training steps, thereby incurring minimal computational costs. Experimental results across diverse datasets and tasks demonstrate the superiority of our approach.
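One way to picture reward redistribution is to score prefixes with a frozen sequence-level reward model and credit each token with the marginal change it causes; the prefix-difference rule and the toy reward model below are assumptions for illustration, not necessarily RED's exact redistribution scheme.

```python
# Hypothetical sketch of token-level reward redistribution from a
# sequence-level reward model via prefix-score differences.
from typing import Callable, List

def redistribute_rewards(
    tokens: List[str],
    score_sequence: Callable[[List[str]], float],
) -> List[float]:
    """Assign each token the marginal change in the sequence-level score."""
    rewards, prev = [], score_sequence([])
    for t in range(1, len(tokens) + 1):
        curr = score_sequence(tokens[:t])  # reward model applied to the prefix
        rewards.append(curr - prev)        # credit for adding token t
        prev = curr
    return rewards

# Toy reward model: prefers polite tokens (placeholder for a trained RM).
def toy_rm(prefix: List[str]) -> float:
    return sum(0.5 if w in {"please", "thanks"} else 0.1 for w in prefix)

print(redistribute_rewards(["could", "you", "please", "help"], toy_rm))
# -> [0.1, 0.1, 0.5, 0.1]; the per-token credits telescope back to the
#    full-sequence reward, so no additional training of the RM is needed.
```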
2024
MIND: Multimodal Shopping Intention Distillation from Large Vision-language Models for E-commerce Purchase Understanding
Baixuan Xu | Weiqi Wang | Haochen Shi | Wenxuan Ding | Huihao Jing | Tianqing Fang | Jiaxin Bai | Xin Liu | Changlong Yu | Zheng Li | Chen Luo | Qingyu Yin | Bing Yin | Long Chen | Yangqiu Song
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Improving user experience and providing personalized search results in E-commerce platforms heavily rely on understanding purchase intention. However, existing methods for acquiring large-scale intentions rely on distilling large language models with human annotation for verification. Such an approach tends to generate product-centric intentions, overlook valuable visual information from product images, and incur high costs for scalability. To address these issues, we introduce MIND, a multimodal framework that allows Large Vision-Language Models (LVLMs) to infer purchase intentions from multimodal product metadata and prioritize human-centric ones. Using Amazon Review data, we apply MIND and create a multimodal intention knowledge base, which contains 1,264,441 intentions derived from 126,142 co-buy shopping records across 107,215 products. Extensive human evaluations demonstrate the high plausibility and typicality of our obtained intentions and validate the effectiveness of our distillation framework and filtering mechanism. Further experiments reveal the positive downstream benefits that MIND brings to intention comprehension tasks and highlight the importance of multimodal generation and role-aware filtering. Additionally, MIND shows robustness to different prompts and superior generation quality compared to previous methods.
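A minimal sketch of a distillation-and-filtering loop in the spirit of MIND is given below; the prompt wording, the `query_lvlm` interface, and the role-aware filter are hypothetical placeholders rather than the paper's pipeline.

```python
# Hypothetical sketch: prompt an LVLM with co-buy product metadata to generate
# purchase intentions, then keep human-centric ones. All names are placeholders.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Product:
    title: str
    image_path: str

def distill_intentions(
    pair: Tuple[Product, Product],                     # one co-buy record
    query_lvlm: Callable[[str, List[str]], List[str]], # stand-in LVLM interface
) -> List[str]:
    a, b = pair
    prompt = (
        f"A customer bought '{a.title}' together with '{b.title}'. "
        "List plausible purchase intentions from the customer's perspective."
    )
    candidates = query_lvlm(prompt, [a.image_path, b.image_path])
    # Role-aware filtering: keep intentions about the buyer, not product specs.
    return [c for c in candidates if c.lower().startswith(("to ", "because "))]

# Toy stand-in for an LVLM call.
def fake_lvlm(prompt: str, images: List[str]) -> List[str]:
    return ["To set up a home office", "Made of aluminum alloy"]

pair = (Product("Laptop stand", "a.jpg"), Product("USB hub", "b.jpg"))
print(distill_intentions(pair, fake_lvlm))  # -> ['To set up a home office']
```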
Optimizing Language Models with Fair and Stable Reward Composition in Reinforcement Learning
Jiahui Li | Hanlin Zhang | Fengda Zhang | Tai-Wei Chang | Kun Kuang | Long Chen | Jun Zhou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Reinforcement learning from human feedback (RLHF) and AI-generated feedback (RLAIF) have become prominent techniques that significantly enhance the functionality of pre-trained language models (LMs). These methods harness feedback, sourced either from humans or AI, as direct rewards or to shape reward models that steer LM optimization. Nonetheless, the effective integration of rewards from diverse sources presents a significant challenge due to their disparate characteristics. To address this, recent research has developed algorithms incorporating strategies such as weighting, ranking, and constraining to handle this complexity. Despite these innovations, a bias toward disproportionately high rewards can still skew the reinforcement learning process and negatively impact LM performance. This paper explores a methodology for reward composition that enables simultaneous improvements in LMs across multiple dimensions. Inspired by fairness theory, we introduce a training algorithm that aims to reduce disparity and enhance stability among various rewards. Our method treats the aggregate reward as a dynamic weighted sum of individual rewards, with alternating updates to the weights and model parameters. For efficient and straightforward implementation, we employ an estimation technique rooted in the mirror descent method for weight updates, eliminating the need for gradient computations. The empirical results under various types of rewards across a wide range of scenarios demonstrate the effectiveness of our method.
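The weight update can be pictured as an exponentiated (mirror-descent-style) step on the probability simplex that shifts mass toward lagging rewards, alternating with ordinary policy updates; the normalization and step size in the sketch below are illustrative assumptions, not the paper's exact algorithm.

```python
# Hypothetical sketch of a mirror-descent-style update for the reward weights.
import numpy as np

def update_weights(weights: np.ndarray, avg_rewards: np.ndarray, eta: float = 0.1):
    """One exponentiated-gradient step; stays on the simplex by construction."""
    # Normalize rewards so dimensions with different scales are comparable.
    r = (avg_rewards - avg_rewards.mean()) / (avg_rewards.std() + 1e-8)
    new_w = weights * np.exp(-eta * r)  # upweight lagging reward dimensions
    return new_w / new_w.sum()

def composite_reward(rewards: np.ndarray, weights: np.ndarray) -> float:
    """Aggregate reward used by the RL objective: a dynamic weighted sum."""
    return float(np.dot(weights, rewards))

# Toy alternation: estimate per-dimension rewards, then refresh the weights.
weights = np.ones(3) / 3                 # e.g., helpfulness, harmlessness, style
avg_rewards = np.array([0.9, 0.2, 0.6])  # batch-averaged reward estimates
weights = update_weights(weights, avg_rewards)
print(weights)  # weight shifts toward the lagging second reward
print(composite_reward(np.array([0.8, 0.3, 0.5]), weights))
```

No gradient of the weights is required: the update uses only the observed per-dimension reward estimates, which matches the stated goal of an efficient, gradient-free weight refresh.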
RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter
Meng Cao | Haoran Tang | Jinfa Huang | Peng Jin | Can Zhang | Ruyang Liu | Long Chen | Xiaodan Liang | Li Yuan | Ge Li
Findings of the Association for Computational Linguistics: ACL 2024
Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods adopt image-to-video transfer learning based on large-scale pre-trained vision-language models (e.g., CLIP). However, fully fine-tuning these pre-trained models for TVR incurs prohibitively expensive computation costs. To this end, we propose to conduct efficient text-video Retrieval with a sparse-and-correlated AdaPter (RAP), i.e., fine-tuning the pre-trained model with a few parameterized layers. To accommodate the text-video scenario, we equip our RAP with two indispensable characteristics: temporal sparsity and correlation. Specifically, we propose a low-rank modulation module to refine the per-image features from the frozen CLIP backbone, which accentuates salient frames within the video features while alleviating temporal redundancy. Besides, we introduce an asynchronous self-attention mechanism which first selects the top responsive visual patches and augments the correlation modeling between them with learnable temporal and patch offsets. Extensive experiments on four TVR datasets demonstrate that our RAP achieves superior or comparable performance compared to the fully fine-tuned counterpart and other parameter-efficient fine-tuning methods.
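A minimal sketch of a low-rank adapter that re-weights frozen per-frame features appears below; the rank, gating function, and pooling are illustrative assumptions in the spirit of the temporal-sparsity idea, not RAP's exact modules.

```python
# Hypothetical sketch: a low-rank modulation adapter over frozen per-frame
# features. A small bottleneck scores each frame and re-weights it, leaving
# the CLIP backbone untouched.
import torch
import torch.nn as nn

class LowRankFrameModulator(nn.Module):
    def __init__(self, dim: int = 512, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)  # low-rank projection
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init: uniform gate at the start

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, dim) from a frozen image encoder
        gate = torch.sigmoid(self.up(self.down(frames)).mean(-1, keepdim=True))
        return frames * gate  # learn to emphasize salient frames, damp redundant ones

frames = torch.randn(2, 12, 512)          # 12 frames per video
adapter = LowRankFrameModulator()
video_feat = adapter(frames).mean(dim=1)  # simple temporal pooling for retrieval
print(video_feat.shape)                   # torch.Size([2, 512])
```

Only the adapter's two small matrices are trainable, which is the source of the parameter efficiency relative to full fine-tuning.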
2023
Enhanced Chart Understanding via Visual Language Pre-training on Plot Table Pairs
Mingyang Zhou | Yi Fung | Long Chen | Christopher Thomas | Heng Ji | Shih-Fu Chang
Findings of the Association for Computational Linguistics: ACL 2023
Building cross-modal intelligence that can understand charts and communicate the salient information hidden behind them is an appealing challenge in the vision and language (V+L) community. The capability to uncover the underlying table data of chart figures is key to automatic chart understanding. We introduce ChartT5, a V+L model that learns how to interpret table information from chart images via cross-modal pre-training on plot-table pairs. Specifically, we propose two novel pre-training objectives, Masked Header Prediction (MHP) and Masked Value Prediction (MVP), to equip the model with different skills for interpreting table information. We have conducted extensive experiments on chart question answering and chart summarization to verify the effectiveness of the proposed pre-training strategies. In particular, on the ChartQA benchmark, our ChartT5 outperforms the state-of-the-art non-pretraining methods by over 8%.
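To make the two objectives concrete, the sketch below builds T5-style masked inputs and targets over a linearized table, masking either header cells (MHP) or value cells (MVP); the linearization format and row separator are assumptions for illustration, and the real model additionally conditions on the chart image.

```python
# Hypothetical sketch of MHP/MVP-style masking targets for a T5-style model:
# masked cells become sentinel tokens, and the target reconstructs them.
import random
from typing import List, Tuple

def mask_table(
    headers: List[str],
    values: List[List[str]],
    mask_headers: bool,  # True -> MHP, False -> MVP
    ratio: float = 0.3,
) -> Tuple[str, str]:
    table = [list(headers)] + [list(r) for r in values]
    # Candidate positions: header cells (row 0) or value cells (rows 1..n).
    cand = [(0, c) for c in range(len(headers))] if mask_headers else \
           [(r, c) for r in range(1, len(table)) for c in range(len(headers))]
    picked = set(random.sample(cand, max(1, int(ratio * len(cand)))))
    inp, tgt, sid = [], [], 0
    for r, row in enumerate(table):
        for c, cell in enumerate(row):
            if (r, c) in picked:
                inp.append(f"<extra_id_{sid}>")        # T5 sentinel token
                tgt.append(f"<extra_id_{sid}> {cell}")  # cell to reconstruct
                sid += 1
            else:
                inp.append(cell)
        inp.append("</row>")  # hypothetical row separator
    return " ".join(inp), " ".join(tgt)

random.seed(0)
headers = ["year", "revenue", "profit"]
values = [["2021", "10", "2"], ["2022", "12", "3"]]
print(mask_table(headers, values, mask_headers=True))   # MHP example
print(mask_table(headers, values, mask_headers=False))  # MVP example
```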