Rongxing Lu

2025

ITERATE: Image-Text Enhancement, Retrieval, and Alignment for Transmodal Evolution with LLMs
Chenhan Fu | Guoming Wang | Juncheng Li | Wenqiao Zhang | Rongxing Lu | Siliang Tang
Proceedings of the 31st International Conference on Computational Linguistics

Inspired by human cognitive behavior and enabled by the development of multimodal models, we introduce the visual modality to enhance performance on purely text-based question-answering tasks. However, obtaining corresponding images through manual annotation often entails high costs. An intuitive alternative is to use search engines or web scraping techniques to obtain relevant images automatically. The images obtained this way, however, may be of low quality and may not match the context of the original task, so they may fail to improve, or may even decrease, performance on downstream tasks. In this paper, we propose a novel framework named “ITERATE”, which retrieves images and optimizes their quality to improve the alignment between text and images. Inspired by evolutionary algorithms in reinforcement learning and driven by the synergy of large language models (LLMs) and multimodal models, ITERATE employs a series of strategic actions such as filtering, optimizing, and retrieving to acquire higher-quality images, and repeats this process over multiple generations to enhance the quality of the entire image cluster. Our experimental results on the ScienceQA, ARC-Easy, and OpenDataEval datasets verify the effectiveness of our method, showing improvements of 3.5%, 5%, and 7%, respectively.

Leveraging Large Language Models (LLMs) to build domain-specific conversational agents, especially e-commerce customer service chatbots, is a growing focus. While existing methods enhance dialogue performance by extracting core patterns from dialogue data and integrating them into models, two key challenges persist: (1) heavy reliance on human experts for dialogue strategy induction, and (2) LLM-based automatic extraction often focuses on summarizing specific behaviors, neglecting the underlying thought processes behind strategy selection. In this paper, we present ChatMap, which focuses on enhancing customer service chatbots by mining thought processes using a Multi-Agent aPproach. Specifically, the process begins by extracting customer requests and solutions from a raw dialogue dataset, followed by clustering similar requests, analyzing the thought processes behind solutions, and refining service thoughts. Through a quality inspection and reflection mechanism, the final service thought dataset is generated, helping chatbots provide more appropriate responses. Offline experimental results show that ChatMap performs comparably to manually annotated thought processes and significantly outperforms other baselines, demonstrating its ability to automate human annotation and enhance dialogue capabilities through strategic understanding. Online A/B tests on Taobao, a popular e-commerce platform in China, reveal that ChatMap can better improve customer satisfaction and address customer requests from a business perspective.
