Shu-Tao Xia

2025

Composed Image Retrieval (CIR) enables users to search for images using multimodal queries that combine text and reference images. While metric learning methods have shown promise, they rely on deterministic point embeddings that fail to capture the inherent uncertainty in the input data, in which user intentions may be imprecisely specified or open to multiple interpretations. We address this challenge by reformulating CIR through our proposed Composed Probabilistic Embedding (CoPE) framework, which represents both queries and targets as Gaussian distributions in latent space rather than fixed points. Through careful design of probabilistic distance metrics and hierarchical learning objectives, CoPE explicitly captures uncertainty at both instance and feature levels, enabling more flexible, nuanced, and robust matching that can handle polysemy and ambiguity in search intentions. Extensive experiments across multiple benchmarks demonstrate that CoPE effectively quantifies both quality and semantic uncertainties within Composed Image Retrieval, achieving state-of-the-art performance on recall rate. Code: https://github.com/tanghme0w/ACL25-CoPE.

pdf bib abs
Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
Kuofeng Gao | Shu-Tao Xia | Ke Xu | Philip Torr | Jindong Gu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Audio-Language Models (LALMs), such as GPT-4o, have recently unlocked audio dialogue capabilities, enabling direct spoken exchanges with humans. The potential of LALMs broadens their applicability across a wide range of practical scenarios supported by audio dialogues. However, given these advancements, a comprehensive benchmark to evaluate the performance of LALMs in the open-ended audio dialogue understanding remains absent currently. To address this gap, we propose an **A**udio **D**ialogue **U**nderstanding **Bench**mark **(ADU-Bench),** which consists of 4 benchmark datasets. They assess the open-ended audio dialogue ability for LALMs in 3 general scenarios, 12 skills, 9 multilingual languages, and 4 categories of ambiguity handling. Notably, *we firstly propose the evaluation of ambiguity handling* in audio dialogues that expresses different intentions beyond the same literal meaning of sentences, *e.g.,* ‘“Really!?”‘ with different intonations. In summary, ADU-Bench includes over 20,000 open-ended audio dialogues for the assessment of LALMs. Through extensive experiments conducted on 16 LALMs, our analysis reveals that existing LALMs struggle with mathematical symbols and formulas, understanding human behavior such as roleplay, comprehending multiple languages, and handling audio dialogue ambiguities from different phonetic elements, such as intonations, pause positions, and homophones. The benchmark is available at https://adu-bench.github.io/.

2018

pdf bib abs
Exploiting Common Characters in Chinese and Japanese to Learn Cross-Lingual Word Embeddings via Matrix Factorization
Jilei Wang | Shiying Luo | Weiyan Shi | Tao Dai | Shu-Tao Xia
Proceedings of the Third Workshop on Representation Learning for NLP

Learning vector space representation of words (i.e., word embeddings) has recently attracted wide research interests, and has been extended to cross-lingual scenario. Currently most cross-lingual word embedding learning models are based on sentence alignment, which inevitably introduces much noise. In this paper, we show in Chinese and Japanese, the acquisition of semantic relation among words can benefit from the large number of common characters shared by both languages; inspired by this unique feature, we design a method named CJC targeting to generate cross-lingual context of words. We combine CJC with GloVe based on matrix factorization, and then propose an integrated model named CJ-Glo. Taking two sentence-aligned models and CJ-BOC (also exploits common characters but is based on CBOW) as baseline algorithms, we compare them with CJ-Glo on a series of NLP tasks including cross-lingual synonym, word analogy and sentence alignment. The result indicates CJ-Glo achieves the best performance among these methods, and is more stable in cross-lingual tasks; moreover, compared with CJ-BOC, CJ-Glo is less sensitive to the alteration of parameters.

Co-authors

Ke Xu 1

Venues

acl2
repl4nlp1

Fix author