Jiaxin Ye

2026

Although vision-language pre-trained (VLP) models have achieved remarkable success across multimodal tasks, they remain vulnerable to adversarial perturbations. Existing universal adversarial perturbation (UAP) methods in multimodal settings, whether generator-based or optimization-based, often suffer from limited cross-model transferability, especially in black-box scenarios. We attribute this limitation to the prevalent use of symmetric or distribution-level objectives that overlook the asymmetric roles of image and text modalities and the relational nature of vision-language representations. To address this issue, we propose ARG-Attack, an optimization-based framework that learns universal perturbations under an asymmetric, relational-geometry-driven objective. Our method integrates three complementary components: a cosine-based loss that induces directional semantic drift in visual features, a center shift loss that geometrically regularizes adversarial embeddings toward a shared semantic center, and a relational polarity loss that explicitly disrupts image–text matching relationships. Together, these objectives enable effective cross-modal interaction without relying on model-specific training losses or probabilistic distribution matching. In addition, we adopt an adaptive gradient update strategy inspired by Adam optimization to stabilize training and accelerate convergence. Extensive experiments across multiple vision-language models and tasks demonstrate that ARG-Attack achieves competitive white-box performance and significantly outperforms state-of-the-art methods in black-box transfer settings.

2024


We propose emotion2vec, a universal speech emotion representation model. emotion2vec is pre-trained on open-source unlabeled emotion data through self-supervised online distillation, combining utterance-level loss and frame-level loss during pre-training. emotion2vec outperforms state-of-the-art pre-trained universal models and emotion specialist models by training only linear layers for the speech emotion recognition task on the mainstream IEMOCAP dataset. In addition, emotion2vec shows consistent improvements on speech emotion recognition datasets across 10 different languages. emotion2vec also shows excellent results on other emotion tasks, such as song emotion recognition, emotion prediction in conversation, and sentiment analysis. Comparison experiments, ablation experiments, and visualization comprehensively demonstrate the universal capability of the proposed emotion2vec. To the best of our knowledge, emotion2vec is the first universal representation model for various emotion-related tasks, filling a gap in the field.


Venues

Findings (2)
