Xin Liao
2026
Generative Text-to-Image Retrieval via Hierarchical Identifiers and Semantic Internalization
Jie Huang | Junjie Wang | Xin Liao | Ziyou Jiang | Wenshuo Wang | Shoubin Li | Qing Wang
Findings of the Association for Computational Linguistics: ACL 2026
Generative Retrieval (GR) has emerged as a promising paradigm for text-to-image retrieval, yet it suffers from limited semantic discriminability, alignment bias, and closed-set restrictions. To address these challenges, we propose SIGMA, a novel framework for Semantic Internalization for Generative Multimodal Alignment. SIGMA constructs multi-granularity hierarchical identifiers to ensure unique, semantically consistent image representations. We further introduce a progressive semantic internalization training strategy augmented with semantic soft labels, which captures fine-grained text-image affinities and enables inductive identifier assignment for unseen samples, realizing open-set dynamic indexing capabilities. Experiments on the Flickr30K and MS-COCO datasets demonstrate that SIGMA outperforms state-of-the-art baselines, achieving average Recall@1, Recall@5, and Recall@10 improvements of 10.65%, 8.50%, and 7.00%, respectively.
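For readers unfamiliar with the reported metric, the following is a minimal sketch of how Recall@K is commonly computed for text-to-image retrieval when each query has a single ground-truth image. The function names and the toy data are hypothetical and are not taken from the paper; the actual evaluation protocol used for SIGMA may differ.

```python
# Illustrative sketch only: Recall@K with one ground-truth image per text query.
# Function names and data are assumptions, not the paper's evaluation code.

def recall_at_k(ranked_ids, relevant_id, k):
    """Return 1.0 if the ground-truth image appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def mean_recall_at_k(rankings, targets, k):
    """Average the per-query Recall@K over all queries."""
    hits = [recall_at_k(r, t, k) for r, t in zip(rankings, targets)]
    return sum(hits) / len(hits)


# Toy example: two queries, each with a ranked list of image identifiers.
rankings = [["a", "b", "c"], ["c", "a", "b"]]
targets = ["a", "b"]
for k in (1, 2, 3):
    print(k, mean_recall_at_k(rankings, targets, k))
```

Under this definition, Recall@K can only stay the same or grow as K increases, which is why such results are usually reported at several cutoffs.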
SAGE: Synergistic Adaptive Gating of Experts for Hateful Video Detection
Jie Huang | Xin Liao | Junjie Wang | Mingyang Li | Wenshuo Wang | Ziyou Jiang | Shoubin Li | Qing Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
With the rise of short-video platforms, hate speech has evolved from static text and memes into more covert and aggressive hateful video formats, profoundly impacting social dynamics and public sentiment. Existing detection methods typically rely on multimodal feature fusion, which blurs the distinct boundaries of modality-specific information. This leads to the feature dilution problem, where dominant benign modalities often overwhelm sparse, localized hateful cues. To address this, we propose SAGE (Synergistic Adaptive Gating of Experts), a novel framework that shifts the paradigm from blind feature mixing to decision-level arbitration. Mimicking human cognitive processes, SAGE instantiates disentangled experts to rigorously preserve modality-specific semantics, facilitates global expert deliberation for context-aware refinement, and convenes an instance-level tribunal to dynamically arbitrate the final verdict based on evidentiary salience. Extensive experiments on the HateMM and MultiHateClip benchmarks demonstrate that SAGE significantly outperforms state-of-the-art methods, achieving accuracy gains of 6.37% to 21.23% and macro-F1 score gains of 6.77% to 28.01%.
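The contrast between feature mixing and decision-level arbitration can be illustrated with a small sketch. It assumes each modality expert already outputs a probability that the video is hateful, along with a per-instance salience score, and combines the decisions with softmax weights. This is a generic gating scheme used for illustration only; the names `arbitrate` and `salience_logits` are hypothetical, and this is not SAGE's actual architecture.

```python
# Illustrative sketch only: softmax-gated combination of per-expert decisions.
# This is a generic gating scheme, not the SAGE model described in the abstract.
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def arbitrate(expert_probs, salience_logits):
    """Weight each expert's hateful-probability by its per-instance salience."""
    weights = softmax(salience_logits)
    return sum(w * p for w, p in zip(weights, expert_probs))


# Hypothetical per-modality decisions for one video: text, audio, vision.
expert_probs = [0.9, 0.1, 0.2]
uniform = arbitrate(expert_probs, [0.0, 0.0, 0.0])     # plain average of decisions
text_led = arbitrate(expert_probs, [10.0, 0.0, 0.0])   # text expert dominates
print(uniform, text_led)
```

With equal salience the result is a plain average, so one strong hateful cue is diluted by two benign modalities. Raising the salience of the expert that found the cue lets its decision dominate, which is the intuition behind arbitrating per instance.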