Zhiqiang Wang (王智强)

Also published as: 智强王

2025

LOG: A Local-to-Global Optimization Approach for Retrieval-based Explainable Multi-Hop Question Answering
Hao Xu | Yunxiao Zhao | Jiayang Zhang | Zhiqiang Wang | Ru Li
Proceedings of the 31st International Conference on Computational Linguistics

Multi-hop question answering (MHQA) aims to derive the answer from retrieved multi-source intensive documents. However, it is very challenging to model the importance of the retrieved knowledge. Previous approaches primarily emphasize single-step and multi-step iterative decomposition or retrieval, which are susceptible to failure in long-chain reasoning due to the progressive accumulation of erroneous information. To address this problem, we propose a novel Local-tO-Global optimized retrieval method (LOG) to discover more beneficial information, facilitating MHQA. In particular, we design a local information modeling approach based on pointwise conditional v-information to cover usable documents with reasoning knowledge. We also improve the tuplet objective loss, advancing multi-examples-aware global optimization to model the relationship between scattered documents. Extensive experimental results demonstrate that our proposed method outperforms prior state-of-the-art models and can significantly improve multi-hop reasoning, notably for long-chain reasoning.
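
The abstract mentions a tuplet objective loss without giving its form. As a rough illustration only, the sketch below computes the standard (N+1)-tuplet loss for one question, log(1 + Σ exp(s⁻ − s⁺)), over one relevant document and several irrelevant ones. This is the generic loss, not the improved variant the paper proposes; the function name `tuplet_loss` and all scores are hypothetical.

```python
import math


def tuplet_loss(pos_score: float, neg_scores: list[float]) -> float:
    """Generic (N+1)-tuplet loss for one query: log(1 + sum_i exp(s_i^- - s^+)).

    pos_score  -- similarity between the question and a relevant document
    neg_scores -- similarities between the question and irrelevant documents
    """
    return math.log1p(sum(math.exp(neg - pos_score) for neg in neg_scores))


# Hypothetical similarity scores, chosen only for illustration.
negatives = [1.0, 0.5, -2.0]
loss_easy = tuplet_loss(pos_score=5.0, neg_scores=negatives)  # positive well separated
loss_hard = tuplet_loss(pos_score=1.0, neg_scores=negatives)  # positive ties a negative
print(loss_easy < loss_hard)
```

The loss shrinks as the relevant document's score rises above all irrelevant ones at once, which is why a single term can compare one positive against many negatives.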

Understanding the vulnerabilities of Large Vision Language Models (LVLMs) to jailbreak attacks is essential for their responsible real-world deployment. Most previous work requires access to model gradients, or is based on human knowledge (prompt engineering) to complete jailbreak, and they hardly consider the interaction of images and text, resulting in inability to jailbreak in black box scenarios or poor performance. To overcome these limitations, we propose a Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for toxicity maximization, referred to as PBI-Attack. Our method begins by extracting malicious features from a harmful corpus using an alternative LVLM and embedding these features into a benign image as prior information. Subsequently, we enhance these features through bidirectional cross-modal interaction optimization, which iteratively optimizes the bimodal perturbations in an alternating manner through greedy search, aiming to maximize the toxicity of the generated response. The toxicity level is quantified using a well-trained evaluation model. Experiments demonstrate that PBI-Attack outperforms previous state-of-the-art jailbreak methods, achieving an average attack success rate of 92.5% across three open-source LVLMs and around 67.3% on three closed-source LVLMs. Disclaimer: This paper contains potentially disturbing and offensive content.

Gibberish is All You Need for Membership Inference Detection in Contrastive Language-Audio Pretraining
Ruoxi Cheng | Yizhong Ding | Shuirong Cao | Zhiqiang Wang | Shitong Shao
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)

Audio can disclose PII, particularly when combined with related text data. Therefore, it is essential to develop tools to detect privacy leakage in Contrastive Language-Audio Pretraining (CLAP). Existing MIAs need audio as input, risking exposure of voiceprint and requiring costly shadow models. We first propose PRMID, a membership inference detector based on the probability ranking given by CLAP, which does not require training shadow models but still requires both audio and text of the individual as input. To address these limitations, we then propose USMID, a textual unimodal speaker-level membership inference detector, querying the target model using only text data. We randomly generate textual gibberish that is clearly not in the training dataset. Then we extract feature vectors from these texts using the CLAP model and train a set of anomaly detectors on them. During inference, the feature vector of each test text is input into the anomaly detector to determine if the speaker is in the training set (anomalous) or not (normal). If available, USMID can further enhance detection by integrating real audio of the tested speaker. Extensive experiments on various CLAP model architectures and datasets demonstrate that USMID outperforms baseline methods using only text data.
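
The text-only pipeline described above (generate gibberish, extract features, fit an anomaly detector, flag test texts) can be sketched roughly as follows. Everything here is an assumption made for illustration: `embed_text` is a toy letter-frequency stand-in for the CLAP text encoder, and `CentroidDetector` is a single simple distance-based detector rather than the set of detectors the paper trains.

```python
import math
import random
import string


def embed_text(text: str) -> list[float]:
    """Hypothetical stand-in for the CLAP text encoder: letter frequencies."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    total = max(len(letters), 1)
    return [letters.count(c) / total for c in string.ascii_lowercase]


def make_gibberish(rng: random.Random, length: int = 20) -> str:
    """Random letter string, assumed not to appear in any training set."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


class CentroidDetector:
    """Toy anomaly detector: a vector is anomalous if it lies farther from
    the training centroid than every training vector does."""

    def fit(self, vectors: list[list[float]]) -> "CentroidDetector":
        dim = len(vectors[0])
        self.centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
        self.threshold = max(self._distance(v) for v in vectors)
        return self

    def _distance(self, vector: list[float]) -> float:
        return math.dist(vector, self.centroid)

    def is_anomalous(self, vector: list[float]) -> bool:
        return self._distance(vector) > self.threshold


# Fit the detector on features of gibberish only (the "normal" class).
rng = random.Random(0)
gibberish = [make_gibberish(rng) for _ in range(200)]
detector = CentroidDetector().fit([embed_text(g) for g in gibberish])

# A test text is embedded the same way; "anomalous" is read as "likely a member".
print(detector.is_anomalous(embed_text("a" * 20)))
```

With a real encoder, the underlying idea is that text tied to a training speaker would land away from the gibberish features; the toy encoder here only demonstrates the control flow and carries no membership signal.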

2024

AGR: Reinforced Causal Agent-Guided Self-explaining Rationalization
Yunxiao Zhao | Zhiqiang Wang | Xiaoli Li | Jiye Liang | Ru Li
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Most existing rationalization approaches are susceptible to degeneration accumulation due to a lack of effective control over the learning direction of the model during training. To address this issue, we propose a novel approach, AGR (Agent-Guided Rationalization), which guides the next action of the model based on its current training state. Specifically, we introduce causal intervention calculus to quantify the causal effects inherent during rationale training, and utilize a reinforcement learning process to refine their learning bias. Furthermore, we pretrain an agent within this reinforced causal environment to guide the next step of the model. We theoretically demonstrate that a good model needs the desired guidance, and empirically show the effectiveness of our approach, outperforming existing state-of-the-art methods on the BeerAdvocate and HotelReview datasets.

MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension
Ting Liu | Zunnan Xu | Yue Hu | Liangtao Shi | Zhiqiang Wang | Quanjun Yin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Referring Expression Comprehension (REC), which aims to ground a local visual region via natural language, is a task that heavily relies on multimodal alignment. Most existing methods utilize powerful pre-trained models to transfer visual/linguistic knowledge by full fine-tuning. However, fully fine-tuning the entire backbone not only breaks the rich prior knowledge embedded in the pre-training, but also incurs significant computational costs. Motivated by the recent emergence of Parameter-Efficient Transfer Learning (PETL) methods, we aim to solve the REC task in an effective and efficient manner. Directly applying these PETL methods to the REC task is inappropriate, as they lack the domain-specific abilities for precise local visual perception and visual-language alignment. Therefore, we propose a novel framework of Multimodal Prior-guided Parameter Efficient Tuning, namely MaPPER. Specifically, MaPPER comprises Dynamic Prior Adapters guided by an aligned prior, and Local Convolution Adapters to extract precise local semantics for better visual perception. Moreover, the Prior-Guided Text module is proposed to further utilize the prior for facilitating the cross-modal alignment. Experimental results on three widely-used benchmarks demonstrate that MaPPER achieves the best accuracy compared to full fine-tuning and other PETL methods with only 1.41% tunable backbone parameters.

2020

基于Self-Attention的句法感知汉语框架语义角色标注(Syntax-Aware Chinese Frame Semantic Role Labeling Based on Self-Attention)
Xiaohui Wang (王晓晖) | Ru Li (李茹) | Zhiqiang Wang (王智强) | Qinghua Chai (柴清华) | Xiaoqi Han (韩孝奇)
Proceedings of the 19th Chinese National Conference on Computational Linguistics

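Frame Semantic Role Labeling (FSRL) is a semantic analysis task based on the FrameNet annotation scheme. Semantic role labeling usually depends heavily on syntax. Most current semantic role labeling models are based on the bidirectional long short-term memory network (Bi-LSTM), which can capture long-distance dependencies in a sentence but cannot capture its syntactic information well. Therefore, a self-attention mechanism is introduced to capture the syntactic information of each word in the sentence. Experimental results show that the model reaches an F1 of 83.77% on the CFN (Chinese FrameNet) dataset, an improvement of nearly 11%.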
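
For readers unfamiliar with self-attention, the following is a minimal sketch of generic scaled dot-product self-attention, in which each token's output is a weighted mix of all tokens in the sentence. It assumes identity projections (queries, keys, and values are the input vectors themselves) and hypothetical two-dimensional token vectors; it is not the architecture used in the paper.

```python
import math


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: list[list[float]]) -> tuple[list[list[float]], list[list[float]]]:
    """Scaled dot-product self-attention with Q = K = V = x (no learned projections)."""
    d = len(x[0])
    # Pairwise dot products between tokens, scaled by sqrt(d).
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in x] for qi in x]
    # Each row of weights says how much one token attends to every token.
    weights = [softmax(row) for row in scores]
    # Each output vector is the attention-weighted average of all token vectors.
    output = [[sum(w * v[c] for w, v in zip(row, x)) for c in range(d)] for row in weights]
    return output, weights


# Three hypothetical token vectors standing in for word representations.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = self_attention(tokens)
```

Because every token attends to every other token directly, distance within the sentence does not limit which words can influence each other.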