Yanbo Wang
Papers on this page may belong to the following people: Yanbo Wang, Yanbo Wang
2026
Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models
Zirui Song | Qian Jiang | Mingxuan Cui | Mingzhe Li | Lang Gao | Zeyu Zhang | Zixiang Xu | Yanbo Wang | Guangxian Ouyang | Zhenhao Chen | Xiuying Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Zirui Song | Qian Jiang | Mingxuan Cui | Mingzhe Li | Lang Gao | Zeyu Zhang | Zixiang Xu | Yanbo Wang | Guangxian Ouyang | Zhenhao Chen | Xiuying Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rise of Large Audio-Language Models (LAMs) brings both potential and risks, as their audio outputs may contain harmful or unethical content. However, current research lacks a systematic, quantitative evaluation of LAM safety, especially against jailbreak attacks, which are challenging due to the temporal and semantic nature of speech. To bridge this gap, we introduce AJailBench, the first benchmark specifically designed to evaluate jailbreak vulnerabilities in LAMs. We begin by constructing -Base, a dataset of 1,495 adversarial audio prompts spanning 10 policy-violating categories. Using this dataset, we evaluate several state-of-the-art LAMs and reveal that none exhibit consistent robustness across attacks. To further strengthen jailbreak testing and simulate more realistic attack conditions, we propose a method to generate dynamic adversarial variants. Our Audio Perturbation Toolkit (APT) applies targeted distortions across time, frequency, and amplitude domains. To preserve the original jailbreak intent, we enforce a semantic consistency constraint and employ Bayesian optimization to efficiently search for perturbations that are both subtle and highly effective. This results in AJailBench-APT+, an extended dataset of optimized adversarial audio samples. Our findings demonstrate that even small, semantically preserved perturbations can significantly reduce the safety performance of leading LAMs, underscoring the need for more robust and semantically aware defense mechanisms. We release AJailBench to facilitate future research: https://anonymous.4open.science/r/AudioJailbreak-4262/
RiskLab: A Controlled Toolkit for Probing Emergent Risks in LLM-Based Multi-Agent Systems
Yu Jiang | Wenjie Wang | Yue Huang | Yanbo Wang | Zhenhong Zhou | Xiuying Chen | Yang Liu | Pin-Yu Chen | Wei Wang | Xiangliang Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Yu Jiang | Wenjie Wang | Yue Huang | Yanbo Wang | Zhenhong Zhou | Xiuying Chen | Yang Liu | Pin-Yu Chen | Wei Wang | Xiangliang Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Large language model (LLM) agents increasingly operate in multi-agent settings where failures emerge from interaction dynamics rather than isolated model errors. We introduce RiskLab, an open-source toolkit for instantiating, probing, and measuring emergent risks in LLM-based multi-agent systems under controlled conditions. Each experiment is defined as a structured topology–environment–protocol–agent–task quintuple, enabling reproducible studies of how communication structure, coordination mechanisms, and incentives shape system-level risks. RiskLab provides flexible communication topologies, swappable interaction protocols, trajectory-grounded evaluation, and extensible registries for risk detectors and agent backends. We demonstrate the toolkit across representative risks, including collusion, resource overreach, semantic drift, and strategic misreporting, and support one-file reproducibility via configuration.
2025
TRUSTEVAL: A Dynamic Evaluation Toolkit on Trustworthiness of Generative Foundation Models
Yanbo Wang | Jiayi Ye | Siyuan Wu | Chujie Gao | Yue Huang | Xiuying Chen | Yue Zhao | Xiangliang Zhang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
Yanbo Wang | Jiayi Ye | Siyuan Wu | Chujie Gao | Yue Huang | Xiuying Chen | Yue Zhao | Xiangliang Zhang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
Ensuring the trustworthiness of Generative Foundation Models (GenFMs) is a pressing challenge as they gain widespread use. Existing evaluation toolkits are often limited in scope, dynamism, and flexibility. This paper introduces TRUSTEVAL, a dynamic and comprehensive toolkit designed for evaluating GenFMs across various dimensions. TRUSTEVAL supports both dynamic dataset generation and evaluation, offering advanced features including comprehensiveness, usability, and flexibility. TRUSTEVAL integrates diverse generative models, datasets, evaluation methods, metrics, inference efficiency enhancement, and evaluation report generation. Through case studies, we demonstrate TRUSTEVAL’s potential to advance the trustworthiness evaluation of GenFMs.
Under the Shadow of Babel: How Language Shapes Reasoning in LLMs
Chenxi Wang | Yixuan Zhang | Lang Gao | Zixiang Xu | Zirui Song | Yanbo Wang | Xiuying Chen
Findings of the Association for Computational Linguistics: EMNLP 2025
Chenxi Wang | Yixuan Zhang | Lang Gao | Zixiang Xu | Zirui Song | Yanbo Wang | Xiuying Chen
Findings of the Association for Computational Linguistics: EMNLP 2025
Language is not only a tool for communication but also a medium for human cognition and reasoning. If, as linguistic relativity suggests, the structure of language shapes cognitive patterns, then large language models (LLMs) trained on human language may also internalize the habitual logical structures embedded in different languages. To examine this hypothesis, we introduce BICAUSE, a structured bilingual dataset for causal reasoning, which includes semantically aligned Chinese and English samples in both forward and reversed causal forms. Our study reveals three key findings: (1) LLMs exhibit typologically aligned attention patterns, focusing more on causes and sentence-initial connectives in Chinese, while showing a more balanced distribution in English. (2) Models internalize language-specific preferences for causal components order and often rigidly apply them to atypical inputs, leading to degraded performance, especially in Chinese. (3) When causal reasoning succeeds, model representations converge toward semantically aligned abstractions across languages, indicating a shared understanding beyond surface form. Overall, these results suggest that LLMs not only mimic surface linguistic forms but also internalize the reasoning biases shaped by language. Rooted in cognitive linguistic theory, this phenomenon is for the first time empirically verified through structural analysis of model internals.
Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models
Zixiang Xu | Yanbo Wang | Yue Huang | Xiuying Chen | Jieyu Zhao | Meng Jiang | Xiangliang Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Zixiang Xu | Yanbo Wang | Yue Huang | Xiuying Chen | Jieyu Zhao | Meng Jiang | Xiangliang Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have achieved remarkable success in Natural Language Processing (NLP), yet their cross-lingual consistency remains a significant challenge. This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in LLMs. Our approach leverages beam search and LLM-based simulation to generate bilingual question pairs that expose performance discrepancies between English and target languages. We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. The extensive experiments demonstrate that our method precisely and cost-effectively pinpoints cross-lingual weaknesses, consistently revealing over 50% accuracy drops in target languages across a wide range of models. Moreover, further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns and benefit from targeted post-training. Code is available at https://github.com/xzx34/Cross-Lingual-Pitfalls.