Kai Hu
2026
Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models
Kai Hu | Abhinav Aggarwal | Mehran Khodabandeh | David Zhang | Eric Hsin | Li Chen | Ankit Jain | Matt Fredrikson | Akash Bharadwaj
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper presents a novel Automated Red Teaming (ART) framework that shifts from example-based to policy-based evaluation, addressing critical limitations in scalability and validity. We define harmful content through abstract safety policies rather than specific static examples. We also introduce multiple evaluation objectives (risk coverage, semantic diversity, and fidelity) and discover Pareto trade-offs between them. We propose Jailbreak-Zero, a black-box method capable of both zero-shot generation and fine-tuned exploitation of a victim’s vulnerabilities to achieve Pareto optimality. Unlike prior approaches, it does not require expert-designed strategies or prompts, yet it still achieves superior, human-readable attacks against open-source and proprietary models (attack success rates of 99.5% against GPT-4o and 96.0% against Claude 3.5), even for unseen safety policies. It retains efficacy even after victim models undergo safety alignment, and exposes controls to navigate Pareto trade-offs without retraining. Lastly, we show that Jailbreak-Zero is the best-performing ART method at a given compute budget. Code is available at: https://github.com/hukkai/jailbreak-zero/
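As an illustration of what a Pareto trade-off between several evaluation objectives means, the sketch below filters a set of candidate configurations down to the non-dominated ones. It is not taken from the paper or its repository: the function name `pareto_front`, the score tuples, and the ordering (coverage, diversity, fidelity) are all assumptions made for the example, and higher is assumed to be better on every axis.

```python
from typing import List, Sequence


def pareto_front(points: Sequence[Sequence[float]]) -> List[int]:
    """Return indices of points that no other point dominates (higher is better)."""

    def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
        # a dominates b if it is at least as good everywhere and strictly better somewhere.
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical (coverage, diversity, fidelity) scores for four settings.
configs = [
    (0.90, 0.40, 0.80),
    (0.70, 0.70, 0.75),
    (0.60, 0.60, 0.70),  # worse than the second setting on every axis
    (0.95, 0.30, 0.60),
]

front = pareto_front(configs)
print(front)
```

Settings on the front cannot be improved on one objective without giving something up on another, which is the sense in which a method is called Pareto optimal.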
2025
MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
Yicheng Chen | Yining Li | Kai Hu | Ma Zerun | Haochen Ye | Kai Chen
Findings of the Association for Computational Linguistics: ACL 2025
Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on Wildbench.
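The idea of selecting samples iteratively to maximize information gain can be sketched with a simple greedy loop. This is a heavily simplified illustration, not the MIG algorithm itself: it omits the label graph and any propagation along it, and the square-root information function, the `greedy_select` name, and the toy labels and quality scores are assumptions introduced only for this example.

```python
import math
from typing import Dict, List, Set


def greedy_select(
    samples: Dict[str, Set[str]], quality: Dict[str, float], k: int
) -> List[str]:
    """Pick up to k sample ids, each time taking the largest marginal gain.

    Information per label is assumed to be sqrt(accumulated quality), a concave
    function, so repeatedly covering the same label yields diminishing returns.
    """
    mass: Dict[str, float] = {}  # accumulated quality per label
    remaining = set(samples)
    chosen: List[str] = []

    for _ in range(min(k, len(samples))):
        best_id, best_gain = None, -1.0
        for sid in sorted(remaining):  # sorted for deterministic tie-breaking
            gain = sum(
                math.sqrt(mass.get(label, 0.0) + quality[sid])
                - math.sqrt(mass.get(label, 0.0))
                for label in samples[sid]
            )
            if gain > best_gain:
                best_id, best_gain = sid, gain
        chosen.append(best_id)
        remaining.remove(best_id)
        for label in samples[best_id]:
            mass[label] = mass.get(label, 0.0) + quality[best_id]

    return chosen


# Hypothetical pool: two high-quality "math" samples and one weaker "code" sample.
pool = {"a": {"math"}, "b": {"math"}, "c": {"code"}}
scores = {"a": 1.0, "b": 0.9, "c": 0.5}

picked = greedy_select(pool, scores, k=2)
print(picked)
```

In this toy pool the second pick is the lower-quality "code" sample rather than another "math" sample, which shows how a diminishing-returns objective balances quality against diversity instead of ranking by quality alone.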
2022
The VolcTrans System for WMT22 Multilingual Machine Translation Task
Xian Qian | Kai Hu | Jiaqiang Wang | Yifeng Liu | Xingyuan Pan | Jun Cao | Mingxuan Wang
Proceedings of the Seventh Conference on Machine Translation (WMT)
This report describes our VolcTrans system for the WMT22 shared task on large-scale multilingual machine translation. We participated in the unconstrained track, which allows the use of external resources. Our system is a transformer-based multilingual model trained on data from multiple sources, including the public training set from the data track, NLLB data provided by Meta AI, self-collected parallel corpora, and pseudo bitext from back-translation. Both bilingual and monolingual texts are cleaned by a series of heuristic rules. On the official test set, our system achieves 17.3 BLEU, 21.9 spBLEU, and 41.9 chrF2++ on average over all language pairs. Average inference speed is 11.5 sentences per second on a single Nvidia Tesla V100 GPU.