Jason Zhang
2025
The Structural Safety Generalization Problem
Julius Broomfield | Tom Gibbs | George Ingebretsen | Ethan Kosak-Hine | Tia Nasir | Jason Zhang | Reihaneh Iranmanesh | Sara Pieri | Reihaneh Rabbany | Kellin Pelrine
Findings of the Association for Computational Linguistics: ACL 2025
LLM jailbreaks are a widespread safety challenge. Given this problem has not yet been tractable, we suggest targeting a key failure mechanism: the failure of safety to generalize across semantically equivalent inputs. We further focus the target by requiring desirable tractability properties of attacks to study: explainability, transferability between models, and transferability between goals. We perform red-teaming within this framework by uncovering new vulnerabilities to multi-turn, multi-image, and translation-based attacks. These attacks are semantically equivalent by our design to their single-turn, single-image, or untranslated counterparts, enabling systematic comparisons; we show that the different structures yield different safety outcomes. We then demonstrate the potential for this framework to enable new defenses by proposing a Structure Rewriting Guardrail, which converts an input to a structure more conducive to safety assessment. This guardrail significantly improves refusal of harmful inputs, without over-refusing benign ones. Thus, by framing this intermediate challenge—more tractable than universal defenses but essential for long-term safety—we highlight a critical milestone for AI safety research.
2022
Using Deep Mixture-of-Experts to Detect Word Meaning Shift for TempoWiC
Ze Chen | Kangxu Wang | Zijian Cai | Jiewen Zheng | Jiarong He | Max Gao | Jason Zhang
Proceedings of the First Workshop on Ever Evolving NLP (EvoNLP)
This paper describes the dma submission to the TempoWiC task, which achieved a macro-F1 score of 77.05% and took first place in the task. We first explore the impact of different pre-trained language models. We then adopt data cleaning, data augmentation, and adversarial training strategies to improve the model's generalization and robustness. For further improvement, we integrate POS information and word semantic representations using a Mixture-of-Experts (MoE) approach. The experimental results show that MoE can overcome the feature overuse issue and combine the context, POS, and word semantic features well. Additionally, we use a model ensemble method for the final prediction, an approach that many prior works have shown to be effective.
2002
Medstract: creating large-scale information servers from biomedical texts
James Pustejovsky | José Castaño | Roser Saurí | Jason Zhang | Wei Luo
Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain