FinHarmBench: Financial Jailbreak Benchmark and Unsupervised Safety Fine-Tuning via Refusal Steering Distillation

Yubin Choi, Yujin Yang, Subin Kim, Seokil Ham, Seungju Cho, Jungmin Son, Youngjun Kwak, Changick Kim


Abstract
Financial Large Language Models (LLMs) exhibit strong domain expertise but remain vulnerable to financially harmful prompts. To systematically assess this vulnerability, we introduce FinHarmBench, a benchmark that evaluates model responses to financially harmful prompts and to benign prompts that are easily confused with them. Our analysis reveals a concerning result: financial LLMs can be less robust than general-purpose models, suggesting that domain adaptation alone does not guarantee financial safety alignment. To address this issue, we propose Financial Refusal Steering Distillation (FiRSD), an unsupervised training framework that strengthens financial-domain safety by learning a financial refusal direction at the representation level and distilling it into the model. FiRSD enhances refusal behavior without requiring annotated refusal responses. Experiments show that FiRSD substantially improves safety while largely preserving task capability. These results highlight the importance of domain-aware safety alignment for high-stakes financial applications.
Anthology ID:
2026.acl-industry.117
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Yunyao Li, Georg Rehm, Mei Tu
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1714–1726
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-industry.117/
Cite (ACL):
Yubin Choi, Yujin Yang, Subin Kim, Seokil Ham, Seungju Cho, Jungmin Son, Youngjun Kwak, and Changick Kim. 2026. FinHarmBench: Financial Jailbreak Benchmark and Unsupervised Safety Fine-Tuning via Refusal Steering Distillation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 1714–1726, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
FinHarmBench: Financial Jailbreak Benchmark and Unsupervised Safety Fine-Tuning via Refusal Steering Distillation (Choi et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-industry.117.pdf