Mirae Kim

2026

FENCE: A Financial and Multimodal Jailbreak Detection Dataset
Mirae Kim | Seonghun Jeong | Youngjun Kwak
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Large Language Models (LLMs) and Vision Language Models (VLMs) are exposed to jailbreak attacks. VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces. However, available resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean–English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE comprises 10k finance-domain text–image pairs across more than 15 finance categories, constructed via a three-step pipeline: transforming real-world financial FAQs into harmful queries using GPT-4o, collecting query-relevant images via keyword-based crawling, and fusing text and images with diverse layout strategies. Labels were assigned using GPT-4o as an evaluator, with human validation confirming 95% agreement. Experiments on 15 commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99% in-distribution accuracy and maintains strong performance on external benchmarks. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and supporting safer AI deployment in sensitive domains. Content Warning: This paper includes example data that may be offensive.

2024

This study introduces a Multidisciplinary chILDhood cancer survivor question-answering (MILD) bot designed to support childhood cancer survivors facing diverse challenges in their survivorship journey. In South Korea, a shortage of experts equipped to address these unique concerns comprehensively leaves survivors with limited access to reliable information. To bridge this gap, our MILD bot employs a dual-component model featuring an intent classifier and a semantic textual similarity model. The intent classifier first analyzes the user’s query to identify the underlying intent and match it with the most suitable expert who can provide advice. Then, the semantic textual similarity model identifies questions in a predefined dataset that closely align with the user’s query, ensuring the delivery of relevant responses. This proposed framework shows significant promise in offering timely, accurate, and high-quality information, effectively addressing a critical need for support among childhood cancer survivors.

Co-authors

Hayoung Oh 1

Chaerim Park 1

Yehwi Park 1

Venues

EMNLP 1
LREC 1
