Towards Safer Calls for Everyone: Designing a Benchmark Dataset for Evaluating Voice Phishing Detection Models

Joeun Kang; Gyuri Choi; Chanhyuk Yoon; Yongbin Jeong; Younggyun Hahm; Shea Husband; Hansaem Kim

Abstract

Voice phishing is an evolving form of social engineering crime and requires the continuous advancement of detection technologies. We introduce a benchmark dataset designed to evaluate the practical performance of AI-based voice phishing detection models. The dataset includes diverse voice conversation scenarios and supports four evaluation tasks to assess open-source language models. Experimental results show that while some large-scale models demonstrate stable performance across multiple tasks, accuracy remains low in topic classification and dialogue structure recognition, regardless of model size. These findings highlight the complexity of voice phishing detection, which demands contextual reasoning and dialogue structure understanding beyond simple sentence-level comprehension. The proposed benchmark dataset provides a foundation for more robust evaluation and development of AI systems capable of detecting deceptive voice interactions, contributing to safer and more trustworthy communication environments.

Anthology ID: 2026.lrec-main.585
Volume: Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month: May
Year: 2026
Address: Palma de Mallorca, Spain
Editors: Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue: LREC
Publisher: ELRA Language Resource Association
Pages: 7391–7404
URL: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.585/
Cite (ACL): Joeun Kang, Gyuri Choi, Chanhyuk Yoon, Yongbin Jeong, Younggyun Hahm, Shea Husband, and Hansaem Kim. 2026. Towards Safer Calls for Everyone: Designing a Benchmark Dataset for Evaluating Voice Phishing Detection Models. International Conference on Language Resources and Evaluation, main:7391–7404.
Cite (Informal): Towards Safer Calls for Everyone: Designing a Benchmark Dataset for Evaluating Voice Phishing Detection Models (Kang et al., LREC 2026)
PDF: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.585.pdf
