SynBullying: A Multi-LLM Synthetic Conversational Dataset for Cyberbullying Detection

Arefeh Kazemi; Hamza Qadeer; Joachim Wagner; Hossein Hosseini; Sri Balaaji Natarajan Kalaivendan; Brian Davis

Abstract

We introduce SynBullying, a synthetic multi-LLM conversational dataset for studying and detecting cyberbullying (CB). SynBullying provides a scalable and ethically safe alternative to human data collection by leveraging large language models (LLMs) to simulate realistic bullying interactions. The dataset offers (i) conversational structure, capturing multi-turn exchanges rather than isolated posts; (ii) context-aware annotations, where harmfulness is assessed within the conversational flow, considering context, intent, and discourse dynamics; and (iii) fine-grained labeling, covering various CB categories for detailed linguistic and behavioral analysis. We evaluate SynBullying across several dimensions, including conversational structure, lexical patterns, sentiment/toxicity, role dynamics, harm intensity, and CB-type distribution. We further examine its utility by testing its performance as standalone training data and as an augmentation source for CB classification.

Anthology ID: 2026.lrec-main.578
Volume: Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month: May
Year: 2026
Address: Palma de Mallorca, Spain
Editors: Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue: LREC
Publisher: ELRA Language Resource Association
Pages: 7292–7306
URL: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.578/
Cite (ACL): Arefeh Kazemi, Hamza Qadeer, Joachim Wagner, Hossein Hosseini, Sri Balaaji Natarajan Kalaivendan, and Brian Davis. 2026. SynBullying: A Multi-LLM Synthetic Conversational Dataset for Cyberbullying Detection. International Conference on Language Resources and Evaluation, main:7292–7306.
Cite (Informal): SynBullying: A Multi-LLM Synthetic Conversational Dataset for Cyberbullying Detection (Kazemi et al., LREC 2026)
PDF: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.578.pdf
