Mingzi Zuo

2026

The widespread dissemination of toxic content on online platforms poses a critical threat to user experience. Toxicity detection in speech receives significantly less research attention than its text counterpart. Most existing methods rely on high-resource languages and employ a cascaded pipeline combining automatic speech recognition (ASR) and text classifiers. These designs limit robustness in low-resource languages and discard important acoustic cues. To address the lack of datasets, we construct PolySpeechTox, the first toxicity-annotated speech dataset spanning 53 languages and accent varieties, with a focus on low-resource languages and multiple accents. Based on PolySpeechTox, we conduct the first systematic study of toxic speech detection under low-resource, multilingual, and multi-accent conditions. We propose SoftPrompt-TSD, a prompt-based adaptation framework that leverages a frozen audio language model to perform end-to-end toxicity detection without ASR. The decomposed soft-prompt design balances global task alignment, cross-lingual generalization, and language-specific or accent-specific calibration. On PolySpeechTox, SoftPrompt-TSD achieves a micro-averaged ROC-AUC of 98.07%, mitigating the severe failures observed in baseline methods for several languages. In three generalization experiments, SoftPrompt-TSD demonstrates superior generalization capability and maintains robust performance against distribution shifts.

Co-authors

Bo Wang 1

Lei Zhang 1

Venues

Findings 1
