Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet

Berk Atil, Vipul Gupta, Sarkar Snigdha Sarathi Das, Rebecca Passonneau


Abstract
Large language models (LLMs) have become ubiquitous, so it is important to understand their risks and limitations, such as their propensity to generate harmful output. This includes smaller LLMs, which matter for settings with constrained compute resources, such as edge devices. Detecting LLM harm typically requires human annotation, which is expensive to collect. This work studies two questions: How do smaller LLMs rank with respect to generating harmful content? How well can larger LLMs annotate harmfulness? We prompt three small LLMs to elicit harmful content of various types, such as discriminatory language, offensive content, privacy invasion, or negative influence, and collect human rankings of their outputs. We then compare harm annotations from three state-of-the-art large LLMs with each other and with the human rankings. We find that the smaller models differ with respect to harmfulness, and that the large LLMs show only low to moderate agreement with humans.
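
The agreement question can be made concrete with a rank correlation. Below is a minimal sketch, assuming Kendall's tau over per-output harm ranks; the metric choice and all rank values are illustrative assumptions, not the paper's actual data or protocol.

# Minimal sketch (hypothetical data, not the paper's): compare a large-LLM
# judge's harm ranking of small-model generations against a human ranking
# using Kendall's tau rank correlation.
from scipy.stats import kendalltau

# Harm ranks for the same eight generations (1 = most harmful).
# Both lists are illustrative placeholders.
human_rank = [1, 2, 3, 4, 5, 6, 7, 8]
llm_rank = [2, 1, 3, 5, 4, 8, 6, 7]

tau, p_value = kendalltau(human_rank, llm_rank)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
# A tau well below 1.0 corresponds to the paper's finding of only
# low-to-moderate agreement between LLM judges and humans.

A rank correlation like this treats agreement holistically across all outputs; per-category agreement (e.g., on privacy invasion vs. offensive content) would need a separate comparison for each harm type.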
Anthology ID:
2025.woah-1.30
Volume:
Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH)
Month:
August
Year:
2025
Address:
Vienna, Austria
Editors:
Agostina Calabrese, Christine de Kock, Debora Nozza, Flor Miriam Plaza-del-Arco, Zeerak Talat, Francielle Vargas
Venues:
WOAH | WS
Publisher:
Association for Computational Linguistics
Pages:
342–354
URL:
https://preview.aclanthology.org/landing_page/2025.woah-1.30/
Cite (ACL):
Berk Atil, Vipul Gupta, Sarkar Snigdha Sarathi Das, and Rebecca Passonneau. 2025. Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet. In Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH), pages 342–354, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet (Atil et al., WOAH 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.woah-1.30.pdf