Interpretability of LLM Classifiers via the Rational Inattention Theory with Application to Hate Speech Detection

Yuan Zhao, Ali Abdi


Abstract
Hate speech detection is essential for maintaining healthy online communities. Large language models (LLMs) perform well on text classification, yet their decision strategies remain poorly understood. Post-hoc rationales can justify individual decisions, but they substantially increase inference cost and limit scalability in high-throughput settings. As an alternative, we propose an extended rational inattention model that parameterizes linguistic noise and information processing cost, providing an interpretable behavioral framework for black-box LLM classifiers. Treating LLMs as rational decision-makers under information constraints allows us to estimate, from the observed classification behavior, the parameters that represent information processing cost and noise sensitivity. In a case study on a hate-speech dataset spanning multiple noise environments, we evaluate four commercial LLMs and show that the predictions of the extended rational inattention model closely match the observed performance across noise levels. We further test performance under various noise mechanisms and find that the inferred information cost parameters remain consistent, while the noise parameters vary with the distortion mechanism. Overall, our framework offers a cost-efficient, quantitative way to derive interpretable indices of LLM moderation behavior and decisions without additional rationale generation.
Anthology ID: 2026.acl-srw.23
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 281–289
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-srw.23/
Cite (ACL): Yuan Zhao and Ali Abdi. 2026. Interpretability of LLM Classifiers via the Rational Inattention Theory with Application to Hate Speech Detection. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 281–289, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Interpretability of LLM Classifiers via the Rational Inattention Theory with Application to Hate Speech Detection (Zhao & Abdi, ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-srw.23.pdf