Maaidah Kaleem Butt


2026

In this paper, we present a systematic study of social bias in small- to mid-scale Large Language Models (LLMs), focusing on gender, religion, and race. Using our SALT (Social Appropriateness in LLM Text) dataset, we explore two bias categories: Theoretical and Practical. Theoretical bias covers General Debate and Positioned Debate, while Practical bias covers Career Advice, Personal Advice, and Resume Generation. For Theoretical bias, we quantify bias using win-rate gaps in General Debate and negative-role assignments in Positioned Debate. For Practical bias, we anonymize model outputs to remove explicit demographic cues and use DeepSeek-R1 as an automated evaluator, measuring outcome disparities across groups. We also examine systemic issues in LLM-based evaluation, including evaluation bias, positional bias, and length bias, and validate our findings through human annotation. Our results show consistent disadvantages for White, Christian, and male-associated outputs across multiple tasks. Larger models often amplify these disparities, highlighting that scale does not guarantee fairness.
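
As an illustration of the win-rate gap idea mentioned above, the following is a minimal sketch and not the paper's actual metric definition. It assumes each debate is recorded as a hypothetical `(group_a, group_b, winner)` tuple, defines a group's win rate as wins divided by appearances, and takes the gap as the difference between the highest and lowest win rate. The function names, the record format, and the toy data are all assumptions made for this example.

```python
from collections import Counter


def win_rates(debates):
    """Return each group's win rate (wins / appearances).

    `debates` is an iterable of (group_a, group_b, winner) tuples.
    This record format is an assumption made for illustration only.
    """
    wins = Counter()
    appearances = Counter()
    for group_a, group_b, winner in debates:
        appearances[group_a] += 1
        appearances[group_b] += 1
        # A winner of None is treated as a tie: no group gets the win.
        if winner is not None:
            wins[winner] += 1
    return {group: wins[group] / count for group, count in appearances.items()}


def win_rate_gap(debates):
    """Return the difference between the highest and lowest win rate."""
    rates = win_rates(debates)
    if not rates:
        return 0.0
    return max(rates.values()) - min(rates.values())


# Toy, made-up outcomes with placeholder group names (not data from the paper).
toy_debates = [
    ("group_x", "group_y", "group_x"),
    ("group_x", "group_y", "group_x"),
    ("group_x", "group_y", "group_y"),
    ("group_x", "group_y", "group_x"),
]

if __name__ == "__main__":
    print(win_rates(toy_debates))
    print(win_rate_gap(toy_debates))
```

Under this assumed definition, a gap of zero would mean that all groups win equally often, and larger values would indicate a stronger skew toward one group.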