Zohaib Khan

2026

In this paper, we present a systematic study of social bias in small- to mid-scale Large Language Models (LLMs), focusing on gender, religion, and race. Using our SALT (Social Appropriateness in LLM Text) dataset, we explore two bias categories: Theoretical and Practical. Theoretical bias covers General Debate and Positioned Debate, while Practical bias covers Career Advice, Personal Advice, and Resume Generation. For Theoretical bias, we quantify bias using win-rate gaps in General Debate and negative-role assignments in Positioned Debate. For Practical bias, we anonymize model outputs to remove explicit demographic cues and use DeepSeek-R1 as an automated evaluator, measuring outcome disparities across groups. We also examine systemic issues in LLM-based evaluation, including evaluation bias, positional bias, and length bias, and validate our findings through human annotation. Our results show consistent disadvantages for White, Christian, and male-associated outputs across multiple tasks. Larger models often amplify these disparities, highlighting that scale does not guarantee fairness.
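As a rough illustration of a win-rate gap, the sketch below computes per-group win rates from debate outcomes and reports the spread between the best- and worst-performing groups. The abstract does not give the exact formula, so the max-minus-min definition, the `(group, won)` record layout, and the function names are assumptions made for illustration only.

```python
from collections import defaultdict


def win_rates(records):
    """Return {group: fraction of debates won} from (group, won) pairs.

    `records` is a hypothetical layout: one pair per debate appearance,
    where `won` is True if that group's side was judged the winner.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for group, won in records:
        totals[group] += 1
        if won:
            wins[group] += 1
    return {group: wins[group] / totals[group] for group in totals}


def win_rate_gap(records):
    """Assumed gap definition: highest win rate minus lowest win rate."""
    rates = win_rates(records)
    return max(rates.values()) - min(rates.values())


# Toy data, not taken from the paper.
example = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]
```

A gap of zero would mean every group wins equally often under this definition; larger values indicate a larger disparity.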


Plasticity vs. Rigidity: The Impact of Low-Rank Adapters on Reasoning on a Micro-Budget
Zohaib Khan | Omer Tafveez | Zoha Hayat Bhatti
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Recent advances in mathematical reasoning typically rely on massive scale, yet the question remains: can strong reasoning capabilities be induced in small language models (≤1.5B parameters) under extreme constraints? We investigate this by training models on a single A40 GPU (48GB) for under 24 hours using Reinforcement Learning with Verifiable Rewards (RLVR) and Low-Rank Adaptation (LoRA). We find that the success of this “micro-budget” regime depends critically on the interplay between adapter capacity and model initialization. While low-rank adapters (r=8) consistently fail to capture the complex optimization dynamics of reasoning, high-rank adapters (r=256) unlock significant plasticity in standard instruction-tuned models. Our best result achieved 40.0% Pass@1 on AIME 24 (an 11.1% absolute improvement over baseline) and pushed Pass@16 to 70.0%, demonstrating robust exploration capabilities. However, this plasticity is not universal: while instruction-tuned models used the budget to lengthen their chain-of-thought and maximize reward, heavily math-aligned models suffered performance collapse, suggesting that noisy, low-budget RL updates can act as destructive interference for models already residing near a task-specific optimum.
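For readers unfamiliar with Pass@k, the sketch below shows one widely used way to estimate it: the unbiased combinatorial estimator, which gives the probability that at least one of k samples drawn from n generations is correct when c of the n are correct. The abstract does not say which estimator the authors used, so this is an assumption, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: total generations sampled for the problem
    c: number of those generations that are correct
    k: sample budget being evaluated (k <= n)
    """
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Fewer than k incorrect generations: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average Pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k=1 the estimator reduces to the fraction of correct generations, c/n, which is why Pass@1 is often read as average single-sample accuracy.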


Misinformation is on the rise, and the strong writing capabilities of LLMs lower the barrier for malicious actors to produce and disseminate false information. We study how LLMs behave when prompted to spread misinformation across languages and target countries, and introduce GlobalLies, a multilingual parallel dataset of 440 misinformation generation prompt templates and 6,867 entities, spanning 8 languages and 195 countries. Using both human annotations and large-scale LLM-as-a-judge evaluations across hundreds of thousands of generations from state-of-the-art models, we show that misinformation generation varies systematically based on the country being discussed. Propagation of lies by LLMs is substantially higher in many lower-resource languages and for countries with a lower Human Development Index (HDI). We find that existing mitigation strategies provide uneven protection: input safety classifiers exhibit cross-lingual gaps, and retrieval-augmented fact-checking remains inconsistent across regions due to unequal information availability. We release GlobalLies for research purposes, aiming to support the development of mitigation strategies to reduce the spread of global misinformation: https://github.com/zohaib-khan5040/globallies