Ismael Garrido-Muñoz

Also published as: Ismael Garrido-Munoz


2026

LLMs perpetuate societal biases, such as gender stereotypes, reinforcing harmful norms and posing significant fairness risks in real-world applications. We investigate a fine-grained mitigation technique that moves beyond surface-level fixes. Our approach uses attribution graphs to identify and directly steer bias-implicated features within a Sparse Autoencoder’s (SAE) latent space. This method, known as feature steering, offers a theoretically precise, surgical intervention aimed at correcting bias at its neural source without costly retraining. We critically examine its practical reliability across various contexts. We find that steering effectiveness is highly sensitive to parameter tuning, often requiring unpredictable, context-specific adjustments. The intervention’s success exists in narrow "sweet spots," outside of which performance can degrade catastrophically. This demonstrates that while direct intervention on learned features is a powerful analytical tool, significant challenges of brittleness and instability hinder its application as a consistent, broad-scale debiasing solution, necessitating research into more robust control mechanisms.
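For readers unfamiliar with the mechanics, the sketch below illustrates the general idea of SAE feature steering as it is commonly described in the interpretability literature: a residual-stream activation is nudged along the decoder direction of a single latent feature by a tunable coefficient. The layer choice, the feature index, and the alpha coefficient here are hypothetical placeholders, and this is not the exact intervention used in the paper.

```python
# Illustrative sketch of steering one SAE feature, assuming a trained
# sparse autoencoder with a ReLU encoder `encode` and a decoder weight
# matrix `decoder_dirs` of shape (n_features, d_model). Not the paper's code.
import torch

def steer_activation(h, encode, decoder_dirs, feature_idx, alpha):
    """Move activation `h` so the chosen SAE feature reads `alpha`.

    h            : (d_model,) residual-stream activation at one layer
    feature_idx  : index of the bias-implicated feature (hypothetical)
    alpha        : steering coefficient; effectiveness is highly sensitive
                   to this value (the narrow "sweet spot" in the abstract)
    """
    f = torch.relu(encode(h))                 # sparse feature activations
    current = f[feature_idx]                  # current strength of the feature
    direction = decoder_dirs[feature_idx]     # decoder direction for the feature
    return h + (alpha - current) * direction  # shift the activation along it
```

In practice such a hook would be applied at every forward pass of the chosen layer during generation, which is where the context-specific tuning burden described above arises.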
While Large Language Models (LLMs) demonstrate remarkable text generation capabilities, they also risk inheriting and perpetuating harmful societal biases present in their vast training data. This study presents a rigorous, large-scale analysis of gender bias in a diverse set of 20 publicly available Spanish generative LLMs, ranging from 760M to 11B parameters. Our methodology utilizes a comprehensive set of specifically designed sentence templates to elicit adjectival descriptions associated with men and women in neutral contexts. We then extract and manually classify these adjectives using the Supersenses lexicosemantic framework, focusing on four key domains: BODY, BEHAVIOR, FEELING, and MIND. Our research uncovers systematic patterns consistent with pervasive cultural stereotypes, echoing findings from earlier masked language models. Women are disproportionately described by physical and emotional attributes, whereas men are more frequently associated with behavioral and cognitive traits. Finally, we investigate the relationship between model size and the intensity of these observed gender biases, offering crucial insights into how scaling affects fairness and equity in non-English models.
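As a rough illustration of the elicitation setup, the toy script below fills gendered Spanish templates, collects the adjectives a model produces, and tallies them per Supersense domain. The template wording, the `generate` callable, and the tiny lexicon are stand-ins; in the study the elicited adjectives were classified manually, not by a lookup table.

```python
# Toy sketch of template-based adjective elicitation and tallying by
# Supersense domain. Templates, lexicon, and `generate` are placeholders.
from collections import Counter

TEMPLATES = ["La mujer es muy", "El hombre es muy"]          # illustrative only
SUPERSENSES = {"guapa": "BODY", "fuerte": "BODY",
               "sensible": "FEELING", "inteligente": "MIND"}  # toy lexicon

def tally_domains(generate, templates=TEMPLATES, n_samples=100):
    counts = {t: Counter() for t in templates}
    for template in templates:
        for _ in range(n_samples):
            adjective = generate(template).strip().lower()  # model completion
            domain = SUPERSENSES.get(adjective)              # manual step in the study
            if domain:
                counts[template][domain] += 1
    return counts
```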

2025

Robust bias testing in AI systems is a critical need, yet current methods often rely on overly simplistic or rigid persona templates, limiting the depth and realism of fairness evaluations. We introduce a novel framework and an associated tool designed to generate high-quality, diverse, and configurable personas specifically for nuanced bias assessment. Our core innovation is a two-stage process. First, structured persona tags are generated solely from user-defined configurations (specified manually or via an included agent tool), ensuring that attribute distributions are controlled and, crucially, not skewed by an LLM's inherent biases about attribute correlations during the selection phase. Second, these controlled tags are transformed into realistic outputs, including natural language descriptions, CVs, or profiles, suitable for diverse bias testing scenarios. This tag-centric approach preserves ground-truth attributes for analyzing correlations and biases within the generated population and downstream AI applications. We demonstrate the system's efficacy by generating and validating 1,000 personas, analyzing both the adherence of the natural language descriptions to their source tags and the potential biases introduced by the LLM during the transformation step. The provided dataset, including both the generated personas and their source tags, enables detailed analysis. This work is a significant step towards more reliable, controllable, and representative fairness testing in AI development.
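To make the two-stage design concrete, the sketch below first samples persona tags from a user-defined configuration, so that attribute distributions and any correlations are fixed by the user rather than by the model, and then asks an LLM to render each tag set as a short description. The attribute names, distributions, prompt wording, and `llm` callable are illustrative assumptions, not the tool's actual interface.

```python
# Hedged sketch of the two-stage idea: stage 1 samples controlled persona
# tags; stage 2 has an LLM render them as text. All names are placeholders.
import random

CONFIG = {
    "gender": {"female": 0.5, "male": 0.5},
    "age": {"20-35": 0.4, "36-55": 0.4, "56+": 0.2},
    "profession": {"nurse": 0.25, "engineer": 0.25,
                   "teacher": 0.25, "lawyer": 0.25},
}

def sample_tags(config=CONFIG, rng=random):
    # Attributes are drawn from the configured distributions, so the
    # population's make-up is controlled by the user, not by the LLM.
    return {attr: rng.choices(list(dist), weights=list(dist.values()))[0]
            for attr, dist in config.items()}

def render_persona(llm, tags):
    # Stage 2: transform the ground-truth tags into a natural language
    # description; the tags are kept alongside it for later bias analysis.
    prompt = ("Write a short, realistic persona description for a person "
              f"with these attributes: {tags}. Do not add demographic facts.")
    return {"tags": tags, "description": llm(prompt)}
```

Keeping the sampled tags next to each rendered description is what allows the adherence and transformation-bias checks mentioned above to be run against ground truth.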