Laleh Seyyed-Kalantari


2026

While zero-shot instructional prompts like “Let’s think step-by-step” have revolutionized Large Language Model performance, we lack systematic understanding of why: which specific words drive their effectiveness, and how do these patterns vary across tasks and models? We introduce the ZIP score (Zero-shot Importance of Perturbation), a metric that quantifies individual word importance through controlled, semantically meaningful perturbations. To enable rigorous evaluation, we also introduce the first ground-truth benchmark for prompt interpretability, a set of validation prompts with predetermined keywords where ZIP achieves 95.8% accuracy compared to 65.8% for LIME. Analyzing six flagship models across seven prompts and multiple task domains, we find that word importance is task-dependent (“step-by-step” dominates mathematical reasoning; “think” matters more for common-sense tasks), varies systematically across model families, and correlates inversely with model performance, suggesting prompts have greatest impact on tasks where models struggle. Our findings advance prompt science, providing both practical guidance for prompt engineering and theoretical understanding of how instructional language shapes model behavior.
Bias evaluation in large language models (LLMs) uses many metrics and benchmarks, but lacks a systematic way to measure agreement across bias metrics and models. As a result, improvements observed under one metric may contradict another, and model rankings may reflect benchmark-specific artifacts rather than stable bias profiles. In this work, we introduce Metric Agreement Score (MeAS) and Model Agreement Score (MoAS), which quantify cross-metric and cross-model agreement in bias rankings, respectively. We apply these measures to eight LLMs, seven bias metrics, and nine corpora. Our results reveal disagreement among both metrics and models: contrary to expectations, we find that metrics within the same category (generation-based and probabilistic) often behave independently of each other. For instance, HONEST is independent of toxicity metrics, and the Context Association Test shows no correlation with the Language Modeling Bias metric. At the model level, DeepSeek-family models invert bias rankings relative to most others, indicating that the model family strongly shapes specific bias profiles. These findings challenge the assumption that bias mitigation is universally transferable and highlight the need for agreement-aware evaluation.

2025

Large language models (LLMs) struggle to navigate culturally specific communication norms, limiting their effectiveness in global contexts. We focus on Persian *taarof*, a social norm in Iranian interactions, which is a sophisticated system of ritual politeness that emphasizes deference, modesty, and indirectness, yet remains absent from existing cultural benchmarks. We introduce **TaarofBench**, the first benchmark for evaluating LLM understanding of taarof, comprising 450 role-play scenarios covering 12 common social interaction topics, validated by native speakers. Our evaluation of five frontier LLMs reveals substantial gaps in cultural competence, with accuracy rates 40-48% below native speakers when taarof is culturally appropriate. Performance varies between interaction topics, improves with Persian-language prompts, and exhibits gender-based asymmetries. We also show that responses rated “polite” by standard metrics often violate taarof norms, indicating the limitations of Western politeness frameworks. Through supervised fine-tuning and Direct Preference Optimization, we achieve 21.8% and 42.3% improvement, respectively, in model alignment with cultural expectations. Our human study with 33 participants (11 native Persian, 11 heritage, and 11 non-Iranian speakers) establishes baselines across varying degrees of familiarity with Persian norms. This work lays the foundation for developing diverse and culturally aware LLMs, enabling applications that better navigate complex social interactions.