Gail Rosen

2026

Counterfactual Auditing of Cross-Cultural Variation in LLM-Generated Medical Advice
Hyunwoo Yoo | Gail Rosen
Proceedings of the 1st Workshop on Stereotypes Across Cultures in Language Technologies (StereACuLT 2026)

Large language models (LLMs) are increasingly explored for patient-facing medical advice and symptom triage, yet their responses may shift when identical clinical evidence is paired with culturally marked patient descriptors. We present a counterfactual audit framework for evaluating cross-cultural variation in LLM-generated medical advice by isolating identity-related cues while holding clinical evidence constant. Our evaluation uses matched clinical vignettes, cross-regional and culturally marked prompt variants, repeated sampling, and structured comparison of urgency framing, safety recommendations, empathy, and escalation advice. Across multiple commercial and open-weight LLMs, we observe measurable identity-conditioned variation in both triage decisions and interactional framing. In several cases, culturally marked descriptors shift urgency assessments or escalation recommendations despite unchanged clinical evidence. While the magnitude and direction of these effects differ across models, the results suggest that LLM-generated medical advice remains sensitive to culturally linked identity cues in ways that may affect safety-critical guidance. Our results demonstrate how culturally grounded counterfactual auditing can help identify clinically unsupported variation while distinguishing potentially harmful shifts from appropriate communication adaptation in patient-facing medical advice.

Recent Large Audio-Language Models (LALMs) integrate acoustic capabilities into reasoning, yet whether they reliably ground clinical judgments in audible evidence remains unproven. We introduce CliniCAST (Clinical Controlled Acoustic Synthetic Triage), a controlled benchmark that disentangles clinically meaningful acoustic cues from lexical content and speaker demographics. CliniCAST comprises 5,856 synthetic samples across 12 disease conditions: 4,800 audio samples forming 2,400 tagged–untagged pairs for five-level emergency triage, and 1,056 audio–text inconsistent samples in which reassuring speech is paired with high-risk acoustic cues. Evaluating a diverse suite of audio-capable foundation models, we find that LALMs exhibit fragile acoustic grounding and a pronounced “text dominance” failure mode: reassuring lexical content suppresses response to audible distress signals even under safety-critical conditions. Age and gender interactions are weak across conditions, indicating that the primary failure mode is insufficient cross-modal integration rather than demographic bias. These results suggest current LALMs are not yet robust enough for high-stakes medical triage, and motivate training objectives that explicitly enforce reliance on clinically grounded audible evidence.

Visual Interference in Speech Evaluation: Cultural Asymmetry and Cross-Modal Bias in MLLMs
Kyusik Kim | Hyunwoo Yoo | Jaehoon Choi | Gail Rosen | Bongwon Suh
Findings of the Association for Computational Linguistics: ACL 2026

The transition to end-to-end Multimodal Large Language Models (MLLMs) has positioned these architectures as active social evaluators in high-stakes domains. However, it remains unclear whether these models maintain objective auditory perception or succumb to the "Hearing with Eyes" phenomenon, where visual racial cues distort linguistic proficiency evaluations. We investigate this cross-modal bias by constructing a controlled counterfactual dataset utilizing a Visual Matched-Guise Paradigm. By pairing identical native audio with diverse visual personas across English and Korean contexts, we reveal a distinct Cultural Asymmetry in model behavior. In Anglophone settings, most closed models exhibit Reverse Linguistic Stereotyping, hallucinating non-native accents for Asian speakers despite standard native audio. Conversely, in Korean settings, the same models assign baseline-relative competence premiums across all visual personas, with the largest gains for out-group (White/Black) speakers, consistent with Expectancy Violation Theory. Our findings demonstrate that MLLMs do not merely process sensory inputs but actively reproduce context-dependent sociolinguistic ideologies.

2025

Can Large Language Models Classify and Generate Antimicrobial Resistance Genes?
Hyunwoo Yoo | Haebin Shin | Gail Rosen
Proceedings of the 24th Workshop on Biomedical Language Processing

This study explores the application of generative Large Language Models (LLMs) in DNA sequence analysis, highlighting their advantages over encoder-based models like DNABERT2 and Nucleotide Transformer. While encoder models excel in classification, they struggle to integrate external textual information. In contrast, generative LLMs can incorporate domain knowledge, such as BLASTn annotations, to improve classification accuracy even without fine-tuning. We evaluate this capability on antimicrobial resistance (AMR) gene classification, comparing generative LLMs with encoder-based baselines. Results show that LLMs significantly enhance classification when supplemented with textual information. Additionally, we demonstrate their potential in DNA sequence generation, further expanding their applicability. Our findings suggest that LLMs offer a novel paradigm for integrating biological sequences with external knowledge, bridging gaps in traditional classification methods.
