Hyunwoo Yoo

2026

Whose Voice, Whose Avatar? Gender Matching Bias in Multimodal AI Teammates
Kyusik Kim | Jaehoon Choi | Hyunwoo Yoo | Bongwon Suh
Findings of the Association for Computational Linguistics: ACL 2026

Multimodal Large Language Models (MLLMs) are increasingly deployed as social agents, yet their ability to integrate conflicting identity cues remains underexplored. We audit gender bias in ten recent MLLMs using a counterfactual cooperative gaming task that pairs synthetic voices with avatars of varying gender presentation and visual fidelity. Our analysis reveals distinct bias patterns that can occur independently: closed-source models (e.g., Gemini 2.5/3) exhibit a near-deterministic “voice-matching” bias that enforces binary alignment between voice and appearance, whereas open-weight models (e.g., Qwen-2.5-Omni-7B) show limited responsiveness to vocal cues and instead exhibit context-driven stereotypes, such as preferring male avatars in combat scenarios. We further find that reducing visual realism attenuates matching tendencies in some models. These findings demonstrate that multimodal fairness is not monolithic; models may appear unbiased on one dimension while enforcing strict identity congruence or role-based stereotypes on another. Code and data are available at https://github.com/halfhoon/whose-voice-whose-avatar.

pdf bib abs

Visual Interference in Speech Evaluation: Cultural Asymmetry and Cross-Modal Bias in MLLMs
Kyusik Kim | Hyunwoo Yoo | Jaehoon Choi | Gail Rosen | Bongwon Suh
Findings of the Association for Computational Linguistics: ACL 2026

The transition to end-to-end Multimodal Large Language Models (MLLMs) has positioned these architectures as active social evaluators in high-stakes domains. However, it remains unclear whether these models maintain objective auditory perception or succumb to the "Hearing with Eyes" phenomenon, where visual racial cues distort linguistic proficiency evaluations. We investigate this cross-modal bias by constructing a controlled counterfactual dataset utilizing a Visual Matched-Guise Paradigm. By pairing identical native audio with diverse visual personas across English and Korean contexts, we reveal a distinct Cultural Asymmetry in model behavior. In Anglophone settings, most closed models exhibit Reverse Linguistic Stereotyping, hallucinating non-native accents for Asian speakers despite standard native audio. Conversely, in Korean settings, the same models assign baseline-relative competence premiums across all visual personas, with the largest gains for out-group (White/Black) speakers, consistent with Expectancy Violation Theory. Our findings demonstrate that MLLMs do not merely process sensory inputs but actively reproduce context-dependent sociolinguistic ideologies.

pdf bib abs

Recent Large Audio-Language Models (LALMs) integrate acoustic capabilities into reasoning, yet whether they reliably ground clinical judgments in audible evidence remains unproven. We introduce CliniCAST (Clinical Controlled Acoustic Synthetic Triage), a controlled benchmark that disentangles clinically meaningful acoustic cues from lexical content and speaker demographics. CliniCAST comprises 5,856 synthetic samples across 12 disease conditions: 4,800 audio samples forming 2,400 tagged–untagged pairs for five-level emergency triage, and 1,056 audio–text inconsistent samples in which reassuring speech is paired with high-risk acoustic cues. Evaluating a diverse suite of audio-capable foundation models, we find that LALMs exhibit fragile acoustic grounding and a pronounced “text dominance” failure mode: reassuring lexical content suppresses response to audible distress signals even under safety-critical conditions. Age and gender interactions are weak across conditions, indicating that the primary failure mode is insufficient cross-modal integration rather than demographic bias. These results suggest current LALMs are not yet robust enough for high-stakes medical triage, and motivate training objectives that explicitly enforce reliance on clinically grounded audible evidence.

pdf bib abs

As numerous instruction-tuning datasets continue to emerge, dynamically balancing and optimizing their mixtures has become a criticalchallenge. To address this, we propose DynamixSFT, a dynamic and automated method for instruction-tuning dataset mixture optimization. We formulate the problem as a multi-armed bandit setup and introduce a Prior-scaled Boltzmann Exploration that softly anchors the updated sampling distribution to the original dataset proportions, thereby preserving the inherent diversity and coverage of the collection. Sampling probabilities are updated using a lightweight 1-Step Look-ahead Reward, reflecting how much the dataset contributes to improving the model’s performance at its current state. We demonstrate that DynamixSFT effectively optimizes the TÜLU-2-mixture andTÜLU-3-mixture collections across 10 benchmarks, while introducing minimal computational overhead over naive sampling. Furthermore, we provide a comprehensive analysis and visualizations to offer deeper insights into the adaptive dynamics of our method.

2025

pdf bib abs

Can Large Language Models Classify and Generate Antimicrobial Resistance Genes?
Hyunwoo Yoo | Haebin Shin | Gail Rosen
Proceedings of the 24th Workshop on Biomedical Language Processing

This study explores the application of generative Large Language Models (LLMs) in DNA sequence analysis, highlighting their advantages over encoder-based models like DNABERT2 and Nucleotide Transformer. While encoder models excel in classification, they struggle to integrate external textual information. In contrast, generative LLMs can incorporate domain knowledge, such as BLASTn annotations, to improve classification accuracy even without fine-tuning. We evaluate this capability on antimicrobial resistance (AMR) gene classification, comparing generative LLMs with encoder-based baselines. Results show that LLMs significantly enhance classification when supplemented with textual information. Additionally, we demonstrate their potential in DNA sequence generation, further expanding their applicability. Our findings suggest that LLMs offer a novel paradigm for integrating biological sequences with external knowledge, bridging gaps in traditional classification methods.

pdf bib abs

Enhancing Antimicrobial Drug Resistance Classification by Integrating Sequence-Based and Text-Based Representations
Hyunwoo Yoo | Bahrad Sokhansanj | James Brown
Proceedings of the 24th Workshop on Biomedical Language Processing

Antibiotic resistance identification is essential for public health, medical treatment, and drug development. Traditional sequence-based models struggle with accurate resistance prediction due to the lack of biological context. To address this, we propose an NLP-based model that integrates genetic sequences with structured textual annotations, including gene family classifications and resistance mechanisms. Our approach leverages pretrained language models for both genetic sequences and biomedical text, aligning biological metadata with sequence-based embeddings. We construct a novel dataset based on the Antibiotic Resistance Ontology (ARO), consolidating gene sequences with resistance-related textual information. Experiments show that incorporating domain knowledge significantly improves classification accuracy over sequence-only models, reducing reliance on exhaustive laboratory testing. By integrating genetic sequence processing with biomedical text understanding, our approach provides a scalable and interpretable solution for antibiotic resistance prediction.

Co-authors

Qi Chen 1

Lei Ji 1

Venues

Fix author