Juan Wang


2026

The critical therapist shortage demands scalable training solutions. Standardized Patients, the gold standard, are scarce and costly. Current LLM-based approaches focus on patient simulation for conversational realism but lack the pedagogical rigor required of Virtual Standardized Patients: they neither react faithfully to clinical errors nor provide explainable feedback. To bridge this gap, we propose PUPPET, the first neural-symbolic Virtual Standardized Patient governed by an OBSERVE-THINK-BEHAVE architecture. PUPPET externalizes LLM reasoning into a symbolic system where experts implant causal associations between intervention logic (propositional logic) and patient mental states (state machine). This allows PUPPET to behave coherently with controllable and explainable psychological dynamics: intervention logic (OBSERVE) → state transition (THINK) → response (BEHAVE). Our PUPPET-TRAINER further leverages this chain to educate trainees about intervention consequences, standardizing and scaling mental health training. Experiments across three clinical scenarios confirm that PUPPET outperforms baselines in clinical faithfulness and pedagogical value.
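The OBSERVE → THINK → BEHAVE chain described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from PUPPET itself: the proposition names (`validates_feeling`, `gives_premature_advice`), the two mental states (`guarded`, `open`), the rules, and the canned replies. In particular, the canned replies stand in for the LLM-generated response, and the observations are supplied by hand where a real system would extract them from the trainee's utterance.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# OBSERVE output: truth values of hypothetical propositions about the
# trainee's intervention (illustrative names, not from the paper).
Observation = Dict[str, bool]


@dataclass(frozen=True)
class Rule:
    """An expert-authored causal association: formula -> state transition."""
    name: str
    condition: Callable[[Observation], bool]  # propositional formula
    source: str  # mental state the rule applies in
    target: str  # mental state the rule leads to


# Hypothetical rule base; a real one would be written by clinical experts.
RULES: List[Rule] = [
    Rule(
        "validation builds trust",
        lambda o: o["validates_feeling"] and not o["gives_premature_advice"],
        "guarded",
        "open",
    ),
    Rule(
        "premature advice causes withdrawal",
        lambda o: o["gives_premature_advice"],
        "open",
        "guarded",
    ),
]

# Canned replies standing in for a state-conditioned LLM response.
RESPONSES: Dict[str, str] = {
    "guarded": "I don't really want to talk about it.",
    "open": "It has been hard lately, to be honest.",
}


def think(state: str, obs: Observation) -> Tuple[str, str]:
    """THINK: fire the first rule whose source state and formula both match."""
    for rule in RULES:
        if rule.source == state and rule.condition(obs):
            return rule.target, rule.name
    return state, "no rule fired"


def behave(state: str) -> str:
    """BEHAVE: produce a reply conditioned on the current mental state."""
    return RESPONSES[state]


def step(state: str, obs: Observation) -> Tuple[str, str, str]:
    """One dialogue turn: returns (new state, reply, explanation)."""
    new_state, why = think(state, obs)
    return new_state, behave(new_state), why
```

Because each transition is tied to a named rule, the third value returned by `step` is a human-readable reason for the patient's reaction, which is the kind of trace a trainer component could surface as feedback. For example, `step("open", {"validates_feeling": True, "gives_premature_advice": True})` moves the patient back to `guarded` and names the rule responsible.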

2025

The rapid growth of video platforms has transformed information dissemination and led to an explosion of multimedia content. However, this widespread reach also introduces risks, as some users exploit these platforms to spread hate speech, which is often concealed through complex rhetoric, making hateful video detection a critical challenge. Existing detection methods rely heavily on unimodal analysis or simple feature fusion, struggling to capture cross-modal interactions and reason through implicit hate in sarcasm and metaphor. To address these limitations, we propose HVGuard, the first reasoning-based hateful video detection framework with multimodal large language models (MLLMs). Our approach integrates Chain-of-Thought (CoT) reasoning to enhance multimodal interaction modeling and implicit hate interpretation. Additionally, we design a Mixture-of-Experts (MoE) network for efficient multimodal fusion and final decision-making. The framework is modular and extensible, allowing flexible integration of different MLLMs and encoders. Experimental results demonstrate that HVGuard outperforms all existing advanced detection tools, achieving an improvement of 6.88% to 13.13% in accuracy and 9.21% to 34.37% in M-F1 on two public datasets covering both English and Chinese.
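As a rough illustration of Mixture-of-Experts fusion in general, the sketch below mixes the outputs of several linear experts using a softmax gate computed from the same input. It is not HVGuard's network: the abstract does not specify the gating scheme, the number or form of experts, or the feature layout, so the dense soft gating, the linear experts, and the idea that `x` is a concatenation of per-modality features are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights: Matrix, x: Sequence[float]) -> List[float]:
    """Matrix-vector product: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def moe_forward(
    x: Sequence[float],
    gate_weights: Matrix,
    expert_weights: Sequence[Matrix],
) -> Tuple[List[float], List[float]]:
    """Dense soft-gated mixture of linear experts.

    x              : fused feature vector (assumed: concatenated modalities)
    gate_weights   : one row per expert, scores how much each expert is used
    expert_weights : one weight matrix per expert, rows = output classes
    Returns (class probabilities, gate distribution over experts).
    """
    gate = softmax(linear(gate_weights, x))
    expert_logits = [linear(w, x) for w in expert_weights]
    n_classes = len(expert_logits[0])
    # Weighted sum of expert logits, weighted by the gate distribution.
    mixed = [
        sum(g * logits[c] for g, logits in zip(gate, expert_logits))
        for c in range(n_classes)
    ]
    return softmax(mixed), gate
```

The gate distribution is returned alongside the prediction so that one can inspect which expert dominated a given decision. A sparse variant would keep only the top-k gate entries before mixing.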