Yongkang Huang


2025

MAGI: Multi-Agent Guided Interview for Psychiatric Assessment
Guanqun Bi | Zhuang Chen | Zhoufu Liu | Hongkai Wang | Xiyao Xiao | Yuqiang Xie | Wen Zhang | Yongkang Huang | Yuxuan Chen | Libiao Peng | Minlie Huang
Findings of the Association for Computational Linguistics: ACL 2025

Automating structured clinical interviews could revolutionize mental healthcare accessibility, yet existing large language model (LLM) approaches fail to align with psychiatric diagnostic protocols. We present MAGI, the first framework that transforms the gold-standard Mini International Neuropsychiatric Interview (MINI) into automatic computational workflows through coordinated multi-agent collaboration. MAGI dynamically navigates clinical logic via four specialized agents: 1) an interview-tree-guided navigation agent adhering to the MINI’s branching structure, 2) an adaptive question agent blending diagnostic probing, explanation, and empathy, 3) a judgment agent validating whether the response from participants meets the node, and 4) a diagnosis agent generating Psychometric Chain-of-Thought (PsyCoT) traces that explicitly map symptoms to clinical criteria. Experimental results on 1,002 real-world participants covering depression, generalized anxiety, social anxiety, and suicide show that MAGI advances LLM-assisted mental health assessment by combining clinical rigor, conversational adaptability, and explainable reasoning.

2024

SafetyBench: Evaluating the Safety of Large Language Models
Zhexin Zhang | Leqi Lei | Lindong Wu | Rui Sun | Yongkang Huang | Chong Long | Xiao Liu | Xuanyu Lei | Jie Tang | Minlie Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

With the rapid development of Large Language Models (LLMs), increasing attention has been paid to their safety concerns. Consequently, evaluating the safety of LLMs has become an essential task for facilitating the broad applications of LLMs. Nevertheless, the absence of comprehensive safety evaluation benchmarks poses a significant impediment to effectively assessing and enhancing the safety of LLMs. In this work, we present SafetyBench, a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple-choice questions spanning 7 distinct categories of safety concerns. Notably, SafetyBench also incorporates both Chinese and English data, facilitating evaluation in both languages. Our extensive tests over 25 popular Chinese and English LLMs in both zero-shot and few-shot settings reveal a substantial performance advantage for GPT-4 over its counterparts and show that there is still significant room for improving the safety of current LLMs. We also demonstrate that the measured safety understanding abilities in SafetyBench are correlated with safety generation abilities. Data and evaluation guidelines are available at https://github.com/thu-coai/SafetyBench. Submission entrance and leaderboard are available at https://llmbench.ai/safety.

CharacterGLM: Customizing Social Characters with Large Language Models
Jinfeng Zhou | Zhuang Chen | Dazhen Wan | Bosi Wen | Yi Song | Jifan Yu | Yongkang Huang | Pei Ke | Guanqun Bi | Libiao Peng | JiaMing Yang | Xiyao Xiao | Sahand Sabour | Xiaohan Zhang | Wenjing Hou | Yijia Zhang | Yuxiao Dong | Hongning Wang | Jie Tang | Minlie Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

Character-based dialogue (CharacterDial) has become essential in the industry (e.g., Character.AI), enabling users to freely customize social characters for social interactions. However, the generalizability and adaptability across various conversational scenarios inherent in customizing social characters still lack public industrial solutions. To address these challenges, by dissecting well-rounded social characters composed of both inherent social profiles and external social behaviors, we manually collect a large-scale Chinese corpus featuring characters with diverse categories and behaviors, and develop CharacterGLM models alongside well-designed refinement methods. Extensive experiments show that CharacterGLM outperforms most popular open- and closed-source LLMs and performs comparably to GPT-4. We will release our data and models for local development and deployment.