Yunghwei Lai
2026
TheraAgent: Self-Improving Therapeutic Agent for Precise and Comprehensive Treatment Planning
Junkai Li | Yunghwei Lai | Tianyi Zhu | Zheng Long Lee | Weizhi Ma | Yang Liu
Findings of the Association for Computational Linguistics: ACL 2026
Formulating a treatment plan is inherently a complex reasoning and refinement task rather than a simple generation problem. However, existing large language models (LLMs) mainly rely on one-shot output without explicit verification, which may result in rough, incomplete, and potentially unsafe treatment plans. To address these limitations, we propose TheraAgent, an agentic framework that replaces one-shot generation with an iterative generate-judge-refine pipeline. By mirroring the actual reasoning process of human experts who iteratively revise treatment plans, our framework progressively transforms coarse and incomplete drafts into precise, comprehensive, and safer therapeutic regimens. To facilitate the critical judge component, we introduce TheraJudge, a treatment-specific evaluation module integrated into the inference loop to enforce clinical standards. Experiments show TheraAgent achieves state-of-the-art results on HealthBench, leading in Accuracy and Completeness. In expert evaluations, it attains an 86% win rate against physicians, with superior Targeting and Harm Control. Moreover, the high agreement between TheraJudge and HealthBench evaluations confirms the reliability of our framework.
Beyond "I Don’t Know": Evaluating LLM Self-Awareness in Discriminating Data and Model Uncertainty
Jingyi Ren | Ante Wang | Yunghwei Lai | Xiaolong Wang | Linlu Gong | Weitao Li | Weizhi Ma | Yang Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reliable Large Language Models (LLMs) should abstain when confidence is insufficient. However, prior studies often treat refusal as a generic "I don’t know", failing to distinguish input-level ambiguity (data uncertainty) from capability limitations (model uncertainty). This lack of distinction limits downstream action decisions like requesting clarification or invoking external tools. In this work, we introduce UA-Bench, a benchmark of over 3,500 questions drawn from six datasets spanning knowledge-intensive and reasoning-intensive tasks, designed to evaluate explicit uncertainty attribution. An evaluation of 18 frontier LLMs shows that even state-of-the-art models struggle to reliably discriminate between data uncertainty and model uncertainty, and that high answer accuracy does not necessarily imply strong uncertainty attribution ability. To narrow this gap, we propose a lightweight data synthesis and reinforcement learning strategy. Experiments on both Qwen3-4B-Instruct-2507 and Qwen3-8B in thinking mode show that the proposed method improves uncertainty attribution while preserving answer accuracy. Our code and data are publicly available now.
2025
MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models
Xiaolong Wang | Zhaolu Kang | Wangyuxuan Zhai | Xinyue Lou | Yunghwei Lai | Ziyue Wang | Yawen Wang | Kaiyu Huang | Yile Wang | Peng Li | Yang Liu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. Due to their strong performance in image-text alignment, MLLMs can effectively understand image-text pairs with clear meanings. However, effectively resolving the inherent ambiguities in natural language and visual contexts remains challenging. Existing multimodal benchmarks typically overlook linguistic and visual ambiguities, relying mainly on unimodal context for disambiguation and thus failing to exploit the mutual clarification potential between modalities. To bridge this gap, we introduce MUCAR, a novel and challenging benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios. MUCAR includes: (1) a multilingual dataset where ambiguous textual expressions are uniquely resolved by corresponding visual contexts, and (2) a dual-ambiguity dataset that systematically pairs ambiguous images with ambiguous textual contexts, with each combination carefully constructed to yield a single, clear interpretation through mutual disambiguation. Extensive evaluations involving 19 state-of-the-art multimodal models—encompassing both open-source and proprietary architectures—reveal substantial gaps compared to human-level performance, highlighting the need for future research into more sophisticated cross-modal ambiguity comprehension methods, further pushing the boundaries of multimodal reasoning.
2024
ToMBench: Benchmarking Theory of Mind in Large Language Models
Zhuang Chen | Jincenzi Wu | Jinfeng Zhou | Bosi Wen | Guanqun Bi | Gongyao Jiang | Yaru Cao | Mengting Hu | Yunghwei Lai | Zexuan Xiong | Minlie Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Theory of Mind (ToM) is the cognitive capability to perceive and ascribe mental states to oneself and others. Recent research has sparked a debate over whether large language models (LLMs) exhibit a form of ToM. However, existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination, yielding inadequate assessments. To address this gap, we introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage. Based on ToMBench, we conduct extensive experiments to evaluate the ToM performance of 10 popular LLMs across tasks and abilities. We find that even the most advanced LLMs like GPT-4 lag behind human performance by over 10 percentage points, indicating that LLMs have not achieved a human-level theory of mind yet. Our aim with ToMBench is to enable an efficient and effective evaluation of LLMs’ ToM capabilities, thereby facilitating the development of LLMs with inherent social intelligence.
Co-authors
- Yang Liu 3
- Weizhi Ma 2
- Guanqun Bi 1
- Yaru Cao 1
- Zhuang Chen 1
- Linlu Gong 1
- Mengting Hu 1
- Minlie Huang 1
- Kaiyu Huang (黄锴宇) 1
- Gongyao Jiang 1
- Zhaolu Kang 1
- Zheng Long Lee 1
- Junkai Li 1
- Peng Li 1
- Weitao Li 1
- Xinyue Lou (娄馨月) 1
- Jingyi Ren 1
- Xiaolong Wang 1
- Ziyue Wang 1
- Yawen Wang 1
- Yile Wang 1
- Ante Wang 1
- Xiaolong Wang 1
- Bosi Wen 1
- Jincenzi Wu 1
- Zexuan Xiong 1
- Wangyuxuan Zhai 1
- Jinfeng Zhou 1
- Tianyi Zhu 1