Munmun De Choudhury
2026
Auditing LLM Responses to Harmful Stereotypes Targeting Mental Health Groups
Arka Dutta | Rijul Magu | Sean Kim | Seohee Yoon | Munmun De Choudhury | Ashiqur R. KhudaBukhsh
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) can exhibit imbalanced biases against vulnerable groups, but how they rationalize stereotypes and rights restrictions targeting mental health entities remains underexplored. We audit a broad suite of open-weight LLMs on stereotype-justification prompts tied to mental health identities. We find that several widely used models endorse harmful stereotypes when explicitly asked to justify them, with endorsement varying across model families, versions, and mental health conditions. Finally, we show that widely used harmful-content evaluation and moderation frameworks often miss these nuanced, discriminatory responses, highlighting a gap in current AI safety evaluation for mental health groups.
What About the Scene With the Hitler Reference? HAUNT: A Framework to Probe LLMs’ Self-consistency in Closed Domains Via Adversarial Nudge
Arka Dutta | Sujan Dutta | Rijul Magu | Soumyajit Datta | Munmun De Choudhury | Ashiqur R. KhudaBukhsh
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Hallucinations pose a critical challenge to the real-world deployment of large language models (LLMs) in high-stakes domains. In this paper, we present a framework for stress testing factual fidelity in LLMs in the presence of adversarial nudge. Our framework consists of three steps. First, we instruct the LLM to produce sets of truths and lies consistent with the closed domain in question. Next, we instruct the LLM to verify the same set of assertions as truths and lies consistent with the same closed domain. Finally, we test the robustness of the LLM against the lies generated (and verified) by itself. Our extensive evaluation, conducted using five widely known proprietary and six open LLMs across two closed domains of popular movies and novels, reveals a wide range of susceptibility to adversarial nudges: even among the strongest proprietary LLMs, Claude exhibits strong resilience, GPT and Grok demonstrate moderate resilience, while Gemini and DeepSeek show weak resilience and open models fall short significantly.
Reasoning Is Not All You Need: Examining LLMs for Multi-Turn Mental Health Conversations
Mohit Chandra | Siddharth Sriraman | Harneet Singh Khanuja | Yiqiao Jin | Munmun De Choudhury
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Limited access to mental healthcare, extended wait times, and the increasing capabilities of Large Language Models (LLMs) have led individuals to turn to LLMs to meet their mental health needs. However, the multi-turn mental health conversation capabilities of LLMs remain under-explored. Existing evaluation frameworks typically focus on diagnostic accuracy and win-rates and often overlook alignment with the patient-specific goals, values, and personalities required for meaningful conversations. To address this, we introduce MedAgent, a novel framework for synthetically generating realistic, multi-turn mental health sensemaking conversations, and use it to create the Mental Health Sensemaking Dialogue (MHSD) dataset, comprising over 2,200 patient–LLM conversations. Additionally, we present MultiSenseEval, a holistic framework to evaluate the multi-turn conversation abilities of LLMs in healthcare settings using human-centric criteria. Our findings reveal that frontier reasoning models yield below-par performance for patient-centric communication and struggle with precise ("hard") diagnosis, with an average accuracy of ~31%. Additionally, we observe variation in model performance across patient personas and a performance drop as the number of turns in the conversation increases. Our work provides a comprehensive synthetic data generation framework, a dataset, and an evaluation framework for assessing LLMs in multi-turn mental health conversations.
Responsible Evaluation of AI for Mental Health
Hiba Arnaout | Anmol Goel | H. Andrew Schwartz | Steffen T. Eberhardt | Dana Atzil-Slonim | Gavin Doherty | Brian Schwartz | Wolfgang Lutz | Tim Althoff | Munmun De Choudhury | Hamidreza Jamalabadi | Raj Sanjay Shah | Flor Miriam Plaza-del-Arco | Dirk Hovy | Maria Liakata | Iryna Gurevych
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Although artificial intelligence (AI) shows growing promise for mental health care, current approaches to evaluating AI tools in this domain remain fragmented and poorly aligned with clinical practice, social context, and first-hand user experience. This paper argues for a rethinking of responsible evaluation – what is measured, by whom, and for what purpose – by introducing an interdisciplinary framework that integrates clinical soundness, social context, and equity, providing a structured basis for evaluation. Through an analysis of 135 recent *CL publications, we identify recurring limitations, including over-reliance on generic metrics that do not capture clinical validity, therapeutic appropriateness, or user experience, limited participation from mental health professionals, and insufficient attention to safety and equity. To address these gaps, we propose a taxonomy of AI mental health support types – assessment-, intervention-, and information synthesis-oriented – each with distinct risks and evaluative requirements, and illustrate its use through case studies.
Who’s Asking? Simulating Role-Based Questions for Conversational AI Evaluation
Navreet Kaur | Hoda Ayad | Hayoung Jung | Shravika Mittal | Munmun De Choudhury | Tanu Mitra
Findings of the Association for Computational Linguistics: ACL 2026
Language model users often embed personal and social context in their questions. The asker’s role, implicit in how the question is framed, creates specific needs for an appropriate response. However, most evaluations, while capturing the model’s capability to respond, often ignore who is asking. This gap is especially critical in stigmatized domains such as opioid use disorder (OUD), where accounting for users’ contexts is essential to provide accessible, stigma-free responses. We propose CORUS (COmmunity-driven Roles for User-centric Question Simulation), a framework for simulating role-based questions. Drawing on role theory and posts from an online OUD recovery community (r/OpiatesRecovery), we first build a taxonomy of asker roles: patients, caregivers, and practitioners. Next, we use it to simulate 15,321 questions that embed each role’s goals, behaviors, and experiences. Our evaluations show that these questions are both highly believable and comparable to real-world data. When used to evaluate five LLMs, for the same question but differing roles, we find systematic differences: vulnerable roles, such as patients and caregivers, elicit more supportive responses (+17%) and reduced knowledge content (−19%) in comparison to practitioners. Our work demonstrates how implicitly signaling a user’s role shapes model responses, and provides a methodology for role-informed evaluation of conversational AI.
2025
Do Large Language Models Align with Core Mental Health Counseling Competencies?
Viet Cuong Nguyen | Mohammad Taher | Dongwan Hong | Vinicius Konkolics Possobom | Vibha Thirunellayi Gopalakrishnan | Ekta Raj | Zihang Li | Heather J. Soled | Michael L. Birnbaum | Srijan Kumar | Munmun De Choudhury
Findings of the Association for Computational Linguistics: NAACL 2025
The rapid evolution of Large Language Models (LLMs) presents a promising solution to the global shortage of mental health professionals. However, their alignment with essential counseling competencies remains underexplored. We introduce CounselingBench, a novel NCMHCE-based benchmark evaluating 22 general-purpose and medical-finetuned LLMs across five key competencies. While frontier models surpass minimum aptitude thresholds, they fall short of expert-level performance, excelling in Intake, Assessment & Diagnosis but struggling with Core Counseling Attributes and Professional Practice & Ethics. Surprisingly, medical LLMs do not outperform generalist models in accuracy, though they provide slightly better justifications while making more context-related errors. These findings highlight the challenges of developing AI for mental health counseling, particularly in competencies requiring empathy and nuanced reasoning. Our results underscore the need for specialized, fine-tuned models aligned with core mental health counseling competencies and supported by human oversight before real-world deployment. Code and data associated with this manuscript can be found at: https://github.com/cuongnguyenx/CounselingBench
Lived Experience Not Found: LLMs Struggle to Align with Experts on Addressing Adverse Drug Reactions from Psychiatric Medication Use
Mohit Chandra | Siddharth Sriraman | Gaurav Verma | Harneet Singh Khanuja | Jose Suarez Campayo | Zihang Li | Michael L. Birnbaum | Munmun De Choudhury
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Adverse Drug Reactions (ADRs) from psychiatric medications are the leading cause of hospitalizations among mental health patients. With healthcare systems and online communities facing limitations in resolving ADR-related issues, Large Language Models (LLMs) have the potential to fill this gap. Despite the increasing capabilities of LLMs, past research has not explored their capabilities in detecting ADRs related to psychiatric medications or in providing effective harm reduction strategies. To address this, we introduce the Psych-ADR benchmark and the Adverse Drug Reaction Response Assessment (ADRA) framework to systematically evaluate LLM performance in detecting ADR expressions and delivering expert-aligned mitigation strategies. Our analyses show that LLMs struggle with understanding the nuances of ADRs and differentiating between types of ADRs. While LLMs align with experts in terms of expressed emotions and tone of the text, their responses are more complex, harder to read, and only 70.86% aligned with expert strategies. Furthermore, they provide less actionable advice by a margin of 12.32% on average. Our work provides a comprehensive benchmark and evaluation framework for assessing LLMs in strategy-driven tasks within high-risk domains.
MythTriage: Scalable Detection of Opioid Use Disorder Myths on a Video-Sharing Platform
Hayoung Jung | Shravika Mittal | Ananya Aatreya | Navreet Kaur | Munmun De Choudhury | Tanu Mitra
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Understanding the prevalence of misinformation in health topics online can inform public health policies and interventions. However, measuring such misinformation at scale remains a challenge, particularly for high-stakes but understudied topics like opioid-use disorder (OUD), a leading cause of death in the U.S. We present the first large-scale study of OUD-related myths on YouTube, a widely used platform for health information. With clinical experts, we validate 8 pervasive myths and release an expert-labeled video dataset. To scale labeling, we introduce MythTriage, an efficient triage pipeline that uses a lightweight model for routine cases and defers harder ones to a high-performing, but costlier, large language model (LLM). MythTriage achieves up to 0.86 macro F1-score and is estimated to reduce annotation time and financial cost by over 76% compared to experts and full LLM labeling. We analyze 2.9K search results and 343K recommendations, uncovering how myths persist on YouTube and offering actionable insights for public health and platform moderation.
2021
Latent Hatred: A Benchmark for Understanding Implicit Hate Speech
Mai ElSherief | Caleb Ziems | David Muchlinski | Vaishnavi Anupindi | Jordyn Seybolt | Munmun De Choudhury | Diyi Yang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Hate speech has grown significantly on social media, causing serious consequences for victims of all demographics. Despite much attention being paid to characterize and detect discriminatory speech, most work has focused on explicit or overt hate speech, failing to address a more pervasive form based on coded or indirect language. To fill this gap, this work introduces a theoretically-justified taxonomy of implicit hate speech and a benchmark corpus with fine-grained labels for each message and its implication. We present systematic analyses of our dataset using contemporary baselines to detect and explain implicit hate speech, and we discuss key features that challenge existing models. This dataset will continue to serve as a useful benchmark for understanding this multifaceted issue.
Co-authors
- Michael L. Birnbaum 2
- Mohit Chandra 2
- Arka Dutta 2
- Hayoung Jung 2
- Navreet Kaur 2
- Harneet Singh Khanuja 2
- Ashiqur R. KhudaBukhsh 2
- Zihang Li 2
- Rijul Magu 2
- Tanu Mitra 2
- Shravika Mittal 2
- Siddharth Sriraman 2
- Ananya Aatreya 1
- Tim Althoff 1
- Vaishnavi Anupindi 1
- Hiba Arnaout 1
- Dana Atzil-Slonim 1
- Hoda Ayad 1
- Jose Suarez Campayo 1
- Soumyajit Datta 1
- Gavin Doherty 1
- Sujan Dutta 1
- Steffen T. Eberhardt 1
- Mai ElSherief 1
- Anmol Goel 1
- Vibha Thirunellayi Gopalakrishnan 1
- Iryna Gurevych 1
- Dongwan Hong 1
- Dirk Hovy 1
- Hamidreza Jamalabadi 1
- Yiqiao Jin 1
- Sean Kim 1
- Srijan Kumar 1
- Maria Liakata 1
- Wolfgang Lutz 1
- David Muchlinski 1
- Viet Cuong Nguyen 1
- Flor Miriam Plaza-del-Arco 1
- Vinicius Konkolics Possobom 1
- Ekta Raj 1
- H. Andrew Schwartz 1
- Brian Schwartz 1
- Jordyn Seybolt 1
- Raj Sanjay Shah 1
- Heather J. Soled 1
- Mohammad Taher 1
- Gaurav Verma 1
- Diyi Yang 1
- Seohee Yoon 1
- Caleb Ziems 1