Koustuv Saha
2026
RubRIX: Rubric-Driven Risk Mitigation in Caregiver-AI Interactions
Drishti Goel | Jeongah Lee | Qiuyue Zhong | Violeta J. Rodriguez | Daniel S. Brown | Ravi Karkar | Dong Whi Yoo | Koustuv Saha
Findings of the Association for Computational Linguistics: ACL 2026
Caregivers seeking AI-mediated support express complex needs (information seeking, emotional validation, and distress cues) that warrant careful evaluation of response safety and appropriateness. Existing AI evaluation frameworks, which focus primarily on general risks (toxicity, hallucinations, policy violations, etc.), may not adequately capture the nuanced risks of LLM responses in caregiving contexts. We introduce RubRIX (Rubric-based Risk Index), a theory-driven, clinician-validated framework for evaluating risks in LLM caregiving responses. Grounded in the Elements of an Ethic of Care, RubRIX operationalizes five empirically derived risk dimensions: Inattention, Bias Stigma, Information Inaccuracy, Uncritical Affirmation, and Epistemic Arrogance. We evaluate six state-of-the-art LLMs on over 20,000 caregiver queries from Reddit and ALZConnected. Rubric-guided refinement consistently reduced risk components by 45-98% after one iteration across models. This work contributes a methodological approach for developing domain-sensitive, user-centered evaluation frameworks for high-burden contexts. Our findings highlight the importance of domain-sensitive, interactional risk evaluation for the responsible deployment of LLMs in caregiving support contexts. We release benchmark datasets to enable future research on contextual risk evaluation in AI-mediated support.
2025
MoMoE: Mixture of Moderation Experts Framework for AI-Assisted Online Governance
Agam Goyal | Xianyang Zhan | Yilun Chen | Koustuv Saha | Eshwar Chandrasekharan
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have shown great potential in flagging harmful content in online communities. Yet, existing approaches for moderation require a separate model for every community and are opaque in their decision-making, limiting real-world adoption. We introduce Mixture of Moderation Experts (MoMoE), a modular, cross-community framework that adds post-hoc explanations to enable scalable content moderation. MoMoE orchestrates four operators—Allocate, Predict, Aggregate, Explain—and is instantiated as seven community-specialized experts (MoMoE-Community) and five norm-violation experts (MoMoE-NormVio). On 30 unseen subreddits, the best variants obtain Micro-F1 scores of 0.72 and 0.67, respectively, matching or surpassing strong fine-tuned baselines while consistently producing concise and reliable explanations. Although community-specialized experts deliver the highest peak accuracy, norm-violation experts provide steadier performance across domains. These findings show that MoMoE yields scalable, transparent moderation without needing per-community fine-tuning. More broadly, they suggest that lightweight, explainable expert ensembles can guide future NLP and HCI research on trustworthy human-AI governance of online communities.
SLM-Mod: Small Language Models Surpass LLMs at Content Moderation
Xianyang Zhan | Agam Goyal | Yilun Chen | Eshwar Chandrasekharan | Koustuv Saha
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Large language models (LLMs) have shown promise in many natural language understanding tasks, including content moderation. However, these models can be expensive to query in real time and do not allow for a community-specific approach to content moderation. To address these challenges, we explore the use of open-source small language models (SLMs) for community-specific content moderation tasks. We fine-tune and evaluate SLMs (fewer than 15B parameters) by comparing their performance against much larger open- and closed-source models in both zero-shot and few-shot settings. Using 150K comments from 15 popular Reddit communities, we find that SLMs outperform zero-shot LLMs at content moderation, with 11.5% higher accuracy and 25.7% higher recall on average across all communities. Moreover, few-shot in-context learning leads to only a marginal increase in the performance of LLMs, which still lag behind SLMs. We further show the promise of cross-community content moderation, which has implications for new communities and the development of cross-platform moderation techniques. Finally, we outline directions for future work on language-model-based content moderation.