Anh C. Pham

2025

Supervised fine-tuning (SFT) on benign data can paradoxically erode a language model’s safety alignment, a phenomenon known as catastrophic forgetting of safety behaviors. Although prior work shows that randomly adding safety examples can reduce harmful output, the principles that make certain examples more effective than others remain poorly understood. This paper investigates the hypothesis that the effectiveness of a safety example is governed by two key factors: its instruction-response behavior (e.g., refusal vs. explanation) and its semantic diversity across harm categories. We systematically evaluate sampling strategies based on these axes and find that structured, diversity-aware sampling significantly improves model safety. Our method reduces harmfulness by up to 41% while adding only 0.05% more data to the fine-tuning set.

2024

pdf bib abs
Inferring Mental Burnout Discourse Across Reddit Communities
Nazanin Sabri | Anh C. Pham | Ishita Kakkar | Mai ElSherief
Proceedings of the Third Workshop on NLP for Positive Impact

Mental burnout refers to a psychological syndrome induced by chronic stress that negatively impacts the emotional and physical well-being of individuals. From the occupational context to personal hobbies, burnout is pervasive across domains and therefore affects the morale and productivity of society as a whole. Currently, no linguistic resources are available for the analysis or detection of burnout language. We address this gap by introducing a dataset annotated for burnout language. Given that social media is a platform for sharing life experiences and mental health struggles, our work examines the manifestation of burnout language in Reddit posts. We introduce a contextual word sense disambiguation approach to identify the specific meaning or context in which the word “burnout” is used, distinguishing between its application in mental health (e.g., job-related stress leading to burnout) and non-mental health contexts (e.g., engine burnout in a mechanical context). We create a dataset of 2,330 manually labeled Reddit posts for this task, as well as annotating the reason the poster associates with their burnout (e.g., professional, personal, non-traditional). We train machine learning models on this dataset achieving a minimum F1 score of 0.84 on the different tasks. We make our dataset of annotated Reddit post IDs publicly available to help advance future research in this field.

Co-authors

Ehi Nosakhare 1

Nazanin Sabri 1

Soundararajan Srinivasan 1

Michael Sun 1

Mihir Thalanki 1

Tian Xia 1

Venues

Fix author