Sourabh Zanwar


2023

SMHD-GER: A Large-Scale Benchmark Dataset for Automatic Mental Health Detection from Social Media in German
Sourabh Zanwar | Daniel Wiechmann | Yu Qiao | Elma Kerz
Findings of the Association for Computational Linguistics: EACL 2023

Mental health problems are a challenge to our modern society, and their prevalence is predicted to increase worldwide. Recently, a surge of research has demonstrated the potential of automated detection of mental health conditions (MHC) through social media posts, with the ultimate goal of enabling early intervention and monitoring population-level health outcomes in real time. Progress in this area of research is highly dependent on the availability of high-quality datasets and benchmark corpora. However, the publicly available datasets for understanding and modelling MHC are largely confined to the English language. In this paper, we introduce SMHD-GER (Self-Reported Mental Health Diagnoses for German), a large-scale, carefully constructed dataset for MHC detection built on high-precision patterns and the approach proposed for English. We provide benchmark models for this dataset to facilitate further research and conduct extensive experiments. These models leverage engineered (psycho-)linguistic features as well as BERT-German. We also examine nuanced patterns of linguistic markers characteristic of specific MHC.

What to Fuse and How to Fuse: Exploring Emotion and Personality Fusion Strategies for Explainable Mental Disorder Detection
Sourabh Zanwar | Xiaofei Li | Daniel Wiechmann | Yu Qiao | Elma Kerz
Findings of the Association for Computational Linguistics: ACL 2023

Mental health disorders (MHD) are increasingly prevalent worldwide and constitute one of the greatest challenges facing our healthcare systems and modern societies in general. In response to this societal challenge, there has been a surge in digital mental health research geared towards the development of new techniques for unobtrusive and efficient automatic detection of MHD. Within this area of research, natural language processing techniques are playing an increasingly important role, showing promising detection results from a variety of textual data. Recently, there has been a growing interest in improving mental illness detection from textual data by way of leveraging emotions: ‘Emotion fusion’ refers to the process of integrating emotion information with general textual information to obtain enhanced information for decision-making. However, while the available research has shown that MHD prediction can be improved through a variety of different fusion strategies, previous works have been confined to a particular fusion strategy applied to a specific dataset, and so are limited by the lack of meaningful comparability. In this work, we integrate and extend this research by conducting extensive experiments with three types of deep learning-based fusion strategies: (i) feature-level fusion, where a pre-trained masked language model for mental health detection (MentalRoBERTa) was infused with a comprehensive set of engineered features, (ii) model fusion, where the MentalRoBERTa model was infused with hidden representations of other language models, and (iii) task fusion, where a multi-task framework was leveraged to learn the features for auxiliary tasks. In addition to exploring the role of different fusion strategies, we expand on previous work by broadening the information infusion to include a second domain related to mental health, namely personality. We evaluate algorithm performance on data from two benchmark datasets, encompassing five mental health conditions: attention deficit hyperactivity disorder, anxiety, bipolar disorder, depression and psychological stress.

2022

Improving the Generalizability of Text-Based Emotion Detection by Leveraging Transformers with Psycholinguistic Features
Sourabh Zanwar | Daniel Wiechmann | Yu Qiao | Elma Kerz
Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS)

In recent years, there has been increased interest in building predictive models that harness natural language processing and machine learning techniques to detect emotions from various text sources, including social media posts, micro-blogs or news articles. Yet, deployment of such models in real-world sentiment and emotion applications faces challenges, in particular poor out-of-domain generalizability. This is likely due to domain-specific differences (e.g., topics, communicative goals, and annotation schemes) that make transfer between different models of emotion recognition difficult. In this work we propose approaches for text-based emotion detection that leverage transformer models (BERT and RoBERTa) in combination with Bidirectional Long Short-Term Memory (BiLSTM) networks trained on a comprehensive set of psycholinguistic features. First, we evaluate the performance of our models within-domain on two benchmark datasets, GoEmotion (Demszky et al., 2020) and ISEAR (Scherer and Wallbott, 1994). Second, we conduct transfer learning experiments on six datasets from the Unified Emotion Dataset (Bostan and Klinger, 2018) to evaluate their out-of-domain robustness. We find that the proposed hybrid models improve the ability to generalize to out-of-distribution data compared to a standard transformer-based approach. Moreover, we observe that these models perform competitively on in-domain data.

MANTIS at SMM4H’2022: Pre-Trained Language Models Meet a Suite of Psycholinguistic Features for the Detection of Self-Reported Chronic Stress
Sourabh Zanwar | Daniel Wiechmann | Yu Qiao | Elma Kerz
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task

This paper describes our submission to Social Media Mining for Health (SMM4H) 2022 Shared Task 8, aimed at detecting self-reported chronic stress on Twitter. Our approach leverages a pre-trained transformer model (RoBERTa) in combination with a Bidirectional Long Short-Term Memory (BiLSTM) network trained on a diverse set of psycholinguistic features. We handle the class imbalance issue in the training dataset by augmenting it with another dataset used for stress classification in social media.

The Best of Both Worlds: Combining Engineered Features with Transformers for Improved Mental Health Prediction from Reddit Posts
Sourabh Zanwar | Daniel Wiechmann | Yu Qiao | Elma Kerz
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task

In recent years, there has been increasing interest in the application of natural language processing and machine learning techniques to the detection of mental health conditions (MHC) based on social media data. In this paper, we aim to improve the state-of-the-art (SoTA) detection of six MHC in Reddit posts in two ways: First, we build models leveraging Bidirectional Long Short-Term Memory (BLSTM) networks trained on in-text distributions of a comprehensive set of psycholinguistic features for more explainable MHC detection as compared to black-box solutions. Second, we combine these BLSTM models with Transformers to improve the prediction accuracy over SoTA models. In addition, we uncover nuanced patterns of linguistic markers characteristic of specific MHC.

Pushing on Personality Detection from Verbal Behavior: A Transformer Meets Text Contours of Psycholinguistic Features
Elma Kerz | Yu Qiao | Sourabh Zanwar | Daniel Wiechmann
Proceedings of the 12th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis

Research at the intersection of personality psychology, computer science, and linguistics has recently focused increasingly on modeling and predicting personality from language use. We report two major improvements in predicting personality traits from text data: (1) to our knowledge, the most comprehensive set of theory-based psycholinguistic features and (2) hybrid models that integrate a pre-trained Transformer language model (BERT) and Bidirectional Long Short-Term Memory (BLSTM) networks trained on within-text distributions (‘text contours’) of psycholinguistic features. We experiment with BLSTM models (with and without Attention) and with two techniques for applying pre-trained language representations from the transformer model: ‘feature-based’ and ‘fine-tuning’. We evaluate the performance of the models we built on two benchmark datasets that target the two dominant theoretical models of personality: the Big Five Essay dataset (Pennebaker and King, 1999) and the MBTI Kaggle dataset (Li et al., 2018). Our results are encouraging as our models outperform existing work on the same datasets. More specifically, our models improve classification accuracy by 2.9% on the Essay dataset and 8.28% on the Kaggle MBTI dataset. In addition, we perform ablation experiments to quantify the impact of different categories of psycholinguistic features in the respective personality prediction models.

Exploring Hybrid and Ensemble Models for Multiclass Prediction of Mental Health Status on Social Media
Sourabh Zanwar | Daniel Wiechmann | Yu Qiao | Elma Kerz
Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI)

In recent years, there has been a surge of interest in research on automatic mental health detection (MHD) from social media data leveraging advances in natural language processing and machine learning techniques. While significant progress has been achieved in this interdisciplinary research area, the vast majority of work has treated MHD as a binary classification task. The multiclass classification setup is, however, essential if we are to uncover the subtle differences among the statistical patterns of language use associated with particular mental health conditions. Here, we report on experiments aimed at predicting six conditions (anxiety, attention deficit hyperactivity disorder, bipolar disorder, post-traumatic stress disorder, depression, and psychological stress) from Reddit social media posts. We explore and compare the performance of hybrid and ensemble models leveraging transformer-based architectures (BERT and RoBERTa) and BiLSTM neural networks trained on within-text distributions of a diverse set of linguistic features. This set encompasses measures of syntactic complexity, lexical sophistication and diversity, readability, and register-specific n-gram frequencies, as well as sentiment and emotion lexicons. In addition, we conduct feature ablation experiments to investigate which types of features are most indicative of particular mental health conditions.

SPADE: A Big Five-Mturk Dataset of Argumentative Speech Enriched with Socio-Demographics for Personality Detection
Elma Kerz | Yu Qiao | Sourabh Zanwar | Daniel Wiechmann
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In recent years, there has been increasing interest in automatic personality detection based on language. Progress in this area is highly contingent upon the availability of datasets and benchmark corpora. However, publicly available datasets for modeling and predicting personality traits are still scarce. While recent efforts to create such datasets from social media (Twitter, Reddit) are to be applauded, they often do not include continuous and contextualized language use. In this paper, we introduce SPADE, the first dataset with continuous samples of argumentative speech labeled with the Big Five personality traits and enriched with socio-demographic data (age, gender, education level, language background). We provide benchmark models for this dataset to facilitate further research and conduct extensive experiments. Our models leverage 436 (psycho)linguistic features extracted from transcribed speech and speaker-level metainformation with transformers. We conduct feature ablation experiments to investigate which types of features contribute to the prediction of individual personality traits.