David Smahel
2026
Detecting Risky Behavior Related to Alcohol and Drug Use within Adolescents’ Private Messenger Conversations
Jaromír Plhák | Michaela Lebedíková | Ondrej Sotolar | David Smahel
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Alcohol and drug use negatively impact adolescents’ health, making early detection and prevention essential. One promising approach involves analyzing adolescents’ online conversations for signs of substance use. However, current machine learning models for online detection often rely on public data sources that fail to capture the private experiences of adolescents. In this study, we developed a BERT-based machine learning model to automatically identify discussions about alcohol and drug use with high accuracy, leveraging private messenger conversations from adolescents. Our novel dataset comprises 272,465 annotated utterances from a corpus of 1,260,492 utterances in 2,807 chats authored by 2,165 individuals, primarily in Czech. Our best BERT-based machine learning model achieved a solid F1 score of 0.817, demonstrating the feasibility of addressing this social science task even in low-resource languages like Czech. Additionally, we verified that state-of-the-art generative open-source large language models are equally effective for this task and can be successfully adapted for other languages, including English. We also analyzed misclassified utterances to identify problematic patterns and improve model performance. The resulting models have significant practical implications for parental mediation software and parental control applications. By automating substance use detection and enabling appropriate real-time interventions, these tools can contribute to safeguarding adolescents’ health.
2025
Modeling the Differential Prevalence of Online Supportive Interactions in Private Instant Messages of Adolescents
Ondrej Sotolar | Michał Tkaczyk | Jaromír Plhák | David Smahel
Findings of the Association for Computational Linguistics: NAACL 2025
This paper focuses on modeling gender-based and pair-or-group disparities in online supportive interactions among adolescents. To address the limitations of conventional social science methods in handling large datasets, this research employs language models to detect supportive interactions based on the Social Support Behavioral Code and to model their distribution. The study conceptualizes detection as a classification task, constructs a new dataset, and trains predictive models. The novel dataset comprises 196,772 utterances from 2,165 users collected from instant messenger apps. The results show that the predictions of language models can be used to effectively model the distribution of supportive interactions in private online dialogues. As a result, this study provides new computational evidence that supports the theory that supportive interactions are more prevalent in online female-to-female conversations. The findings advance our understanding of supportive interactions in adolescent communication and present methods to automate the analysis of large datasets, opening new research avenues in computational social science.