2024
Archetypes and Entropy: Theory-Driven Extraction of Evidence for Suicide Risk
Vasudha Varadarajan | Allison Lahnala | Adithya V Ganesan | Gourab Dey | Siddharth Mangalik | Ana-Maria Bucur | Nikita Soni | Rajath Rao | Kevin Lanning | Isabella Vallejo | Lucie Flek | H. Andrew Schwartz | Charles Welch | Ryan Boyd
Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024)
Research on psychological risk factors for suicide has developed for decades. However, combining explainable theory with modern data-driven language model approaches is non-trivial. In this study, we propose and evaluate methods for identifying language patterns aligned with theories of suicide risk by combining theory-driven suicidal archetypes with language model-based and relative entropy-based approaches. Archetypes are based on prototypical statements that evince risk of suicidality, while relative entropy considers the ratio of how unusual a risk-familiar and an unfamiliar model each find the statements. While both approaches performed similarly on their own, we find that combining the two significantly improved performance in the shared task evaluations, yielding our combined system submission with a BERTScore Recall of 0.906. Consistent with the literature, we find that titles are highly informative as suicide risk evidence, despite their brevity. We conclude that a combination of theory- and data-driven methods is needed in the mental health space and can outperform more modern prompt-based methods.
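The relative-entropy idea above can be illustrated with a toy sketch: score a statement by the ratio of its surprisal under a general-domain model to its surprisal under a risk-familiar model. The unigram models, corpora, and vocabulary below are illustrative stand-ins, not the paper's actual language models.

```python
import math
from collections import Counter

def unigram_model(corpus_tokens, vocab):
    """Build a Laplace-smoothed unigram probability table."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def surprisal(tokens, model):
    """Mean negative log-probability of tokens under a unigram model."""
    return sum(-math.log(model[t]) for t in tokens) / len(tokens)

# Toy corpora: a "risk-familiar" corpus and a general-domain corpus.
familiar = "hopeless alone pain hopeless tired alone".split()
general = "weather lunch game weather friends lunch".split()
vocab = set(familiar) | set(general)

p_familiar = unigram_model(familiar, vocab)
p_general = unigram_model(general, vocab)

def entropy_ratio(tokens):
    """How much less surprising the risk-familiar model finds the text."""
    return surprisal(tokens, p_general) / surprisal(tokens, p_familiar)

candidate = "hopeless alone pain".split()
neutral = "weather lunch game".split()
print(entropy_ratio(candidate) > entropy_ratio(neutral))  # → True
```

A ratio above 1 marks text that the risk-familiar model finds markedly less unusual than the general model does, which is the signal the method treats as candidate evidence.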
MentalHelp: A Multi-Task Dataset for Mental Health in Social Media
Nishat Raihan | Sadiya Sayara Chowdhury Puspo | Shafkat Farabi | Ana-Maria Bucur | Tharindu Ranasinghe | Marcos Zampieri
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Early detection of mental health disorders is an essential step in treating and preventing mental health conditions. Computational approaches have been applied to users’ social media profiles in an attempt to identify various mental health conditions such as depression, PTSD, schizophrenia, and eating disorders. The interest in this topic has motivated the creation of various depression detection datasets. However, annotating such datasets is expensive and time-consuming, limiting their size and scope. To overcome this limitation, we present MentalHelp, a large-scale semi-supervised mental disorder detection dataset containing 14 million instances. The corpus was collected from Reddit and labeled in a semi-supervised way using an ensemble of three separate models: flan-T5, Disor-BERT, and Mental-BERT.
UniBuc at SemEval-2024 Task 2: Tailored Prompting with Solar for Clinical NLI
Marius Micluta-Campeanu | Claudiu Creanga | Ana-maria Bucur | Ana Sabina Uban | Liviu P. Dinu
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
This paper describes the approach of the UniBuc team in tackling the SemEval 2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials. We used SOLAR Instruct, without any fine-tuning, while focusing on input manipulation and tailored prompting. By customizing prompts for individual CTR sections, in both zero-shot and few-shot settings, we managed to achieve a consistency score of 0.72, ranking 14th on the leaderboard. Our thorough error analysis revealed that our model has a tendency to take shortcuts and rely on simple heuristics, especially when dealing with semantic-preserving changes.
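Section-tailored prompting of the kind described above can be sketched as a small template registry keyed by CTR section. The section names follow the task's clinical trial report structure, but the wording of the templates here is illustrative, not the team's actual prompts.

```python
# Hypothetical per-section prompt templates; the real system's wording differs.
TEMPLATES = {
    "Eligibility": (
        "Below are the eligibility criteria of a clinical trial.\n"
        "{ctr}\nStatement: {stmt}\nAnswer: Entailment or Contradiction?"
    ),
    "Results": (
        "Below are the reported results of a clinical trial.\n"
        "{ctr}\nStatement: {stmt}\nAnswer: Entailment or Contradiction?"
    ),
}

def build_prompt(section, ctr_text, statement):
    """Pick the template for this CTR section and fill in the inputs."""
    return TEMPLATES[section].format(ctr=ctr_text, stmt=statement)

prompt = build_prompt(
    "Eligibility",
    "Patients must be over 18 years of age.",
    "The trial includes minors.",
)
print(prompt)
```

The benefit of per-section templates is that each prompt can foreground what matters for that section (criteria lists versus outcome tables) instead of one generic instruction.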
2023
Automatic Extraction of the Romanian Academic Word List: Data and Methods
Ana-Maria Bucur | Andreea Dincă | Madalina Chitez | Roxana Rogobete
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
This paper presents the methodology and data used for the automatic extraction of the Romanian Academic Word List (Ro-AWL). Academic Word Lists are useful in both L2 and L1 teaching contexts. For the Romanian language, no such resource exists so far. Ro-AWL has been generated by combining methods from corpus and computational linguistics with L2 academic writing approaches. We use two types of data: (a) existing data, such as the Romanian Frequency List based on the ROMBAC corpus, and (b) self-compiled data, such as the expert academic writing corpus EXPRES. For constructing the academic word list, we follow the methodology for building the Academic Vocabulary List for the English language. The distribution of Ro-AWL features (general distribution, POS distribution) into four disciplinary datasets is in line with previous research. Ro-AWL is freely available and can be used for teaching, research and NLP applications.
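The core filtering step behind academic word lists is a frequency comparison: keep words markedly more frequent in academic text than in a general frequency list. The toy Romanian corpora and the cutoff ratio below are illustrative; the actual Ro-AWL pipeline also uses dispersion across disciplines and other criteria.

```python
from collections import Counter

# Toy token lists standing in for an academic corpus and a general corpus.
academic = "analiza rezultate metoda analiza date ipoteza metoda".split()
general = "casa masa analiza zi casa drum zi".split()

acad_freq = Counter(academic)
gen_freq = Counter(general)

def is_academic(word, ratio=2.0):
    """Keep words at least `ratio` times more frequent in academic text."""
    a = acad_freq[word] / len(academic)
    g = (gen_freq[word] + 1) / (len(general) + 1)  # smooth zero counts
    return a / g >= ratio

awl = sorted(w for w in acad_freq if is_academic(w))
print(awl)  # → ['metoda']
```

Words like "analiza" that are common in both corpora are filtered out, which is exactly how general high-frequency vocabulary is kept off an academic word list.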
2022
Capturing Changes in Mood Over Time in Longitudinal Data Using Ensemble Methodologies
Ana-Maria Bucur | Hyewon Jang | Farhana Ferdousi Liza
Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology
This paper presents the system description of team BLUE for Task A of the CLPsych 2022 Shared Task on identifying changes in mood and behaviour in longitudinal textual data. These moments of change are signals that can be used to screen and prevent suicide attempts. To detect these changes, we experimented with several text representation methods, such as TF-IDF, sentence embeddings, emotion-informed embeddings and several classical machine learning classifiers. We chose to submit three runs of ensemble systems based on maximum voting on the predictions from the best performing models. Of the nine participating teams in Task A, our team ranked second in the Precision-oriented Coverage-based Evaluation, with a score of 0.499. Our best system was an ensemble of Support Vector Machine, Logistic Regression, and Adaptive Boosting classifiers using emotion-informed embeddings as input representation that can model both the linguistic and emotional information found in users' posts.
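The maximum-voting ensemble described above reduces to a per-example majority over each base classifier's prediction. The label names and toy predictions below are hypothetical, not the shared-task label set.

```python
from collections import Counter

def majority_vote(*prediction_lists):
    """Hard-voting ensemble: for each example, keep the most common label."""
    return [Counter(preds).most_common(1)[0][0] for preds in zip(*prediction_lists)]

# Hypothetical per-classifier predictions (SVM, LR, AdaBoost) on four posts.
svm_preds = ["switch", "none", "escalation", "none"]
lr_preds  = ["switch", "switch", "escalation", "none"]
ada_preds = ["none",   "none",  "escalation", "switch"]

print(majority_vote(svm_preds, lr_preds, ada_preds))
# → ['switch', 'none', 'escalation', 'none']
```

Hard voting like this only needs the discrete predictions, so heterogeneous classifiers can be combined without calibrating their probability outputs.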
Users Hate Blondes: Detecting Sexism in User Comments on Online Romanian News
Andreea Moldovan | Karla Csürös | Ana-maria Bucur | Loredana Bercuci
Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)
Romania ranks almost last in Europe when it comes to gender equality in political representation, with about 10% fewer women in politics than the E.U. average. We proceed from the assumption that this underrepresentation is also influenced by the sexism and verbal abuse female politicians face in the public sphere, especially in online media. We collect a novel dataset with sexist comments in the Romanian language from newspaper articles about Romanian female politicians and propose baseline models using classical machine learning models and fine-tuned pretrained transformer models for the classification of sexist language in the online medium.
Life is not Always Depressing: Exploring the Happy Moments of People Diagnosed with Depression
Ana-Maria Bucur | Adrian Cosma | Liviu P. Dinu
Proceedings of the Thirteenth Language Resources and Evaluation Conference
In this work, we explore the relationship between depression and manifestations of happiness in social media. While the majority of works surrounding depression focus on symptoms, psychological research shows that there is a strong link between seeking happiness and being diagnosed with depression. We make use of the Positive-Unlabeled learning paradigm to automatically extract happy moments from social media posts of both controls and users diagnosed with depression, and qualitatively analyze them with linguistic tools such as LIWC and keyness information. We show that the life of depressed individuals is not always bleak, with positive events related to friends and family being more noteworthy in their lives compared to the more mundane happy events reported by control users.
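Positive-Unlabeled learning fits the setting above because only some happy moments are labeled, while the rest of the posts are an unlabeled mix. A minimal sketch of one classic PU heuristic (the Elkan-Noto correction, which is not necessarily the exact method used in the paper) on synthetic 2-D features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy features: true positives cluster around +2, true negatives around -2.
pos = rng.normal(2.0, 1.0, size=(50, 2))
neg = rng.normal(-2.0, 1.0, size=(50, 2))

# Only some positives are labeled; the rest hide in the unlabeled pool.
labeled_pos = pos[:20]
unlabeled = np.vstack([pos[20:], neg])

X = np.vstack([labeled_pos, unlabeled])
s = np.array([1] * len(labeled_pos) + [0] * len(unlabeled))  # labeled vs not

# Step 1: train a classifier to predict "is labeled".
clf = LogisticRegression().fit(X, s)
# Step 2: estimate c = P(labeled | positive) on the labeled positives.
c = clf.predict_proba(labeled_pos)[:, 1].mean()
# Step 3: rescale to recover P(positive) for the unlabeled examples.
p_pos = clf.predict_proba(unlabeled)[:, 1] / c
recovered = p_pos > 0.5

# First 30 unlabeled rows are hidden positives, the rest true negatives.
print(recovered[:30].mean(), recovered[30:].mean())
```

The rescaling by `c` is what lets a labeled-vs-unlabeled classifier stand in for a positive-vs-negative one, recovering most hidden positives without any negative labels.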
EXPRES Corpus for A Field-specific Automated Exploratory Study of L2 English Expert Scientific Writing
Ana-Maria Bucur | Madalina Chitez | Valentina Muresan | Andreea Dinca | Roxana Rogobete
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Field Specific Expert Scientific Writing in English as a Lingua Franca is essential for effective research networking and dissemination worldwide. Extracting the linguistic profile of research articles written in L2 English can help young researchers and expert scholars in various disciplines adapt to the scientific writing norms of their communities of practice. In this exploratory study, we present and test an automated linguistic assessment model that includes features relevant for the cross-disciplinary second language framework: Text Complexity Analysis features, such as Syntactic and Lexical Complexity, and Field Specific Academic Word Lists. We analyse how these features vary across four disciplinary fields (Economics, IT, Linguistics and Political Science) in a corpus of L2-English Expert Scientific Writing, part of the EXPRES corpus (Corpus of Expert Writing in Romanian and English). The variation in field specific writing is also analysed in groups of linguistic features extracted from the higher visibility (Hv) versus lower visibility (Lv) journals. After applying lexical sophistication, lexical variation and syntactic complexity formulae, significant differences between disciplines were identified, mainly that research articles from Lv journals have higher lexical complexity, but lower syntactic complexity than articles from Hv journals, while academic vocabulary proved to have discipline-specific variation.
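Two of the simpler indices in this family of complexity formulae can be sketched directly: lexical variation as a type-token ratio, and syntactic complexity as mean sentence length. These are crude toy proxies; the study applies fuller lexical sophistication and syntactic complexity formulae.

```python
import re

def type_token_ratio(text):
    """Lexical variation: distinct word forms over total word tokens."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return len(set(tokens)) / len(tokens)

def mean_sentence_length(text):
    """A crude syntactic-complexity proxy: mean words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)

dense = "Heteroskedasticity complicates inference. Robust estimators mitigate bias."
plain = "The test was done. The test was done again. It worked."

print(type_token_ratio(dense) > type_token_ratio(plain))  # → True
print(mean_sentence_length(dense), mean_sentence_length(plain))
```

Note that the two toy texts can rank differently on the two indices, mirroring the study's finding that lexical and syntactic complexity need not move together across journal groups.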
2021
Natural language processing as a tool to identify the Reddit particularities of cancer survivors around the time of diagnosis and remission: A pilot study
Ioana R. Podină | Ana-Maria Bucur | Diana Todea | Liviu Fodor | Andreea Luca | Liviu P. Dinu | Rareș Boian
Proceedings of the Fifth Workshop on Widening Natural Language Processing
In the current study, we analyzed 15297 texts from 39 cancer survivors who posted or commented on Reddit in order to detect the language particularities of cancer survivors from online discourse. We performed a computational linguistic analysis (part-of-speech analysis, emoji detection, sentiment analysis) on submissions around the time of the cancer diagnosis and around the time of remission. We found several significant differences in the texts posted around the time of remission compared to those around the time of diagnosis. Though our results need to be backed up by a larger corpus of data, they do point to the fact that cancer survivors, around the time of remission, focus more on others, are more active on social media, and do not see the glass as half empty as suggested by the valence of the emojis.
A Psychologically Informed Part-of-Speech Analysis of Depression in Social Media
Ana-Maria Bucur | Ioana R. Podina | Liviu P. Dinu
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
In this work, we provide an extensive part-of-speech analysis of the discourse of social media users with depression. Research in psychology revealed that depressed users tend to be self-focused, more preoccupied with themselves and ruminate more about their lives and emotions. Our work aims to make use of large-scale datasets and computational methods for a quantitative exploration of discourse. We use the publicly available depression dataset from the Early Risk Prediction on the Internet Workshop (eRisk) 2018 and extract part-of-speech features and several indices based on them. Our results reveal statistically significant differences between the depressed and non-depressed individuals confirming findings from the existing psychology literature. Our work provides insights regarding the way in which depressed individuals are expressing themselves on social media platforms, allowing for better-informed computational models to help monitor and prevent mental illnesses.
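The self-focus finding above can be illustrated with a toy index: the rate of first-person singular pronouns per token. This is a simplified stand-in for the full part-of-speech feature set extracted in the paper, and the two example posts are invented.

```python
import re

FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

def self_focus_rate(text):
    """Fraction of word tokens that are first-person singular pronouns."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(t in FIRST_PERSON for t in tokens) / max(len(tokens), 1)

post_a = "I keep thinking about my mistakes and I cannot forgive myself"
post_b = "The match last night was great, what a comeback by the team"

print(self_focus_rate(post_a) > self_focus_rate(post_b))  # → True
```

Indices of this kind are what make the psychology finding testable at scale: a single interpretable number per post that can be compared between groups with standard statistics.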
Sequence-to-Sequence Lexical Normalization with Multilingual Transformers
Ana-Maria Bucur | Adrian Cosma | Liviu P. Dinu
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
Current benchmark tasks for natural language processing contain text that is qualitatively different from the text used in informal day to day digital communication. This discrepancy has led to severe performance degradation of state-of-the-art NLP models when fine-tuned on real-world data. One way to resolve this issue is through lexical normalization, which is the process of transforming non-standard text, usually from social media, into a more standardized form. In this work, we propose a sentence-level sequence-to-sequence model based on mBART, which frames the problem as a machine translation problem. As noisy text is a pervasive problem across languages, not just English, we leverage the multi-lingual pre-training of mBART to fine-tune it to our data. While current approaches mainly operate at the word or subword level, we argue that our approach is straightforward from a technical standpoint and builds upon existing pre-trained transformer networks. Our results show that while word-level intrinsic performance evaluation lags behind other methods, our model improves performance on extrinsic downstream tasks through normalization compared to models operating on raw, unprocessed social media text.
An Exploratory Analysis of the Relation between Offensive Language and Mental Health
Ana-Maria Bucur | Marcos Zampieri | Liviu P. Dinu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021