Ankit Aich

2022

pdf abs
TweetTaglish: A Dataset for Investigating Tagalog-English Code-Switching
Megan Herrera | Ankit Aich | Natalie Parde
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Deploying recent natural language processing innovations to low-resource settings allows for state-of-the-art research findings and applications to be accessed across cultural and linguistic borders. One low-resource setting of increasing interest is code-switching, the phenomenon of combining, swapping, or alternating the use of two or more languages in continuous dialogue. In this paper, we introduce a large dataset (20k+ instances) to facilitate investigation of Tagalog-English code-switching, which has become a popular mode of discourse in Philippine culture. Tagalog is an Austronesian language and former official language of the Philippines spoken by over 23 million people worldwide, but it and Tagalog-English are under-represented in NLP research and practice. We describe our methods for data collection, as well as our labeling procedures. We analyze our resulting dataset, and finally conclude by providing results from a proof-of-concept regression task to establish dataset validity, achieving a strong performance benchmark (R2=0.797-0.909; RMSE=0.068-0.057).

pdf abs
Telling a Lie: Analyzing the Language of Information and Misinformation during Global Health Events
Ankit Aich | Natalie Parde
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The COVID-19 pandemic and other global health events are unfortunately excellent environments for the creation and spread of misinformation, and the language associated with health misinformation may be typified by unique patterns and linguistic markers. Allowing health misinformation to spread unchecked can have devastating ripple effects; however, detecting and stopping its spread requires careful analysis of these linguistic characteristics at scale. We analyze prior investigations focusing on health misinformation, associated datasets, and detection of misinformation during health crises. We also introduce a novel dataset designed for analyzing such phenomena, comprised of 2.8 million news articles and social media posts spanning the early 1900s to the present. Our annotation guidelines result in strong agreement between independent annotators. We describe our methods for collecting this data and follow this with a thorough analysis of the themes and linguistic features that appear in information versus misinformation. Finally, we demonstrate a proof-of-concept misinformation detection task to establish dataset validity, achieving a strong performance benchmark (accuracy = 75%; F1 = 0.7).

pdf abs
Are You Really Okay? A Transfer Learning-based Approach for Identification of Underlying Mental Illnesses
Ankit Aich | Natalie Parde
Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology

Evidence has demonstrated the presence of similarities in language use across people with various mental health conditions. In this work, we investigate these correlations both in terms of literature and as a data analysis problem. We also introduce a novel state-of-the-art transfer learning-based approach that learns from linguistic feature spaces of previous conditions and predicts unknown ones. Our model achieves strong performance, with F1 scores of 0.75, 0.80, and 0.76 at detecting depression, stress, and suicidal ideation in a first-of-its-kind transfer task and offering promising evidence that language models can harness learned patterns from known mental health conditions to aid in their prediction of others that may lie latent.

NLP offers a myriad of opportunities to support mental health research. However, prior work has almost exclusively focused on social media data, for which diagnoses are difficult or impossible to validate. We present a first-of-its-kind dataset of manually transcribed interactions with people clinically diagnosed with bipolar disorder and schizophrenia, as well as healthy controls. Data was collected through validated clinical tasks and paired with diagnostic measures. We extract 100+ temporal, sentiment, psycholinguistic, emotion, and lexical features from the data and establish classification validity using a variety of models to study language differences between diagnostic groups. Our models achieve strong classification performance (maximum F1=0.93-0.96), and lead to the discovery of interesting associations between linguistic features and diagnostic class. It is our hope that this dataset will offer high value to clinical and NLP researchers, with potential for widespread broader impacts.

pdf abs
Demystifying Neural Fake News via Linguistic Feature-Based Interpretation
Ankit Aich | Souvik Bhattacharya | Natalie Parde
Proceedings of the 29th International Conference on Computational Linguistics

The spread of fake news can have devastating ramifications, and recent advancements to neural fake news generators have made it challenging to understand how misinformation generated by these models may best be confronted. We conduct a feature-based study to gain an interpretative understanding of the linguistic attributes that neural fake news generators may most successfully exploit. When comparing models trained on subsets of our features and confronting the models with increasingly advanced neural fake news, we find that stylistic features may be the most robust. We discuss our findings, subsequent analyses, and broader implications in the pages within.

Co-authors

Philip Harvey 1

Colin Depp 1

Souvik Bhattacharya 1

Ankit Aich

2022

Co-authors

Venues