In this paper, we present a set of experiments and analyses on predicting the gender of Twitter users based on language-independent features extracted either from the text or the metadata of users’ tweets. We perform our experiments on the TwiSty dataset, which contains manual gender annotations for users speaking six different languages. Our classification results show that, while the prediction model based on language-independent features performs worse than the bag-of-words model when training and testing on the same language, it regularly outperforms the bag-of-words model when applied to different languages, showing very stable results across the various languages. Finally, we perform a comparative analysis of feature effect sizes across the six languages and show that differences in our features correspond to cultural distances.
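To make the contrast above concrete, here is a minimal sketch that sets a bag-of-words baseline alongside a classifier over simple language-independent features; the specific features, models, and variable names are illustrative assumptions, not the authors' setup.

```python
# Minimal sketch (not the authors' code): contrasting a bag-of-words model
# with simple language-independent features for gender prediction.
# The feature choices below are illustrative assumptions, not the paper's exact set.
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def language_independent_features(texts):
    """Features that avoid lexical content: lengths, punctuation, hashtags,
    mentions, URLs (assumed examples of language-independent features)."""
    rows = []
    for text in texts:
        rows.append([
            len(text),                                            # character length
            text.count('#'),                                      # hashtag count
            text.count('@'),                                      # mention count
            len(re.findall(r'https?://\S+', text)),               # URL count
            sum(c in '!?.,' for c in text) / max(len(text), 1),   # punctuation rate
        ])
    return np.array(rows)

# Bag-of-words baseline (language-dependent)
bow_model = make_pipeline(CountVectorizer(max_features=10000),
                          LogisticRegression(max_iter=1000))

# Language-independent model: plain logistic regression over the feature matrix
li_model = LogisticRegression(max_iter=1000)

# Cross-lingual evaluation idea: fit on one language, test on another
# (hypothetical variable names for the train/test splits):
# bow_model.fit(train_texts_es, train_labels_es)
# li_model.fit(language_independent_features(train_texts_es), train_labels_es)
# li_model.score(language_independent_features(test_texts_fr), test_labels_fr)
```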
Sexism is prevalent in today’s society, both offline and online, and poses a credible threat to social equality with respect to gender. According to ambivalent sexism theory (Glick and Fiske, 1996), it comes in two forms: Hostile and Benevolent. While hostile sexism is characterized by an explicitly negative attitude, benevolent sexism is more subtle. Previous work on computationally detecting sexism online has been restricted to identifying the hostile form. Our objective is to investigate the less pronounced form of sexism demonstrated online. We achieve this by creating and analyzing a dataset of tweets that exhibit benevolent sexism. Using Support Vector Machines (SVMs), sequence-to-sequence models, and the FastText classifier, we classify tweets into the ‘Hostile’, ‘Benevolent’, or ‘Others’ class, depending on the kind of sexism they exhibit. We achieve an F1-score of 87.22% using the FastText classifier. Our work helps analyze and understand the ambivalent sexism that is prevalent in social media.
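The following sketch shows how such a three-way classifier could be trained with fastText's supervised mode; the file names, example sentence, and hyperparameters are placeholders, not the authors' settings.

```python
# Minimal sketch of a three-way tweet classifier using fastText's supervised mode.
# File names and hyperparameters are placeholders, not the authors' configuration.
import fasttext

# train.txt holds one tweet per line, prefixed with __label__hostile,
# __label__benevolent, or __label__others.
model = fasttext.train_supervised(input="train.txt", epoch=25, wordNgrams=2)

# Predict the class of a single (made-up) tweet
labels, probs = model.predict("she is too delicate for that kind of work")
print(labels, probs)

# Evaluate precision and recall on a held-out file in the same label format
n_examples, precision, recall = model.test("test.txt")
```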
Personality plays a decisive role in how people behave in different scenarios, including online social media. Researchers have used such data to study how personality can be predicted from language use. In this paper, we study phrase choice as a particular stylistic linguistic difference, as opposed to the mostly topical differences identified previously. Building on previous work on demographic preferences, we quantify differences in paraphrase choice from a massive Facebook data set with posts from over 115,000 users. We quantify the predictive power of phrase choice in user profiling and use phrase choice to study psycholinguistic hypotheses. This work is relevant to future applications that aim to personalize text generation to specific personality types.
Vector embeddings of words have been shown to encode meaningful semantic relationships that enable the solving of complex analogies. This vector embedding concept has been extended successfully to many other domains. In this paper, we both create and visualize vector representations of an unstructured collection of online communities based on user participation. Further, we show quantitatively and qualitatively that these representations allow semantically meaningful community analogies to be solved, along with other, more general types of relationships. These results could help improve community recommendation engines and also serve as a tool for sociological studies of community relatedness.
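A minimal sketch of the general idea, assuming each user's participation history is treated as a "sentence" over community names (the communities shown are made up): a word2vec-style model over such sequences then supports analogy queries via vector arithmetic.

```python
# Minimal sketch (assumed setup, not the authors' pipeline): train word2vec-style
# embeddings over user participation "sentences" and solve community analogies.
from gensim.models import Word2Vec

# Each inner list is the set of communities one (fictional) user participated in.
user_histories = [
    ["python", "learnprogramming", "datascience"],
    ["nba", "chicagobulls", "sports"],
    ["nfl", "chicagobears", "sports"],
    # ... many more users in practice
]

model = Word2Vec(sentences=user_histories, vector_size=100, window=5,
                 min_count=1, sg=1)

# Analogy query: "chicagobulls" - "nba" + "nfl" should land near a football community
print(model.wv.most_similar(positive=["chicagobulls", "nfl"],
                            negative=["nba"], topn=3))
```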
In this paper, we evaluate the predictability of tweets associated with controversial versus non-controversial topics. As a first step, we crowd-sourced the scoring of a predefined set of topics on a Likert scale from non-controversial to controversial. Our feature set includes and goes beyond sentiment features, e.g., by leveraging emphatic language and other features that have been used in previous work but are new to this particular task. We find focusing on the structural characteristics of tweets to be beneficial for this task. Using a combination of emphatic, language-specific, and Twitter-specific features for supervised learning resulted in 87% accuracy (F1) under cross-validation on the training set and 63.4% accuracy on the test set. Our analysis shows that features specific to Twitter, or to social media in general, are more prevalent in tweets on controversial topics than in non-controversial ones. To test the premise of the paper, we conducted two additional sets of experiments, which led to mixed results. This finding will inform our future investigations into the relationship between language use on social media and the perceived controversiality of topics.
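A minimal sketch of this style of evaluation, with random matrices standing in for the real emphatic, language-specific, and Twitter-specific feature blocks; nothing here reproduces the paper's actual features, classifier, or numbers.

```python
# Minimal sketch (assumed, not the authors' code): concatenate three feature
# blocks per tweet and cross-validate a single classifier. The random matrices
# below are stand-ins for real feature extraction.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_tweets = 200
emphatic_feats = rng.random((n_tweets, 5))    # e.g., exclamation/intensifier rates
language_feats = rng.random((n_tweets, 10))   # e.g., sentiment and lexical scores
twitter_feats = rng.random((n_tweets, 4))     # e.g., hashtag, mention, URL counts
y = rng.integers(0, 2, n_tweets)              # 1 = controversial topic, 0 = not

X = np.hstack([emphatic_feats, language_feats, twitter_feats])
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="f1", cv=10)
print(scores.mean())
```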
In this paper, we propose an approach for cross-lingual topical coding of sentences from electoral manifestos of political parties in different languages. To this end, we exploit continuous semantic text representations and induce a joint multilingual semantic vector space to enable supervised learning using manually coded sentences across different languages. Our experimental results show that classifiers trained on multilingual data yield performance boosts over monolingual topic classification.
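A minimal sketch of one way such a joint space can be used, assuming word vectors for all languages have already been aligned into a single shared space: sentences are embedded by averaging their word vectors and fed to a linear classifier trained on coded sentences from any mix of languages.

```python
# Minimal sketch under the assumption that word vectors for all languages are
# already aligned in one shared space (e.g., via a bilingual dictionary mapping).
# Sentences are embedded by averaging; a linear model is trained on coded sentences.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sentence_vector(tokens, aligned_vectors, dim=300):
    """Average the aligned word vectors of the tokens; zero vector if none are found."""
    vecs = [aligned_vectors[t] for t in tokens if t in aligned_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# aligned_vectors: dict mapping token -> np.ndarray, shared across languages (assumed given)
# X = np.vstack([sentence_vector(s, aligned_vectors) for s in coded_sentences])
# clf = LogisticRegression(max_iter=1000).fit(X, topic_codes)
# The fitted classifier can then code manifesto sentences in a language absent from training.
```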
Research in the social sciences is usually based on survey data, where individual research questions relate to observable concepts (variables). However, due to a lack of standards for data citation, reliably identifying the variables used is often difficult. In this paper, we present a work-in-progress study that seeks to provide a solution to the variable detection task based on supervised machine learning algorithms, using a linguistic analysis pipeline to extract a rich feature set, including terminological concepts and similarity metric scores. Further, we present preliminary results on a small dataset that has been specifically designed for this task, yielding a significant increase in performance over the random baseline.
There has been a long-standing interest in understanding ‘Social Influence’, both in the social sciences and in computational linguistics. In this paper, we present a novel approach to studying and measuring interpersonal influence in daily interactions. Motivated by the basic principles of influence, we attempt to identify indicative linguistic features of posts in an online knitting community. We present the scheme used to operationalize and label the posts as influential or non-influential. Experiments with the identified features show a 3.15% improvement in influence classification accuracy. Our results illustrate the important correlation between the structure of language and its potential to influence others.
Previous work on classifying Twitter users’ political alignment has mainly focused on lexical and social network features. This study provides evidence that political affiliation is also reflected in features that have previously been overlooked: users’ discourse patterns (the proportion of tweets that are retweets or replies) and their rates of capitalization and punctuation use. We find robust differences between politically left- and right-leaning communities with respect to these discourse and sub-lexical features, although they are not sufficient to train a high-accuracy classifier.
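A minimal sketch of how such discourse and sub-lexical features could be computed per user; the input layout (a list of tweet dicts) and the exact definitions (e.g., detecting retweets by the "RT @" prefix) are simplifying assumptions, not the paper's operationalization.

```python
# Minimal sketch (assumed definitions): per-user proportion of retweets and
# replies, plus capitalization and punctuation rates over all of a user's tweets.
import string

def user_features(tweets):
    """tweets: list of dicts with a 'text' field (assumed input format)."""
    n = max(len(tweets), 1)
    retweets = sum(t["text"].startswith("RT @") for t in tweets)
    replies = sum(t["text"].startswith("@") for t in tweets)
    chars = "".join(t["text"] for t in tweets)
    caps_rate = sum(c.isupper() for c in chars) / max(len(chars), 1)
    punct_rate = sum(c in string.punctuation for c in chars) / max(len(chars), 1)
    return {
        "retweet_prop": retweets / n,
        "reply_prop": replies / n,
        "caps_rate": caps_rate,
        "punct_rate": punct_rate,
    }
```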
We explore a novel computational approach for analyzing member participation in small-group social sequences. Using a complex state representation combining information about dialogue act types, sentiment expression, and participant roles, we explore which sequence states are associated with high levels of member participation. Using a Markov Rewards framework, we associate particular states with immediate positive and negative rewards and employ a Value Iteration algorithm to calculate the expected value of all states. In our findings, we focus on discourse states belonging to team leaders and project managers that are either very likely or very unlikely to lead to participation from the rest of the group members.
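A minimal sketch of the value computation described, with made-up states, transition probabilities, and rewards: value iteration propagates the immediate rewards through the transition matrix to yield each state's expected long-term value.

```python
# Minimal sketch of value iteration over a Markov reward process. States,
# transition probabilities, and rewards below are illustrative placeholders,
# not the discourse states or rewards used in the paper.
import numpy as np

states = ["leader_question", "leader_statement", "member_reply"]
P = np.array([          # P[i, j] = probability of moving from state i to state j
    [0.1, 0.2, 0.7],
    [0.2, 0.5, 0.3],
    [0.3, 0.4, 0.3],
])
R = np.array([1.0, -0.2, 0.5])   # immediate reward of being in each state
gamma = 0.9                      # discount factor

V = np.zeros(len(states))
for _ in range(1000):
    V_new = R + gamma * P @ V    # Bellman backup for a Markov reward process
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(dict(zip(states, V.round(3))))   # expected long-term value of each state
```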
Code-switching has been found to have social motivations in addition to syntactic constraints. In this work, we explore the social effect of code-switching in an online community. We present a task based on the Arabic Wikipedia that captures language choice, in this case code-switching between Arabic and other languages, as a predictor of social influence in collaborative editing. We find that code-switching is positively associated with Wikipedia editor success, particularly the borrowing of technical language on pages whose topics are less directly related to Arabic-speaking regions.
Demographically tagged social media messages are a common source of data for computational social science. While these messages can indicate differences in beliefs and behaviors between demographic groups, we do not have a clear understanding of how different demographic groups use platforms such as Twitter. This paper presents a preliminary analysis of how groups’ differing behaviors may confound analyses of the groups themselves. We analyze one million Twitter users by first inferring demographic attributes and then measuring several indicators of Twitter behavior. We find differences in these indicators across demographic groups, suggesting that there may be underlying differences in how different demographic groups use Twitter.
Politicians carefully word their statements in order to influence how others view an issue, a political strategy called framing. At the same time, these frames may also reveal the politician’s beliefs or positions on an issue. Simple language features such as unigrams, bigrams, and trigrams are important indicators for identifying the general frame of a text, both for longer congressional speeches and for politicians’ shorter tweets. However, tweets may contain multiple unigrams across different frames, which limits the effectiveness of this approach. In this paper, we present a joint model that uses both linguistic features of tweets and ideological phrase indicators extracted from a state-of-the-art embedding-based model to predict the general frame of political tweets.
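A minimal sketch of a joint representation in this spirit, assuming per-tweet scores from an embedding-based phrase model are already available as a matrix; the tweets, frame labels, and scores below are toy placeholders, not the paper's model or data.

```python
# Minimal sketch (assumed form, not the paper's exact model): concatenate n-gram
# counts with per-tweet scores from an ideological-phrase component, then train
# a single frame classifier over the joint representation.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

tweets = ["cut taxes and create jobs", "protect families and expand health care"]
frame_labels = [0, 1]                      # toy labels, e.g., 0 = economic, 1 = health
phrase_scores = np.array([[0.9, 0.1],      # toy per-tweet scores from an
                          [0.2, 0.8]])     # embedding-based phrase model

X_ngrams = CountVectorizer(ngram_range=(1, 3)).fit_transform(tweets)
X_joint = hstack([X_ngrams, csr_matrix(phrase_scores)])   # joint feature matrix

clf = LogisticRegression(max_iter=1000).fit(X_joint, frame_labels)
print(clf.predict(X_joint))
```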