Language-independent Gender Prediction on Twitter

Nikola Ljubešić, Darja Fišer, Tomaž Erjavec


Abstract
In this paper we present a set of experiments and analyses on predicting the gender of Twitter users based on language-independent features extracted either from the text or the metadata of users’ tweets. We perform our experiments on the TwiSty dataset containing manual gender annotations for users speaking six different languages. Our classification results show that, while the prediction model based on language-independent features performs worse than the bag-of-words model when training and testing on the same language, it regularly outperforms the bag-of-words model when applied to different languages, showing very stable results across various languages. Finally, we perform a comparative analysis of feature effect sizes across the six languages and show that differences in our features correspond to cultural distances.
Anthology ID:
W17-2901
Volume:
Proceedings of the Second Workshop on NLP and Computational Social Science
Month:
August
Year:
2017
Address:
Vancouver, Canada
Venue:
NLP+CSS
Publisher:
Association for Computational Linguistics
Pages:
1–6
URL:
https://aclanthology.org/W17-2901
DOI:
10.18653/v1/W17-2901
Cite (ACL):
Nikola Ljubešić, Darja Fišer, and Tomaž Erjavec. 2017. Language-independent Gender Prediction on Twitter. In Proceedings of the Second Workshop on NLP and Computational Social Science, pages 1–6, Vancouver, Canada. Association for Computational Linguistics.
Cite (Informal):
Language-independent Gender Prediction on Twitter (Ljubešić et al., NLP+CSS 2017)
PDF:
https://preview.aclanthology.org/ingestion-script-update/W17-2901.pdf