Gender Prediction for Chinese Social Media Data

Wen Li, Markus Dickinson


Abstract
Social media provides users with a platform to publish messages and socialize with others, and microblogs have gained more users than ever in recent years. With such usage, user profiling is a popular task in computational linguistics and text mining. Different approaches have been used to predict users’ gender, age, and other information, but most of this work has been done on English and other Western languages. The goal of this project is to predict the gender of users based on their posts on Weibo, a Chinese micro-blogging platform. Given issues in Chinese word segmentation, we explore character and word n-grams as features for this task, as well as character and word embeddings for classification. Because of how the data is extracted, we approach the task on a per-post basis, and we show the difficulties of the task for both humans and computers. Nonetheless, we present encouraging results and point to future improvements.
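To illustrate the character n-gram features mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. The function names (`char_ngrams`, `ngram_features`), the choice of unigrams and bigrams, and the sample post are all assumptions made for illustration. It shows why character n-grams sidestep word segmentation: they are read directly off the raw string of a single post.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text` (hypothetical helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_features(post, n_values=(1, 2)):
    """Count character n-grams of one post for each n in `n_values`.

    The choice of n = 1 and n = 2 is an assumption for this sketch,
    not a setting taken from the paper.
    """
    counts = Counter()
    for n in n_values:
        counts.update(char_ngrams(post, n))
    return counts


# A made-up example post; no segmenter is needed to extract these features.
features = ngram_features("今天天气好")
```

A count dictionary like `features` could then be passed to any bag-of-features classifier on a per-post basis; word n-grams would additionally require a Chinese word segmenter, which this sketch deliberately leaves out.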
Anthology ID:
R17-1058
Volume:
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017
Month:
September
Year:
2017
Address:
Varna, Bulgaria
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
438–445
URL:
https://doi.org/10.26615/978-954-452-049-6_058
DOI:
10.26615/978-954-452-049-6_058
Cite (ACL):
Wen Li and Markus Dickinson. 2017. Gender Prediction for Chinese Social Media Data. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 438–445, Varna, Bulgaria. INCOMA Ltd.
Cite (Informal):
Gender Prediction for Chinese Social Media Data (Li & Dickinson, RANLP 2017)
PDF:
https://doi.org/10.26615/978-954-452-049-6_058