Abstract
User demographic inference from social media text has the potential to improve a range of downstream applications, such as real-time passive polling and quantifying demographic bias. This study focuses on developing models for user-level race and ethnicity prediction. We introduce a data set of users who self-report their race/ethnicity through a survey, in contrast to previous approaches that use distantly supervised data or perceived labels. We develop text-based predictive models that accurately predict a user's membership in the four largest racial and ethnic groups, with up to .884 AUC, and make these models available to the research community.
- Anthology ID:
- C18-1130
- Volume:
- Proceedings of the 27th International Conference on Computational Linguistics
- Month:
- August
- Year:
- 2018
- Address:
- Santa Fe, New Mexico, USA
- Editors:
- Emily M. Bender, Leon Derczynski, Pierre Isabelle
- Venue:
- COLING
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1534–1545
- URL:
- https://aclanthology.org/C18-1130
- Cite (ACL):
- Daniel Preoţiuc-Pietro and Lyle Ungar. 2018. User-Level Race and Ethnicity Predictors from Twitter Text. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1534–1545, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Cite (Informal):
- User-Level Race and Ethnicity Predictors from Twitter Text (Preoţiuc-Pietro & Ungar, COLING 2018)
- PDF:
- https://preview.aclanthology.org/naacl24-info/C18-1130.pdf