User-Level Race and Ethnicity Predictors from Twitter Text

Daniel Preoţiuc-Pietro, Lyle Ungar


Abstract
User demographic inference from social media text has the potential to improve a range of downstream applications, including real-time passive polling or quantifying demographic bias. This study focuses on developing models for user-level race and ethnicity prediction. We introduce a data set of users who self-report their race/ethnicity through a survey, in contrast to previous approaches that use distantly supervised data or perceived labels. We develop predictive models from text which accurately predict a user's membership in the four largest racial and ethnic groups with up to 0.884 AUC, and we make these models available to the research community.
Anthology ID:
C18-1130
Volume:
Proceedings of the 27th International Conference on Computational Linguistics
Month:
August
Year:
2018
Address:
Santa Fe, New Mexico, USA
Editors:
Emily M. Bender, Leon Derczynski, Pierre Isabelle
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
1534–1545
URL:
https://aclanthology.org/C18-1130
Cite (ACL):
Daniel Preoţiuc-Pietro and Lyle Ungar. 2018. User-Level Race and Ethnicity Predictors from Twitter Text. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1534–1545, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal):
User-Level Race and Ethnicity Predictors from Twitter Text (Preoţiuc-Pietro & Ungar, COLING 2018)
PDF:
https://preview.aclanthology.org/naacl24-info/C18-1130.pdf