Wikidata as a Source of Demographic Information

Samir Abdaljalil; Hamdy Mubarak

doi:10.18653/v1/2024.arabicnlp-1.1

Abstract

Names carry important information about our identities and demographics, such as gender, nationality, and ethnicity. We investigate the use of an individual’s name, in both Arabic and English, to predict important attributes, namely country, region, gender, and language. We extract data from Wikidata and normalize it to build a comprehensive dataset consisting of more than 1 million entities and their normalized attributes. We experiment with a Linear SVM approach, as well as two Transformer-based approaches: BERT model fine-tuning and a Transformers pipeline. Our results indicate that we can predict gender, language, and region using the name only with a confidence over 0.65. The country attribute can be predicted with less accuracy. The Linear SVM approach outperforms the other approaches for all the attributes. The best-performing approach was also evaluated on another dataset that consists of 1,500 names from 15 countries (covering different regions) extracted from Twitter, and yielded similar results.
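
As a rough illustration of the name-only Linear SVM setup mentioned in the abstract, the following is a minimal sketch using scikit-learn. The abstract does not state which features or hyperparameters were used, so the character n-gram TF-IDF features, the `build_name_classifier` helper, and the toy names and labels below are all assumptions made for illustration; none of them come from the paper or its dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, hand-written examples for illustration only (NOT from the paper's dataset).
names = [
    "Fatima Hassan", "Maria Lopez", "Aisha Khan", "Sara Ali",
    "Ahmed Hassan", "Omar Khan", "Carlos Lopez", "Yusuf Ali",
]
genders = ["female"] * 4 + ["male"] * 4


def build_name_classifier():
    """Hypothetical helper: character n-gram TF-IDF features fed to a linear SVM.

    The feature choice (char_wb n-grams of length 1-3) is an assumption;
    the abstract only says that a Linear SVM was applied to names.
    """
    return make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3), lowercase=True),
        LinearSVC(),
    )


model = build_name_classifier()
model.fit(names, genders)

# Predict an attribute (here, gender) from the name string alone.
preds = model.predict(["Fatima Ali", "Omar Lopez"])
print(list(preds))
```

The same pipeline shape could in principle be refit with country, region, or language labels in place of `genders`; a dataset this small says nothing about the accuracy reported in the paper.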

Anthology ID: 2024.arabicnlp-1.1
Volume: Proceedings of the Second Arabic Natural Language Processing Conference
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Nizar Habash, Houda Bouamor, Ramy Eskander, Nadi Tomeh, Ibrahim Abu Farha, Ahmed Abdelali, Samia Touileb, Injy Hamed, Yaser Onaizan, Bashar Alhafni, Wissam Antoun, Salam Khalifa, Hatem Haddad, Imed Zitouni, Badr AlKhamissi, Rawan Almatham, Khalil Mrini
Venues: ArabicNLP | WS
SIG: SIGARAB
Publisher: Association for Computational Linguistics
Pages: 1–10
URL: https://preview.aclanthology.org/ingest-emnlp/2024.arabicnlp-1.1/
DOI: 10.18653/v1/2024.arabicnlp-1.1
Cite (ACL): Samir Abdaljalil and Hamdy Mubarak. 2024. Wikidata as a Source of Demographic Information. In Proceedings of the Second Arabic Natural Language Processing Conference, pages 1–10, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Wikidata as a Source of Demographic Information (Abdaljalil & Mubarak, ArabicNLP 2024)
PDF: https://preview.aclanthology.org/ingest-emnlp/2024.arabicnlp-1.1.pdf