@inproceedings{abdaljalil-mubarak-2024-wikidata,
    title = "{W}ikidata as a Source of Demographic Information",
    author = "Abdaljalil, Samir  and
      Mubarak, Hamdy",
    editor = "Habash, Nizar  and
      Bouamor, Houda  and
      Eskander, Ramy  and
      Tomeh, Nadi  and
      Abu Farha, Ibrahim  and
      Abdelali, Ahmed  and
      Touileb, Samia  and
      Hamed, Injy  and
      Onaizan, Yaser  and
      Alhafni, Bashar  and
      Antoun, Wissam  and
      Khalifa, Salam  and
      Haddad, Hatem  and
      Zitouni, Imed  and
      AlKhamissi, Badr  and
      Almatham, Rawan  and
      Mrini, Khalil",
    booktitle = "Proceedings of the Second Arabic Natural Language Processing Conference",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2024.arabicnlp-1.1/",
    doi = "10.18653/v1/2024.arabicnlp-1.1",
    pages = "1--10",
    abstract = "Names carry important information about our identities and demographics such as gender, nationality, ethnicity, etc. We investigate the use of an individual{'}s name, in both Arabic and English, to predict important attributes, namely country, region, gender, and language. We extract data from Wikidata, and normalize it, to build a comprehensive dataset consisting of more than 1 million entities and their normalized attributes. We experiment with a Linear SVM approach, as well as two Transformers approaches consisting of BERT model fine-tuning and a Transformers pipeline. Our results indicate that we can predict the gender, language and region using the name only with a confidence over 0.65. The country attribute can be predicted with less accuracy. The Linear SVM approach outperforms the other approaches for all the attributes. The best performing approach was also evaluated on another dataset that consists of 1,500 names from 15 countries (covering different regions) extracted from Twitter, and yields similar results."
}