A Semi-Supervised Approach for Gender Identification

Juan Soler, Leo Wanner


Abstract
Most research on Author Profiling trains models on large quantities of correctly labeled data. However, this does not reflect the reality of forensic scenarios: in practical forensic linguistic investigations, the resources available to profile the author of a text are usually scarce. To account for this, we implemented a Semi-Supervised Learning variant of the k-nearest neighbors (KNN) algorithm that uses small sets of labeled data and a larger amount of unlabeled data to classify the authors of texts by gender (man vs. woman). We describe the enriched KNN algorithm and show that the use of unlabeled instances improves the accuracy of our gender identification model. We also present a feature set that facilitates the use of a very small number of instances, reaching accuracies higher than 70% with only 113 training instances. Finally, we show that the algorithm also performs well on publicly available data.
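The general idea of letting unlabeled instances support a KNN classifier can be illustrated with a minimal self-training sketch. This is not the enriched KNN algorithm described in the paper, whose details are not given in the abstract; it is a generic self-training loop shown only as an assumption-laden illustration. The function names (`knn_vote`, `self_training_knn`), the `threshold` and `max_rounds` parameters, the class names "A" and "B", and the toy 2-D feature vectors are all hypothetical.

```python
import math
from collections import Counter


def knn_vote(x, labeled, k):
    """Return (majority_label, vote_share) among the k nearest labeled points."""
    nearest = sorted(labeled, key=lambda item: math.dist(x, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)


def self_training_knn(labeled, unlabeled, k=3, threshold=1.0, max_rounds=10):
    """Grow the labeled set with confidently pseudo-labeled instances.

    Hypothetical self-training scheme: an unlabeled point is accepted only
    when the share of its k nearest labeled neighbors that agree on one
    label reaches `threshold`.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        accepted, remaining = [], []
        for x in pool:
            label, share = knn_vote(x, labeled, k)
            if share >= threshold:
                accepted.append((x, label))
            else:
                remaining.append(x)
        if not accepted:
            break  # nothing confident enough; stop early
        labeled.extend(accepted)
        pool = remaining
    return labeled


# Toy data: two well-separated clusters standing in for two author classes.
labeled = [((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((1.0, 0.0), "A"),
           ((5.0, 5.0), "B"), ((5.0, 6.0), "B"), ((6.0, 5.0), "B")]
unlabeled = [(0.5, 0.5), (2.0, 2.0), (4.5, 5.5)]

enlarged = self_training_knn(labeled, unlabeled, k=3)
prediction, _ = knn_vote((1.5, 1.5), enlarged, k=3)
```

Here `enlarged` contains the original labeled points plus any pseudo-labeled ones, and a new text's feature vector is classified with an ordinary KNN vote over that enlarged set. A strict `threshold` keeps noisy pseudo-labels out at the cost of using fewer unlabeled instances.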
Anthology ID:
L16-1204
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1282–1287
URL:
https://aclanthology.org/L16-1204
Cite (ACL):
Juan Soler and Leo Wanner. 2016. A Semi-Supervised Approach for Gender Identification. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1282–1287, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
A Semi-Supervised Approach for Gender Identification (Soler & Wanner, LREC 2016)
PDF:
https://preview.aclanthology.org/ml4al-ingestion/L16-1204.pdf