How to Use less Features and Reach Better Performance in Author Gender Identification

Juan Soler Company, Leo Wanner


Abstract
In recent years, author profiling in general and author gender identification in particular have become popular research areas, owing to potentially attractive applications that range from forensic investigations to online marketing studies. However, nearly all state-of-the-art approaches still depend heavily on the datasets they were trained and tested on, since they draw heavily on content features, mostly a large number of recurrent words or word combinations extracted from the training sets. We show that, with a small number of features that mainly depend on the structure of the texts, we can outperform approaches that depend mainly on the content of the texts and use a huge number of features to identify whether the author of a text is a man or a woman. Our system has been tested on a dataset constructed for this work as well as on two datasets previously used in other papers.
Anthology ID:
L14-1030
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1315–1319
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/104_Paper.pdf
Cite (ACL):
Juan Soler Company and Leo Wanner. 2014. How to Use less Features and Reach Better Performance in Author Gender Identification. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1315–1319, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
How to Use less Features and Reach Better Performance in Author Gender Identification (Soler Company & Wanner, LREC 2014)