Abstract
We present, to our knowledge, the results of the first gender classification experiments on Slovene text. Inspired by the TwiSty corpus and experiments (Verhoeven et al., 2016), we employed the Janes corpus (Erjavec et al., 2016) and its gender annotations to perform gender classification experiments on Twitter text, comparing a token-based and a lemma-based approach. We find that the token-based approach (92.6% accuracy), containing gender markings related to the author, outperforms the lemma-based approach by about 5%. Especially in the lemmatized version, we also observe stylistic and content-based differences in writing between men (e.g. more profane language, numerals and beer mentions) and women (e.g. more pronouns, emoticons and character flooding). Many of our findings corroborate previous research on other languages.
- Anthology ID:
- W17-1418
- Volume:
- Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
- Month:
- April
- Year:
- 2017
- Address:
- Valencia, Spain
- Editors:
- Tomaž Erjavec, Jakub Piskorski, Lidia Pivovarova, Jan Šnajder, Josef Steinberger, Roman Yangarber
- Venue:
- BSNLP
- SIG:
- SIGSLAV
- Publisher:
- Association for Computational Linguistics
- Pages:
- 119–125
- URL:
- https://aclanthology.org/W17-1418
- DOI:
- 10.18653/v1/W17-1418
- Cite (ACL):
- Ben Verhoeven, Iza Škrjanec, and Senja Pollak. 2017. Gender Profiling for Slovene Twitter communication: the Influence of Gender Marking, Content and Style. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, pages 119–125, Valencia, Spain. Association for Computational Linguistics.
- Cite (Informal):
- Gender Profiling for Slovene Twitter communication: the Influence of Gender Marking, Content and Style (Verhoeven et al., BSNLP 2017)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-1/W17-1418.pdf