Abstract
Gender differences in language use have long been of interest in linguistics. The task of automatic gender attribution has been considered in computational linguistics as well. Most research of this type is done using (usually English) texts with authorship metadata. In this paper, we propose a new method of male/female corpus creation based on gender-specific first-person expressions. The method was applied to the CommonCrawl Web corpus for Polish (a language in which gender-revealing first-person expressions are particularly frequent) to yield a large (780M words) and varied collection of men’s and women’s texts. The whole procedure for building the corpus and filtering out unwanted texts is described in the present paper. A quality check on a random sample of the corpus confirmed that the majority (84%) of the texts are correctly attributed, natural texts. Some preliminary (socio)linguistic insights (websites and words frequently occurring in male/female fragments) are given as well.
- Anthology ID: L16-1648
- Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
- Month: May
- Year: 2016
- Address: Portorož, Slovenia
- Venue: LREC
- Publisher: European Language Resources Association (ELRA)
- Pages: 4105–4110
- URL: https://aclanthology.org/L16-1648
- Cite (ACL): Filip Graliński, Łukasz Borchmann, and Piotr Wierzchoń. 2016. “He Said She Said” ― a Male/Female Corpus of Polish. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4105–4110, Portorož, Slovenia. European Language Resources Association (ELRA).
- Cite (Informal): “He Said She Said” ― a Male/Female Corpus of Polish (Graliński et al., LREC 2016)
- PDF: https://preview.aclanthology.org/nodalida-main-page/L16-1648.pdf
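
The idea in the abstract, attributing a text to a male or female author from gender-specific first-person expressions, can be illustrated with a minimal sketch. It assumes only the general fact that Polish first-person singular past-tense verbs end in "-łem" for masculine and "-łam" for feminine speakers (e.g. "byłem" vs. "byłam", "I was"). The function name `guess_author_gender` and both regular expressions are hypothetical. They are not the expression list or the filtering procedure used in the paper.

```python
import re

# Assumption for illustration: count words ending in the first-person
# singular past-tense suffixes "-łem" (masculine) and "-łam" (feminine).
# The paper's actual expressions and filters are not reproduced here.
MASCULINE = re.compile(r"\b\w+łem\b", re.IGNORECASE)
FEMININE = re.compile(r"\b\w+łam\b", re.IGNORECASE)


def guess_author_gender(text: str) -> str:
    """Return 'male', 'female', or 'unknown' from suffix counts alone."""
    masculine_hits = len(MASCULINE.findall(text))
    feminine_hits = len(FEMININE.findall(text))
    if masculine_hits > feminine_hits:
        return "male"
    if feminine_hits > masculine_hits:
        return "female"
    # No evidence, or conflicting evidence of equal weight.
    return "unknown"


if __name__ == "__main__":
    print(guess_author_gender("Wczoraj byłam w kinie i widziałam film."))
    print(guess_author_gender("Byłem zmęczony, więc poszedłem spać."))
```

A suffix match this naive also counts unrelated words (for example the noun form "stołem", "with a table"), which is one reason a real pipeline would need the kind of filtering and quality check the abstract mentions.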