Cross-domain Author Gender Classification in Brazilian Portuguese

Rafael Dias, Ivandré Paraboni


Abstract
Author profiling models predict demographic characteristics of a target author based on the text that they have written. Systems of this kind will often follow a single-domain approach, in which the model is trained on a corpus of labelled texts in a given domain and subsequently validated against a test corpus built from precisely the same domain. Although single-domain settings are arguably ideal, this strategy gives rise to the question of how to proceed when no suitable training corpus (i.e., a corpus that matches the test domain) is available. To shed light on this issue, this paper discusses a cross-domain gender classification task based on four domains (Facebook, crowd-sourced opinions, blogs and e-gov requests) in the Brazilian Portuguese language. A number of simple gender classification models using word- and psycholinguistics-based features alike are introduced, and their results are compared in two kinds of cross-domain setting: first, by making use of a single text source as training data for each task, and subsequently by combining multiple sources. Results confirm previous findings for English regarding the effects of corpus size and domain similarity, and pave the way for further studies in the field.
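
The two cross-domain settings described in the abstract (training on a single source domain, then on several combined) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the classifiers used, so a small bag-of-words Naive Bayes model stands in as an assumed word-based baseline, and psycholinguistic features are omitted. The four toy corpora, the helper names (`NaiveBayes`, `cross_domain_eval`) and the labels `"f"`/`"m"` are all hypothetical placeholders, not data or results from the paper.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercased whitespace tokenization; a real system would do more.
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes over word counts (assumed stand-in model)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict_one(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        # Sorted order makes tie-breaking deterministic.
        for label in sorted(self.class_counts):
            n_words = sum(self.word_counts[label].values())
            score = math.log(self.class_counts[label] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # words unseen in training are skipped
                    count = self.word_counts[label][word]
                    # Add-one (Laplace) smoothing.
                    score += math.log((count + 1) / (n_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, texts, labels):
    hits = sum(model.predict_one(t) == y for t, y in zip(texts, labels))
    return hits / len(labels)


def cross_domain_eval(corpora, target, sources):
    """Train on the source domains only, then test on the target domain."""
    train_texts, train_labels = [], []
    for name in sources:
        if name == target:
            raise ValueError("the target domain must not be used for training")
        texts, labels = corpora[name]
        train_texts.extend(texts)
        train_labels.extend(labels)
    model = NaiveBayes().fit(train_texts, train_labels)
    return accuracy(model, *corpora[target])


# Hypothetical toy corpora, invented purely for illustration.
corpora = {
    "facebook": (["obrigada amiga", "obrigado amigo"], ["f", "m"]),
    "blogs": (["cansada hoje", "cansado hoje"], ["f", "m"]),
    "opinions": (["estou cansada obrigada", "estou cansado obrigado"], ["f", "m"]),
    "egov": (["obrigada pela resposta", "obrigado pela resposta"], ["f", "m"]),
}

single = cross_domain_eval(corpora, target="egov", sources=["blogs"])
combined = cross_domain_eval(corpora, target="egov", sources=["facebook", "blogs"])
print(f"single source: {single:.2f}, combined sources: {combined:.2f}")
```

In this toy setup the single-source model shares no vocabulary with the target domain and falls back to the class prior, whereas the combined model picks up gender-marked words that also occur in the target. That only illustrates the kind of domain-similarity effect the abstract refers to and says nothing about the paper's actual figures.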
Anthology ID:
2020.lrec-1.154
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1227–1234
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.154
Cite (ACL):
Rafael Dias and Ivandré Paraboni. 2020. Cross-domain Author Gender Classification in Brazilian Portuguese. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1227–1234, Marseille, France. European Language Resources Association.
Cite (Informal):
Cross-domain Author Gender Classification in Brazilian Portuguese (Dias & Paraboni, LREC 2020)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2020.lrec-1.154.pdf