The Arabic Parallel Gender Corpus 2.0: Extensions and Analyses

Bashar Alhafni, Nizar Habash, Houda Bouamor


Abstract
Gender bias in natural language processing (NLP) applications, particularly machine translation, has been receiving increasing attention. Much of the research on this issue has focused on mitigating gender bias in English NLP models and systems. Addressing the problem in poorly resourced and/or morphologically rich languages has lagged behind, largely due to the lack of datasets and resources. In this paper, we introduce a new corpus for gender identification and rewriting in contexts involving one or two target users (I and/or You) – first and second grammatical persons with independent grammatical gender preferences. We focus on Arabic, a gender-marking, morphologically rich language. The corpus has multiple parallel components: four combinations of 1st and 2nd person in feminine and masculine grammatical genders, as well as English and English-to-Arabic machine translation output. This corpus expands on the Arabic Parallel Gender Corpus (APGC v1.0) of Habash et al. (2019) by adding second person targets and by increasing the total number of sentences more than 6.5-fold, reaching over 590K words. Our new dataset will aid the research and development of gender identification, controlled text generation, and post-editing rewrite systems that could be used to personalize NLP applications and provide users with the correct outputs based on their grammatical gender preferences. We make the Arabic Parallel Gender Corpus (APGC v2.0) publicly available.
Anthology ID:
2022.lrec-1.199
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1870–1884
URL:
https://aclanthology.org/2022.lrec-1.199
Cite (ACL):
Bashar Alhafni, Nizar Habash, and Houda Bouamor. 2022. The Arabic Parallel Gender Corpus 2.0: Extensions and Analyses. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1870–1884, Marseille, France. European Language Resources Association.
Cite (Informal):
The Arabic Parallel Gender Corpus 2.0: Extensions and Analyses (Alhafni et al., LREC 2022)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2022.lrec-1.199.pdf
Data:
OpenSubtitles