Gender Swapping as a Data Augmentation Technique: Developing Gender-Balanced Datasets for Ukrainian Language Processing

Olha Nahurna, Mariana Romanyshyn


Abstract
This paper presents a pipeline for generating gender-balanced datasets through sentence-level gender swapping, addressing the gender-imbalance issue in Ukrainian texts. We select sentences with gender-marked entities, focusing on job titles, generate their inverted alternatives using LLMs and human-in-the-loop, and fine-tune Aya-101 on the resulting dataset for the task of gender swapping. Additionally, we train a Named Entity Recognition (NER) model on gender-balanced data, demonstrating its ability to better recognize gendered entities. The findings unveil the potential of gender-balanced datasets to enhance model robustness and support more fair language processing. Finally, we make a gender-swapped version of NER-UK~2.0 and the fine-tuned Aya-101 model available for download and further research.
Anthology ID:
2025.unlp-1.16
Volume:
Proceedings of the Fourth Ukrainian Natural Language Processing Workshop (UNLP 2025)
Month:
July
Year:
2025
Address:
Vienna, Austria (online)
Editor:
Mariana Romanyshyn
Venues:
UNLP | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
147–161
Language:
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.unlp-1.16/
DOI:
Bibkey:
Cite (ACL):
Olha Nahurna and Mariana Romanyshyn. 2025. Gender Swapping as a Data Augmentation Technique: Developing Gender-Balanced Datasets for Ukrainian Language Processing. In Proceedings of the Fourth Ukrainian Natural Language Processing Workshop (UNLP 2025), pages 147–161, Vienna, Austria (online). Association for Computational Linguistics.
Cite (Informal):
Gender Swapping as a Data Augmentation Technique: Developing Gender-Balanced Datasets for Ukrainian Language Processing (Nahurna & Romanyshyn, UNLP 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.unlp-1.16.pdf