中文句子级性别无偏数据集构建及预训练语言模型的性别偏度评估(Construction of Chinese Sentence-Level Gender-Unbiased Data Set and Evaluation of Gender Bias in Pre-Training Language)

Jishun Zhao (赵继舜), Bingjie Du (杜冰洁), Shucheng Zhu (朱述承), Pengyuan Liu (刘鹏远)


Abstract
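Gender bias is widespread in models across natural language processing tasks. However, no Chinese dataset for evaluating or mitigating gender bias currently exists, so gender bias in Chinese natural language processing models cannot be assessed. This paper first uses 16 pairs of gendered terms of address to select gender-unbiased sentences from a print-media corpus, building SlguSet, a Chinese sentence-level gender-unbiased dataset of 20,000 sentences. It then proposes a metric that measures the degree of gender bias in pre-trained language models and uses it to evaluate five popular pre-trained language models. The results show that Chinese pre-trained language models exhibit gender bias to varying degrees, and that the constructed dataset evaluates this bias well. The dataset can also be used to evaluate debiasing methods for pre-trained language models.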
Anthology ID:
2021.ccl-1.51
Volume:
Proceedings of the 20th Chinese National Conference on Computational Linguistics
Month:
August
Year:
2021
Address:
Huhhot, China
Editors:
Sheng Li (李生), Maosong Sun (孙茂松), Yang Liu (刘洋), Hua Wu (吴华), Kang Liu (刘康), Wanxiang Che (车万翔), Shizhu He (何世柱), Gaoqi Rao (饶高琦)
Venue:
CCL
Publisher:
Chinese Information Processing Society of China
Pages:
564–575
Language:
Chinese
URL:
https://aclanthology.org/2021.ccl-1.51
Cite (ACL):
Jishun Zhao, Bingjie Du, Shucheng Zhu, and Pengyuan Liu. 2021. 中文句子级性别无偏数据集构建及预训练语言模型的性别偏度评估(Construction of Chinese Sentence-Level Gender-Unbiased Data Set and Evaluation of Gender Bias in Pre-Training Language). In Proceedings of the 20th Chinese National Conference on Computational Linguistics, pages 564–575, Huhhot, China. Chinese Information Processing Society of China.
Cite (Informal):
中文句子级性别无偏数据集构建及预训练语言模型的性别偏度评估(Construction of Chinese Sentence-Level Gender-Unbiased Data Set and Evaluation of Gender Bias in Pre-Training Language) (Zhao et al., CCL 2021)
PDF:
https://preview.aclanthology.org/nschneid-patch-1/2021.ccl-1.51.pdf
Data:
GAP Coreference Dataset