2024
Research on Detoxification of Chinese Sexist Texts Based on Text Style Transfer (基于文本风格迁移的中文性别歧视文本去毒研究)
Jian Peng (彭健) | Jiali Zuo (左家莉) | Jingxuan Tan (谭景璇) | Jianyi Wan (万剑怡) | Mingwen Wang (王明文)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)
Online social media platforms contain a certain degree of sexist speech, which hinders the healthy development of the Internet and of social civility. Text style transfer can mitigate sexism in text and has been studied extensively for languages such as English. In Chinese, however, related research is scarce due to the lack of datasets. Moreover, because Chinese is semantically rich and expressively diverse, the toxicity of sexist speech takes many forms, and existing methods, which mostly rely on a single text style transfer model, perform poorly. We therefore propose a Chinese sexist-text detoxification framework based on text style transfer: it first classifies texts by the form their toxicity takes, then applies a different processing strategy to each form; we also use a large language model (LLM) to build a lexicon of discriminatory terms. Experiments show that the proposed model effectively handles sexism in Chinese text.
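The abstract only outlines the framework, so the following is a minimal sketch of the routing idea in Python; every name here (classify_form, rewriters, llm_generate) is an illustrative placeholder, not the paper's actual design:

```python
# Hypothetical sketch of the two-stage detoxification framework described
# in the abstract: classify the form of sexist toxicity, then route the
# text to a matching rewriting strategy. All component names are assumed.

from typing import Callable, Dict, Set

def build_sexist_lexicon(llm_generate: Callable[[str], str]) -> Set[str]:
    """Use an LLM to propose discriminatory terms for a lexicon (assumed API)."""
    response = llm_generate(
        "List common Chinese sexist or discriminatory terms, comma-separated.")
    return {term.strip() for term in response.split(",") if term.strip()}

def detoxify(text: str,
             classify_form: Callable[[str], str],
             rewriters: Dict[str, Callable[[str], str]]) -> str:
    """Route a toxic sentence to the rewriter matching its toxicity form."""
    form = classify_form(text)  # e.g. "explicit_lexical" vs. "implicit"
    rewriter = rewriters.get(form, lambda t: t)  # pass through unknown forms
    return rewriter(text)

# Example wiring (placeholders, not the paper's models): explicit toxicity
# could go to lexicon-based substitution, implicit toxicity to a style-
# transfer model, mirroring the form-dependent processing the paper describes.
```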
2018
Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling
Liyuan Liu | Xiang Ren | Jingbo Shang | Xiaotao Gu | Jian Peng | Jiawei Han
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Many efforts have been made to facilitate natural language processing tasks with pre-trained language models (LMs), bringing significant improvements to various applications. To fully leverage nearly unlimited corpora and capture linguistic information at multiple levels, large LMs are required; but for a specific task, only part of this information is useful. Such large LMs, even at inference time, may incur heavy computation workloads, making them too time-consuming for large-scale applications. Here we propose to compress bulky LMs while preserving the information useful for a specific task. As different layers of the model keep different information, we develop a layer selection method for model pruning using sparsity-inducing regularization. By introducing dense connectivity, we can detach any layer without affecting the others, stretching shallow and wide LMs into deep and narrow ones. During training, LMs are learned with layer-wise dropout for better robustness. Experiments on two benchmark datasets demonstrate the effectiveness of our method.
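As a toy illustration of the pruning mechanism the abstract describes (not the paper's code), one can gate each densely connected layer with a learnable scalar and add an L1 penalty on the gates, so that layers driven to zero can be detached:

```python
# Toy sketch of layer selection via sparsity-inducing regularization:
# each densely connected layer's output is scaled by a learnable gate,
# and an L1 penalty on the gates drives unneeded layers toward zero.

import torch
import torch.nn as nn

class GatedDenseStack(nn.Module):
    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        self.gates = nn.Parameter(torch.ones(num_layers))  # one gate per layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dense connectivity: every layer adds its gated output to the
        # residual stream, so zeroing a gate detaches that layer without
        # affecting the others.
        for gate, layer in zip(self.gates, self.layers):
            x = x + gate * torch.relu(layer(x))
        return x

    def sparsity_penalty(self) -> torch.Tensor:
        return self.gates.abs().sum()  # L1 term added to the task loss

# Training would minimize task_loss + lam * model.sparsity_penalty(),
# then prune layers whose gates end up (near) zero.
```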
SemRegex: A Semantics-Based Approach for Generating Regular Expressions from Natural Language Specifications
Zexuan Zhong | Jiaqi Guo | Wei Yang | Jian Peng | Tao Xie | Jian-Guang Lou | Ting Liu | Dongmei Zhang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Recent research proposes syntax-based approaches to the problem of generating programs from natural language specifications. These approaches typically train a sequence-to-sequence model with a syntax-based objective: maximum likelihood estimation (MLE). Such syntax-based approaches do not effectively address the goal of generating semantically correct programs, because they fail to handle Program Aliasing: semantically equivalent programs may have many syntactically different forms. To address this issue, we propose a semantics-based approach named SemRegex, which targets a subtask of program synthesis: generating regular expressions from natural language. Unlike existing syntax-based approaches, SemRegex trains the model by maximizing the expected semantic correctness of the generated regular expressions, with semantic correctness measured using a DFA-equivalence oracle, random test cases, and distinguishing test cases. Experiments on three public datasets demonstrate the superiority of SemRegex over existing state-of-the-art approaches.
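A hedged sketch of the training idea follows: rather than MLE against one gold regex, sample candidates and reward semantic agreement with the gold pattern, here approximated with random test strings (a DFA-equivalence oracle would give an exact 0/1 reward); `model.sample` and the surrounding wiring are assumptions, not SemRegex's API:

```python
# Sketch of a semantic-correctness reward for regex generation, assuming
# a seq2seq policy we can sample from. Agreement is estimated on random
# test strings; invalid regexes receive zero reward.

import random
import re
import string

def semantic_reward(candidate: str, gold: str, num_tests: int = 100) -> float:
    """Fraction of random strings on which candidate and gold regex agree."""
    try:
        cand_re, gold_re = re.compile(candidate), re.compile(gold)
    except re.error:
        return 0.0  # syntactically invalid candidate
    agree = 0
    for _ in range(num_tests):
        s = "".join(random.choices(string.ascii_lowercase + string.digits,
                                   k=random.randint(0, 8)))
        # Both match, or both fail to match -> agreement on this test case.
        agree += (cand_re.fullmatch(s) is None) == (gold_re.fullmatch(s) is None)
    return agree / num_tests

# Policy-gradient step around an assumed model (pseudocode):
#   regex, log_prob = model.sample(nl_spec)
#   loss = -semantic_reward(regex, gold_regex) * log_prob
```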
emrQA: A Large Corpus for Question Answering on Electronic Medical Records
Anusri Pampari | Preethi Raghavan | Jennifer Liang | Jian Peng
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
We propose a novel methodology for generating domain-specific large-scale question answering (QA) datasets by re-purposing existing annotations for other NLP tasks. We demonstrate an instance of this methodology by generating a large-scale QA dataset for electronic medical records, leveraging existing expert annotations on clinical notes for various NLP tasks from the community-shared i2b2 datasets. The resulting corpus (emrQA) has 1 million question-logical form pairs and 400,000+ question-answer evidence pairs. We characterize the dataset and explore its learning potential by training baseline models for question-to-logical-form and question-to-answer mapping.
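An illustrative sketch (not the emrQA pipeline itself) of the re-purposing idea: existing annotations on clinical notes supply the answer evidence, and question templates are instantiated with the annotated entities to produce QA pairs; the template set and annotation schema below are assumptions:

```python
# Sketch of generating QA pairs by re-purposing existing annotations:
# each annotation's entity fills a question template, and the annotated
# note span becomes the answer evidence.

from typing import Dict, List

TEMPLATES = {
    # Hypothetical templates keyed by the annotation type they re-purpose.
    "medication": ["What is the dosage of {entity}?",
                   "Why is the patient taking {entity}?"],
    "problem": ["When was the patient diagnosed with {entity}?"],
}

def generate_qa(annotations: List[Dict]) -> List[Dict]:
    """Each annotation: {'type': ..., 'entity': ..., 'evidence': <note span>}."""
    qa_pairs = []
    for ann in annotations:
        for template in TEMPLATES.get(ann["type"], []):
            qa_pairs.append({
                "question": template.format(entity=ann["entity"]),
                "answer_evidence": ann["evidence"],  # original annotation span
            })
    return qa_pairs

# Usage example:
# generate_qa([{"type": "medication", "entity": "metformin",
#               "evidence": "metformin 500 mg twice daily"}])
```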