Hongying Zan

Also published as: Hong-ying Zan


MMDAG: Multimodal Directed Acyclic Graph Network for Emotion Recognition in Conversation
Shuo Xu | Yuxiang Jia | Changyong Niu | Hongying Zan
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Emotion recognition in conversation is important for an empathetic dialogue system to understand the user’s emotion and then generate appropriate emotional responses. However, most previous researches focus on modeling conversational contexts primarily based on the textual modality or simply utilizing multimodal information through feature concatenation. In order to exploit multimodal information and contextual information more effectively, we propose a multimodal directed acyclic graph (MMDAG) network by injecting information flows inside modality and across modalities into the DAG architecture. Experiments on IEMOCAP and MELD show that our model outperforms other state-of-the-art models. Comparative studies validate the effectiveness of the proposed modality fusion method.

CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
Ningyu Zhang | Mosha Chen | Zhen Bi | Xiaozhuan Liang | Lei Li | Xin Shang | Kangping Yin | Chuanqi Tan | Jian Xu | Fei Huang | Luo Si | Yuan Ni | Guotong Xie | Zhifang Sui | Baobao Chang | Hui Zong | Zheng Yuan | Linfeng Li | Jun Yan | Hongying Zan | Kunli Zhang | Buzhou Tang | Qingcai Chen
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Artificial Intelligence (AI), along with the recent progress in biomedical language understanding, is gradually offering great promise for medical practice. With the development of biomedical language understanding benchmarks, AI applications are widely used in the medical field. However, most benchmarks are limited to English, which makes it challenging to replicate many of the successes in English for other languages. To facilitate research in this direction, we collect real-world biomedical data and present the first Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark: a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification, and an associated online platform for model evaluation, comparison, and analysis. To establish evaluation on these tasks, we report empirical results with the current 11 pre-trained Chinese models, and experimental results show that state-of-the-art neural models perform by far worse than the human ceiling.

ParaZh-22M: A Large-Scale Chinese Parabank via Machine Translation
Wenjie Hao | Hongfei Xu | Deyi Xiong | Hongying Zan | Lingling Mu
Proceedings of the 29th International Conference on Computational Linguistics

Paraphrasing, i.e., restating the same meaning in different ways, is an important data augmentation approach for natural language processing (NLP). Zhang et al. (2019b) propose to extract sentence-level paraphrases from multiple Chinese translations of the same source texts, and construct the PKU Paraphrase Bank of 0.5M sentence pairs. However, despite being the largest Chinese parabank to date, the size of PKU parabank is limited by the availability of one-to-many sentence translation data, and cannot well support the training of large Chinese paraphrasers. In this paper, we relieve the restriction with one-to-many sentence translation data, and construct ParaZh-22M, a larger Chinese parabank that is composed of 22M sentence pairs, based on one-to-one bilingual sentence translation data and machine translation (MT). In our data augmentation experiments, we show that paraphrasing based on ParaZh-22M can bring about consistent and significant improvements over several strong baselines on a wide range of Chinese NLP tasks, including a number of Chinese natural language understanding benchmarks (CLUE) and low-resource machine translation.

期货领域知识图谱构建(Construction of Knowledge Graph in Futures Field)
Wenxin Li (李雯昕) | Hongying Zan (昝红英) | Tongfeng Guan (关同峰) | Yingjie Han (韩英杰)
Proceedings of the 21st Chinese National Conference on Computational Linguistics

“期货领域是数据最丰富的领域之一,本文以商品期货的研究报告为数据来源构建了期货领域知识图谱(Commodity Futures Knowledge Graph,CFKG)。以期货产品为核心,确立了概念分类体系及关系描述体系,形成图谱的概念层;在MHS-BIA与GPN模型的基础上,通过领域专家指导对242万字的研报文本进行标注与校对,形成了CFKG数据层,并设计了可视化查询系统。所构建的CFKG包含17,003个农产品期货关系三元组、13,703种非农产品期货关系三元组,为期货领域文本分析、舆情监控和推理决策等应用提供知识支持。”


融入篇章信息的文学作品命名实体识别(Document-level Literary Named Entity Recognition)
Yuxiang Jia (贾玉祥) | Rui Chao (晁睿) | Hongying Zan (昝红英) | Huayi Dou (窦华溢) | Shuai Cao (曹帅) | Shuo Xu (徐硕)
Proceedings of the 20th Chinese National Conference on Computational Linguistics


糖尿病电子病历实体及关系标注语料库构建(Construction of Corpus for Entity and Relation Annotation of Diabetes Electronic Medical Records)
Yajuan Ye (叶娅娟) | Bin Hu (胡斌) | Kunli Zhang (张坤丽) | Hongying Zan (昝红英)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

电子病历是医疗信息的重要来源,包含大量与医疗相关的领域知识。本文从糖尿病电子病历文本入手,在调研了国内外已有的电子病历语料库的基础上,参考坉圲坂圲实体及关系分类,建立了糖尿病电子病历实体及实体关系分类体系,并制定了标注规范。利用实体及关系标注平台,进行了实体及关系预标注及多轮人工校对工作,形成了糖尿病电子病历实体及关系标注语料库(Diabetes Electronic Medical Record entity and Related Corpus DEMRC)。所构建的DEMRC包含8899个实体、456个实体修饰及16564个关系。对DEMRC进行一致性评价和分析,标注结果达到了较高的一致性。针对实体识别和实体关系抽取任务,分别采用基于迁移学习的Bi-LSTM-CRF模型和RoBERTa模型进行初步实验,并对语料库中的各类实体及关系进行评估,为后续糖尿病电子病历实体识别及关系抽取研究以及糖尿病知识图谱构建打下基础。

脑卒中疾病电子病历实体及实体关系标注语料库构建(Corpus Construction for Named-Entity and Entity Relations for Electronic Medical Records of Stroke Disease)
Hongyang Chang (常洪阳) | Hongying Zan (昝红英) | Yutuan Ma (马玉团) | Kunli Zhang (张坤丽)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

本文探讨了在脑卒中疾病中文电子病历文本中实体及实体间关系的标注问题,提出了适用于脑卒中疾病电子病历文本的实体及实体关系标注体系和规范。在标注体系和规范的指导下,进行了多轮的人工标注及校正工作,完成了158万余字的脑卒中电子病历文本实体及实体关系的标注工作。构建了脑卒中电子病历实体及实体关系标注语料库(Stroke Electronic Medical Record entity and entity related Corpus SEMRC)。所构建的语料库共包含命名实体10594个,实体关系14457个。实体名标注一致率达到85.16%,实体关系标注一致率达到94.16%。

Self-Supervised Curriculum Learning for Spelling Error Correction
Zifa Gan | Hongfei Xu | Hongying Zan
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Spelling Error Correction (SEC) that requires high-level language understanding is a challenging but useful task. Current SEC approaches normally leverage a pre-training then fine-tuning procedure that treats data equally. By contrast, Curriculum Learning (CL) utilizes training data differently during training and has shown its effectiveness in improving both performance and training efficiency in many other NLP tasks. In NMT, a model’s performance has been shown sensitive to the difficulty of training examples, and CL has been shown effective to address this. In SEC, the data from different language learners are naturally distributed at different difficulty levels (some errors made by beginners are obvious to correct while some made by fluent speakers are hard), and we expect that designing a curriculum correspondingly for model learning may also help its training and bring about better performance. In this paper, we study how to further improve the performance of the state-of-the-art SEC method with CL, and propose a Self-Supervised Curriculum Learning (SSCL) approach. Specifically, we directly use the cross-entropy loss as criteria for: 1) scoring the difficulty of training data, and 2) evaluating the competence of the model. In our approach, CL improves the model training, which in return improves the CL measurement. In our experiments on the SIGHAN 2015 Chinese spelling check task, we show that SSCL is superior to previous norm-based and uncertainty-aware approaches, and establish a new state of the art (74.38% F1).


面向医学文本处理的医学实体标注规范(Medical Entity Annotation Standard for Medical Text Processing)
Huan Zhang (张欢) | Yuan Zong (宗源) | Baobao Chang (常宝宝) | Zhifang Sui (穗志方) | Hongying Zan (昝红英) | Kunli Zhang (张坤丽)
Proceedings of the 19th Chinese National Conference on Computational Linguistics


Konwledge-Enabled Diagnosis Assistant Based on Obstetric EMRs and Knowledge Graph
Kunli Zhang | Xu Zhao | Lei Zhuang | Qi Xie | Hongying Zan
Proceedings of the 19th Chinese National Conference on Computational Linguistics

The obstetric Electronic Medical Record (EMR) contains a large amount of medical data and health information. It plays a vital role in improving the quality of the diagnosis assistant service. In this paper, we treat the diagnosis assistant as a multi-label classification task and propose a Knowledge-Enabled Diagnosis Assistant (KEDA) model for the obstetric diagnosis assistant. We utilize the numerical information in EMRs and the external knowledge from Chinese Obstetric Knowledge Graph (COKG) to enhance the text representation of EMRs. Specifically, the bidirectional maximum matching method and similarity-based approach are used to obtain the entities set contained in EMRs and linked to the COKG. The final knowledge representation is obtained by a weight-based disease prediction algorithm, and it is fused with the text representation through a linear weighting method. Experiment results show that our approach can bring about +3.53 F1 score improvements upon the strong BERT baseline in the diagnosis assistant task.

Chinese Grammatical Error Diagnosis Based on RoBERTa-BiLSTM-CRF Model
Yingjie Han | Yingjie Yan | Yangchao Han | Rui Chao | Hongying Zan
Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications

Chinese Grammatical Error Diagnosis (CGED) is a natural language processing task for the NLPTEA6 workshop. The goal of this task is to automatically diagnose grammatical errors in Chinese sentences written by L2 learners. This paper proposes a RoBERTa-BiLSTM-CRF model to detect grammatical errors in sentences. Firstly, RoBERTa model is used to obtain word vectors. Secondly, word vectors are input into BiLSTM layer to learn context features. Last, CRF layer without hand-craft features work for processing the output by BiLSTM. The optimal global sequences are obtained according to state transition matrix of CRF and adjacent labels of training data. In experiments, the result of RoBERTa-CRF model and ERNIE-BiLSTM-CRF model are compared, and the impacts of parameters of the models and the testing datasets are analyzed. In terms of evaluation results, our recall score of RoBERTa-BiLSTM-CRF ranks fourth at the detection level.

Chinese Grammatical Errors Diagnosis System Based on BERT at NLPTEA-2020 CGED Shared Task
Hongying Zan | Yangchao Han | Haotian Huang | Yingjie Yan | Yuke Wang | Yingjie Han
Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications

In the process of learning Chinese, second language learners may have various grammatical errors due to the negative transfer of native language. This paper describes our submission to the NLPTEA 2020 shared task on CGED. We present a hybrid system that utilizes both detection and correction stages. The detection stage is a sequential labelling model based on BiLSTM-CRF and BERT contextual word representation. The correction stage is a hybrid model based on the n-gram and Seq2Seq. Without adding additional features and external data, the BERT contextual word representation can effectively improve the performance metrics of Chinese grammatical error detection and correction.


Detecting Simultaneously Chinese Grammar Errors Based on a BiLSTM-CRF Model
Yajun Liu | Hongying Zan | Mengjie Zhong | Hongchao Ma
Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications

In the process of learning and using Chinese, many learners of Chinese as foreign language(CFL) may have grammar errors due to negative migration of their native languages. This paper introduces our system that can simultaneously diagnose four types of grammatical errors including redundant (R), missing (M), selection (S), disorder (W) in NLPTEA-5 shared task. We proposed a Bidirectional LSTM CRF neural network (BiLSTM-CRF) that combines BiLSTM and CRF without hand-craft features for Chinese Grammatical Error Diagnosis (CGED). Evaluation includes three levels, which are detection level, identification level and position level. At the detection level and identification level, our system got the third recall scores, and achieved good F1 values.

Research on Entity Relation Extraction for Military Field
Chen Liang | Hongying Zan | Yajun Liu | Yunfang Wu
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation


Automatic Grammatical Error Detection for Chinese based on Conditional Random Field
Yajun Liu | Yingjie Han | Liyan Zhuo | Hongying Zan
Proceedings of the 3rd Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA2016)

In the process of learning and using Chinese, foreigners may have grammatical errors due to negative migration of their native languages. Currently, the computer-oriented automatic detection method of grammatical errors is not mature enough. Based on the evaluating task — CGED2016, we select and analyze the classification model and design feature extraction method to obtain grammatical errors including Mission(M), Disorder(W), Selection (S) and Redundant (R) automatically. The experiment results based on the dynamic corpus of HSK show that the Chinese grammatical error automatic detection method, which uses CRF as classification model and n-gram as feature extraction method. It is simple and efficient which play a positive effect on the research of Chinese grammatical error automatic detection and also a supporting and guiding role in the teaching of Chinese as a foreign language.


A Comparison of Chinese Word Segmentation on News and Microblog Corpora with a Lexicon Based Method
Yuxiang Jia | Hongying Zan | Ming Fan | Zhimin Wang
Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing

Chinese Personal Name Disambiguation Based on Vector Space Model
Qing-hu Fan | Hong-ying Zan | Yu-mei Chai | Yu-xiang Jia | Gui-ling Niu
Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing


Studies on Automatic Recognition of Common Chinese Adverb’s usages Based on Statistics Methods
Hongying Zan | Junhui Zhang | Xuefeng Zhu | Shiwen Yu
CIPS-SIGHAN Joint Conference on Chinese Language Processing