Zheng Ye

2022

pdf abs
基于相似度进行句子选择的机器阅读理解数据增强(Machine reading comprehension data Augmentation for sentence selection based on similarity)
Shuang Nie (聂双) | Zheng Ye (叶正) | Jun Qin (覃俊) | Jing Liu (刘晶)
Proceedings of the 21st Chinese National Conference on Computational Linguistics

“目前常见的机器阅读理解数据增强方法如回译,单独对文章或者问题进行数据增强,没有考虑文章、问题和选项三元组之间的联系。因此,本文探索了一种利用三元组联系进行文章句子筛选的数据增强方法,通过比较文章与问题以及选项的相似度,选取文章中与二者联系紧密的句子。同时为了使不同选项的三元组区别增大,我们选用了正则化Dropout的策略。实验结果表明,在RACE数据集上的准确率可提高3.8%。”

2021

pdf abs
Towards Quantifiable Dialogue Coherence Evaluation
Zheng Ye | Liucun Lu | Lishan Huang | Liang Lin | Xiaodan Liang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Automatic dialogue coherence evaluation has attracted increasing attention and is crucial for developing promising dialogue systems. However, existing metrics have two major limitations: (a) they are mostly trained in a simplified two-level setting (coherent vs. incoherent), while humans give Likert-type multi-level coherence scores, dubbed as “quantifiable”; (b) their predicted coherence scores cannot align with the actual human rating standards due to the absence of human guidance during training. To address these limitations, we propose Quantifiable Dialogue Coherence Evaluation (QuantiDCE), a novel framework aiming to train a quantifiable dialogue coherence metric that can reflect the actual human rating standards. Specifically, QuantiDCE includes two training stages, Multi-Level Ranking (MLR) pre-training and Knowledge Distillation (KD) fine-tuning. During MLR pre-training, a new MLR loss is proposed for enabling the model to learn the coarse judgement of coherence degrees. Then, during KD fine-tuning, the pretrained model is further finetuned to learn the actual human rating standards with only very few human-annotated data. To advocate the generalizability even with limited fine-tuning data, a novel KD regularization is introduced to retain the knowledge learned at the pre-training stage. Experimental results show that the model trained by QuantiDCE presents stronger correlations with human judgements than the other state-of-the-art metrics.

2020

pdf abs
GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems
Lishan Huang | Zheng Ye | Jinghui Qin | Liang Lin | Xiaodan Liang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Automatically evaluating dialogue coherence is a challenging but high-demand ability for developing high-quality open-domain dialogue systems. However, current evaluation metrics consider only surface features or utterance-level semantics, without explicitly considering the fine-grained topic transition dynamics of dialogue flows. Here, we first consider that the graph structure constituted with topics in a dialogue can accurately depict the underlying communication logic, which is a more natural way to produce persuasive metrics. Capitalized on the topic-level dialogue graph, we propose a new evaluation metric GRADE, which stands for Graph-enhanced Representations for Automatic Dialogue Evaluation. Specifically, GRADE incorporates both coarse-grained utterance-level contextualized representations and fine-grained topic-level graph representations to evaluate dialogue coherence. The graph representations are obtained by reasoning over topic-level dialogue graphs enhanced with the evidence from a commonsense graph, including k-hop neighboring representations and hop-attention weights. Experimental results show that our GRADE significantly outperforms other state-of-the-art metrics on measuring diverse dialogue models in terms of the Pearson and Spearman correlations with human judgments. Besides, we release a new large-scale human evaluation benchmark to facilitate future research on automatic metrics.