Zhengcuo Dan


2023

基于数据增强的藏文机器阅读有难度问题的生成(Difficult Question Generation of Tibetan Machine Reading Based on Data Enhancement)
Zhengcuo Dan (旦正错) | Long Chen (陈龙) | Junjie Deng (邓俊杰) | Xian Pang (庞仙) | Yuan Sun (孙媛)
Proceedings of the 22nd Chinese National Conference on Computational Linguistics

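Question generation is a subtask of building machine reading comprehension datasets: given a context with or without an answer, the computer generates a set of fluent questions. For Chinese and English, end-to-end question generation models are well developed and have been used to build large numbers of high-quality question-answer pairs. For low-resource languages such as Tibetan, however, data-driven tasks such as machine reading comprehension and intelligent question answering still commonly suffer from small data volumes and overly simple question-answer pairs. This paper therefore proposes three methods for generating difficult questions for Tibetan machine reading: (1) generating unanswerable questions by masking and replacing keywords with a Tibetan pre-trained language model; (2) generating unanswerable questions by crossing questions among similar paragraphs; (3) generating questions that require knowledge reasoning from triples. Experiments on the constructed dataset show that a machine reading comprehension dataset containing unanswerable and knowledge-reasoning questions places higher demands on a model's comprehension ability. In addition, the quality of the constructed unanswerable questions is verified at three levels: readability, relevance, and answerability.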
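
The second method in the abstract, crossing questions among similar paragraphs, can be illustrated with a minimal sketch. This is not the authors' implementation: the Jaccard token overlap, the `threshold` value, the whitespace tokenization (which would not segment real Tibetan text), and the function names `jaccard` and `cross_unanswerable` are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Token-set overlap between two paragraphs (an assumed similarity measure)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def cross_unanswerable(examples, threshold=0.3):
    """Attach to each paragraph the questions written for similar paragraphs.

    examples: list of dicts with "context" and "question" keys.
    Returns new examples labeled as unanswerable candidates.
    """
    out = []
    for i, src in enumerate(examples):
        for j, other in enumerate(examples):
            if i == j:
                continue  # a paragraph's own question stays answerable
            if jaccard(src["context"], other["context"]) >= threshold:
                out.append({
                    "context": src["context"],
                    "question": other["question"],
                    "answerable": False,  # candidate label, still needs checking
                })
    return out


examples = [
    {"context": "the yak lives on the high plateau",
     "question": "Where does the yak live?"},
    {"context": "the antelope lives on the high plateau",
     "question": "Where does the antelope live?"},
    {"context": "butter tea is made with salt",
     "question": "What is butter tea made with?"},
]
crossed = cross_unanswerable(examples)
```

A borrowed question can occasionally still be answered from the new paragraph, so candidates produced this way would need a check of the kind the abstract describes for readability, relevance, and answerability.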

2022

Question Generation Based on Grammar Knowledge and Fine-grained Classification
Yuan Sun | Sisi Liu | Zhengcuo Dan | Xiaobing Zhao
Proceedings of the 29th International Conference on Computational Linguistics

Question generation is the task of automatically generating questions from a given context and answers, and a common problem is that the types of the generated questions do not match the answers. In minority languages such as Tibetan, where grammar rules are complex and training data is scarce, research on question generation is still in its infancy. To address these problems, this paper constructs a question type classifier and a question generator. We divide question types at a fine-grained level and integrate grammatical knowledge into the question type classifier to improve the accuracy of type prediction. The types predicted by the classifier are then fed into the question generator. Our model improves the accuracy of interrogative words in generated questions and reaches BLEU-4 scores of 17.52 on SQuAD, 19.31 on HotpotQA, and 25.58 on TibetanQA.

2021

面向机器阅读理解的高质量藏语数据集构建(Construction of High-quality Tibetan Dataset for Machine Reading Comprehension)
Yuan Sun (孙媛) | Sisi Liu (刘思思) | Chaofan Chen (陈超凡) | Zhengcuo Dan (旦正错) | Xiaobing Zhao (赵小兵)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

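Machine reading comprehension uses algorithms to have a machine answer questions based on a given context, thereby testing how well the machine understands natural language. Dataset construction is a main task in machine reading comprehension. Models have achieved remarkable results on most popular English datasets, even surpassing human performance. For low-resource languages, however, research on machine reading comprehension is still in its infancy because corresponding datasets are lacking. Taking Tibetan as an example, this paper manually constructs a Tibetan machine reading comprehension dataset (TibetanQA) containing 20,000 question-answer pairs and 1,513 articles. The articles all come from the 云藏网 website and cover knowledge in 12 domains, including nature, culture, and education; the questions take varied forms and have a certain level of difficulty. In addition, strict procedures are applied to article collection, question construction, answer verification, answer diversity, and reasoning ability to ensure data quality, and a validation method based on ablating language features from the input is used to demonstrate the quality of the dataset. Finally, the paper makes a preliminary exploration of how three classic English reading comprehension models perform on TibetanQA; their results fall short of human performance, which indicates that Tibetan machine reading comprehension requires further exploration.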