Zekun Deng


2024

CHisIEC: An Information Extraction Corpus for Ancient Chinese History
Xuemei Tang | Qi Su | Jun Wang | Zekun Deng
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Natural Language Processing (NLP) plays a pivotal role in Digital Humanities (DH) and serves as the cornerstone for advancing the structural analysis of historical and cultural heritage texts. This is particularly true for named entity recognition (NER) and relation extraction (RE). To advance research on ancient history and culture, we present the “Chinese Historical Information Extraction Corpus” (CHisIEC). CHisIEC is a curated dataset designed for developing and evaluating NER and RE models, offering a resource to facilitate research in the field. Drawing on data from 13 dynasties spanning over 1830 years, CHisIEC reflects the extensive temporal range and text heterogeneity inherent in Chinese historical documents. The dataset covers four entity types and twelve relation types, with 14,194 labeled entities and 8,609 labeled relations. To establish the robustness and versatility of our dataset, we conducted experiments with models of various sizes and paradigms. Additionally, we evaluated the capabilities of Large Language Models (LLMs) on tasks related to ancient Chinese history. The dataset and code are available at https://github.com/tangxuemei1995/CHisIEC.

2023

CCL23-Eval任务1总结报告:古籍命名实体识别(GuNER2023)(Overview of CCL23-Eval Task 1: Named Entity Recognition in Ancient Chinese Books)
Qi Su (苏祺) | Yingying Wang (王莹莹) | Zekun Deng (邓泽琨) | Hao Yang (杨浩) | Jun Wang (王军)
Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)

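The Chinese National Conference on Computational Linguistics (CCL) 2023 proposed 10 evaluation tasks on Chinese information processing. Task 1, named entity recognition in ancient Chinese books, was organized by the Research Center for Digital Humanities and the Institute for Artificial Intelligence at Peking University. The main goal of the task is to automatically identify the important entities that form the basic elements of events in ancient texts, providing a foundation for the analysis and processing of ancient Chinese texts. The evaluation released a “Twenty-Four Histories” dataset covering multiple dynasties and domains, totaling more than 150,000 characters and containing over 10,000 entities of three types: person names, book titles, and official titles. The task offered a closed track and an open track, focusing on the application capabilities of pre-trained models of different scales. A total of 127 teams registered for the task. On the closed track, the best system achieved an F1 score of 96.15% on the test set; on the open track, the best system achieved an F1 score of 95.48%.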

2022

数字人文视角下的《史记》《汉书》比较研究(A Comparative Study of Shiji and Hanshu from the Perspective of Digital Humanities)
Zekun Deng (邓泽琨) | Hao Yang (杨浩) | Jun Wang (王军)
Proceedings of the 21st Chinese National Conference on Computational Linguistics

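Shiji (Records of the Grand Historian) and Hanshu (Book of Han) have enduring research value. Although studies of the similarities and differences between the two books are already fairly abundant, they still fall short in comprehensiveness, completeness, scientific rigor, and objectivity. From the perspective of digital humanities, this paper uses computational linguistics methods to conduct a comparative study of Shiji and Hanshu through multi-granularity, multi-angle analysis of characters, words, named entities, and paragraphs. First, the paper compares the distribution and characteristics of characters, words, and named entities in the two books and, through exhaustive examination, extracts the similarities and differences in their main content, revealing the important changes and continuities in politics, culture, and thought across two historical periods: before Emperor Wu of Han, and from Emperor Wu to the fall of the Western Han. Second, the paper uses a text similarity algorithm that incorporates named entities as external features to automatically discover parallel passages between Shiji and Hanshu. It successfully identifies reused passages that earlier researchers had not found by manual means, giving a more complete and multi-dimensional picture of how Hanshu draws on Shiji. Third, the paper computes the longest common subsequence between parallel passages to automatically derive the differences between them. This demonstrates, at the level of aggregate statistics, the difference in writing style between Hanshu and Shiji, and the paper further explains the linguistic characteristics of the two at the level of individual passages, offering a new angle for understanding the textual variants between them. Working from the viewpoint of digital humanities, this study uses advanced computational methods to re-examine and rediscover Chinese classics that have been handed down for millennia, and its methods have some reference value for present-day research on ancient books.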
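The longest-common-subsequence step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `lcs_table` and `lcs_diff` are hypothetical, the comparison is done character by character as an assumption, and the input strings are placeholder Latin text rather than passages from either book.

```python
def lcs_table(a: str, b: str) -> list[list[int]]:
    """Return dp where dp[i][j] is the LCS length of a[i:] and b[j:]."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    return dp


def lcs_diff(a: str, b: str) -> tuple[str, str, str]:
    """Split two strings into (common subsequence, only in a, only in b).

    Hypothetical helper: walks the dp table once, keeping matching
    characters and recording the characters unique to each side.
    """
    dp = lcs_table(a, b)
    i = j = 0
    common, only_a, only_b = [], [], []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            only_a.append(a[i])  # skipping a[i] keeps the LCS length
            i += 1
        else:
            only_b.append(b[j])  # skipping b[j] keeps the LCS length
            j += 1
    only_a.extend(a[i:])  # leftover tail of a
    only_b.extend(b[j:])  # leftover tail of b
    return "".join(common), "".join(only_a), "".join(only_b)


# Placeholder strings, not text from either book.
print(lcs_diff("abcdef", "abXdef"))  # ('abdef', 'c', 'X')
```

Aggregating the "only in a" and "only in b" characters over many passage pairs is one way such per-passage differences could feed corpus-level statistics; whether the paper does this is not stated in the abstract.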