Abstract
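Word sense disambiguation (WSD), one of the most classic tasks in natural language processing, aims to identify the correct sense of a polysemous word in a given context. Polysemy is more prevalent in Chinese than in English, yet few Chinese WSD datasets have been publicly released. This paper crawls and merges two public online dictionaries and selects 1,083 words and their associated senses as annotation targets. Relevant sentences for these words are then extracted from web data and specialized corpora. Finally, the sentences are manually annotated through multi-annotator labeling followed by expert review. The dataset contains nearly 20,000 sentences, or about 20 sentences per word on average. The dataset is split into training, validation, and test sets, and a variety of models are compared experimentally.
- Anthology ID: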
- 2023.ccl-1.4
- Volume:
- Proceedings of the 22nd Chinese National Conference on Computational Linguistics
- Month:
- August
- Year:
- 2023
- Address:
- Harbin, China
- Editors:
- Maosong Sun, Bing Qin, Xipeng Qiu, Jing Jiang, Xianpei Han
- Venue:
- CCL
- Publisher:
- Chinese Information Processing Society of China
- Pages:
- 43–53
- Language:
- Chinese
- URL:
- https://aclanthology.org/2023.ccl-1.4
- Cite (ACL):
- Fukang Yan, Yue Zhang, and Zhenghua Li. 2023. 基于网络词典的现代汉语词义消歧数据集构建(Construction of a Modern Chinese Word Sense Dataset Based on Online Dictionaries). In Proceedings of the 22nd Chinese National Conference on Computational Linguistics, pages 43–53, Harbin, China. Chinese Information Processing Society of China.
- Cite (Informal):
- 基于网络词典的现代汉语词义消歧数据集构建(Construction of a Modern Chinese Word Sense Dataset Based on Online Dictionaries) (Yan et al., CCL 2023)
- PDF:
- https://preview.aclanthology.org/ml4al-ingestion/2023.ccl-1.4.pdf