基于词向量的自适应领域术语抽取方法(An Adaptive Domain-Specific Terminology Extraction Approach Based on Word Embedding)
Abstract
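“Terminology follows a long-tail distribution. To extract low-frequency terms effectively, this paper proposes an adaptive terminology extraction method based on word embeddings. The method uses a hypothesis-testing-based statistical procedure to determine the filtering threshold adaptively, and obtains candidate terms by progressively merging strongly associated strings in the text, which avoids the omission of low-frequency terms caused by a fixed threshold. Word embeddings for out-of-vocabulary candidate terms are then obtained with a masked language model, and a density-based clustering algorithm that incorporates dictionary knowledge assigns each candidate term to a domain cluster; candidate terms that fall into the target domain cluster are identified as domain terms. Experimental results show that our method not only outperforms the comparison methods on the reported evaluation metric, but also performs more stably across texts of different genres. The method extracts low-frequency terms comprehensively and effectively, achieving high-quality domain terminology extraction.”
- Anthology ID: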
- 2023.ccl-1.17
- Volume:
- Proceedings of the 22nd Chinese National Conference on Computational Linguistics
- Month:
- August
- Year:
- 2023
- Address:
- Harbin, China
- Editors:
- Maosong Sun, Bing Qin, Xipeng Qiu, Jing Jiang, Xianpei Han
- Venue:
- CCL
- Publisher:
- Chinese Information Processing Society of China
- Pages:
- 186–195
- Language:
- Chinese
- URL:
- https://aclanthology.org/2023.ccl-1.17
- Cite (ACL):
- Xi Tang, Dongchen Jiang, and Aoyuan Jiang. 2023. 基于词向量的自适应领域术语抽取方法(An Adaptive Domain-Specific Terminology Extraction Approach Based on Word Embedding). In Proceedings of the 22nd Chinese National Conference on Computational Linguistics, pages 186–195, Harbin, China. Chinese Information Processing Society of China.
- Cite (Informal):
- 基于词向量的自适应领域术语抽取方法(An Adaptive Domain-Specific Terminology Extraction Approach Based on Word Embedding) (Tang et al., CCL 2023)
- PDF:
- https://preview.aclanthology.org/teach-a-man-to-fish/2023.ccl-1.17.pdf