CLEEK: A Chinese Long-text Corpus for Entity Linking

Weixin Zeng, Xiang Zhao, Jiuyang Tang, Zhen Tan, Xuqian Huang


Abstract
Entity linking, a fundamental task in natural language processing, is crucial to knowledge fusion and to knowledge base construction and update. Nevertheless, while research on entity linking for English text has developed continuously, its Chinese counterpart is still in its infancy. One prominent issue is that publicly available annotated datasets and evaluation benchmarks are scarce and deficient. Specifically, existing Chinese corpora for entity linking were mainly constructed from noisy short texts, such as microblogs and news headlines, while long texts, which cover a wider spectrum of real-life scenarios, were largely overlooked. To address this issue, we build CLEEK, a Chinese corpus of multi-domain long text for entity linking, in order to encourage the advancement of entity linking in languages besides English. The corpus consists of 100 documents from diverse domains and is publicly accessible. Moreover, we devise a measure to evaluate the difficulty of documents with respect to entity linking, which is then used to characterize the corpus. Additionally, the results of two baselines and seven state-of-the-art solutions on CLEEK are reported and compared. The empirical results validate the usefulness of CLEEK and the effectiveness of the proposed difficulty measure.
Anthology ID:
2020.lrec-1.249
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2026–2035
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.249
Cite (ACL):
Weixin Zeng, Xiang Zhao, Jiuyang Tang, Zhen Tan, and Xuqian Huang. 2020. CLEEK: A Chinese Long-text Corpus for Entity Linking. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2026–2035, Marseille, France. European Language Resources Association.
Cite (Informal):
CLEEK: A Chinese Long-text Corpus for Entity Linking (Zeng et al., LREC 2020)
PDF:
https://preview.aclanthology.org/remove-xml-comments/2020.lrec-1.249.pdf