Piecing Together Cross-Document Coreference Resolution Datasets: Systematic Dataset Analysis and Unification

Anastasia Zhukova; Terry Ruas; Jan Philip Wahle; Bela Gipp

Abstract

Work in Natural Language Understanding increasingly relies on the ability to identify and track entities and events across large, heterogeneous text collections. This task, known as cross-document coreference resolution (CDCR), has a wide range of downstream applications, including multi-document summarization, information retrieval, and knowledge base population. Research in this area remains fragmented due to heterogeneous dataset formats, varying annotation standards, and the predominant framing of CDCR as event coreference resolution (ECR). To address these challenges, we introduce uCDCR, a unified dataset that consolidates diverse publicly available English CDCR corpora across various domains into a consistent format, which we analyze with standardized metrics and evaluation protocols. uCDCR incorporates both entity and event coreference, corrects known inconsistencies, and enriches datasets with missing attributes to facilitate reproducible research. We establish a cohesive framework for fair, interpretable, cross-dataset analysis in CDCR. Within it, we compare the datasets on their lexical properties (the lexical composition of the annotated mentions, and lexical diversity and ambiguity metrics), discuss the annotation rules and principles that lead to high lexical diversity, and examine how these metrics influence performance on the same-head-lemma baseline. Our dataset analysis shows that ECB+, the state-of-the-art benchmark for CDCR, has one of the lowest lexical diversities, and its CDCR complexity, measured by the same-head-lemma baseline, lies in the middle among all uCDCR datasets. Moreover, comparing document and mention distributions between ECB+ and uCDCR shows that using all uCDCR datasets for model training and evaluation will improve the generalizability of CDCR models. Finally, the almost identical performance on the same-head-lemma baseline, separately applied to events and entities, shows that resolving both types is a complex task and should not be steered toward ECR alone. The uCDCR dataset is available at https://huggingface.co/datasets/AnZhu/uCDCR, and the code for parsing, analyzing, and scoring the dataset is available at https://github.com/anastasia-zhukova/uCDCR.
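
The abstract refers to a same-head-lemma baseline without spelling it out. As a rough illustration only, the sketch below groups mentions whose head lemmas match, which is the general idea behind such a baseline. The field names `mention_id` and `head_lemma` and the toy mentions are assumptions made for this example; they are not taken from the uCDCR schema or from the authors' code.

```python
from collections import defaultdict


def same_head_lemma_clusters(mentions):
    """Group mention ids by their lowercased head lemma.

    `mentions` is a list of dicts with the hypothetical keys
    "mention_id" and "head_lemma" (assumed for this sketch).
    """
    clusters = defaultdict(list)
    for mention in mentions:
        # Mentions sharing a head lemma are placed in the same cluster.
        key = mention["head_lemma"].lower()
        clusters[key].append(mention["mention_id"])
    return dict(clusters)


# Toy input, invented for illustration.
mentions = [
    {"mention_id": "d1_m1", "head_lemma": "attack"},
    {"mention_id": "d2_m4", "head_lemma": "Attack"},
    {"mention_id": "d2_m7", "head_lemma": "president"},
]
clusters = same_head_lemma_clusters(mentions)
```

Variants of this baseline may additionally restrict matches to the same topic or to the same mention type (event or entity); neither restriction is modeled here.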

Anthology ID: 2026.lrec-main.328
Volume: Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month: May
Year: 2026
Address: Palma de Mallorca, Spain
Editors: Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue: LREC
Publisher: ELRA Language Resource Association
Pages: 4152–4172
URL: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.328/
Cite (ACL): Anastasia Zhukova, Terry Lima Ruas, Jan Philip Wahle, and Bela Gipp. 2026. Piecing Together Cross-Document Coreference Resolution Datasets: Systematic Dataset Analysis and Unification. International Conference on Language Resources and Evaluation, main:4152–4172.
Cite (Informal): Piecing Together Cross-Document Coreference Resolution Datasets: Systematic Dataset Analysis and Unification (Zhukova et al., LREC 2026)
PDF: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.328.pdf
