NovelCR: A Large-Scale Bilingual Dataset Tailored for Long-Span Coreference Resolution

MeiHan Tong, Shuai Wang


Abstract
Coreference resolution (CR) aims to link pronouns, noun phrases, and other mentions to the entities they refer to, serving as an important step toward deep text understanding. Presently available CR datasets are either small in scale or restrict coreference resolution to a limited text span. In this paper, we present NovelCR, a large-scale bilingual benchmark designed for long-span coreference resolution. NovelCR features extensive annotations, including 148k mentions in NovelCR-en and 311k mentions in NovelCR-zh. Moreover, the dataset is notably rich in long-span coreference pairs: 85% of pairs in NovelCR-en and 83% in NovelCR-zh span three or more sentences. Experiments on NovelCR reveal a large gap between state-of-the-art baselines and human performance, highlighting that NovelCR remains an open challenge.
Anthology ID:
2025.findings-acl.268
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5161–5173
URL:
https://preview.aclanthology.org/landing_page/2025.findings-acl.268/
Cite (ACL):
MeiHan Tong and Shuai Wang. 2025. NovelCR: A Large-Scale Bilingual Dataset Tailored for Long-Span Coreference Resolution. In Findings of the Association for Computational Linguistics: ACL 2025, pages 5161–5173, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
NovelCR: A Large-Scale Bilingual Dataset Tailored for Long-Span Coreference Resolution (Tong & Wang, Findings 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.findings-acl.268.pdf