Maria Ferragud
2026
GOLEMcoref: A Multilingual Coreference Dataset of Fiction
Andreas Van Cranenburgh | Xiaoyan Yang | Alvanita | Cecilia Nicole Di Domenico | Maria Ferragud | Arianna Graciotti | Byungjun Kim | Seonyeong Park | Noa Visser Solissa | Xiaoyu Zhou | Federico Pianzola
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
We present a multilingual coreference dataset of 827k tokens of fiction in 7 languages: Bahasa Indonesia, Chinese, Dutch, English, Italian, Korean, and Spanish. The dataset includes full stories of diverse lengths, ranging from 500 to 17k words. We discuss our annotation scheme, which focuses on characters, and the language-specific challenges we encountered. Finally, we present evaluation results of a neural coreference system trained on our dataset. We show that jointly training a system across all languages provides a strong improvement over monolingually trained models. The dataset is available under a Creative Commons license in CoNLL-2012 and CorefUD format at https://github.com/GOLEM-lab/GOLEMcoref/