MCECR: A Novel Dataset for Multilingual Cross-Document Event Coreference Resolution
Amir Pouran Ben Veyseh, Viet Lai, Chien Nguyen, Franck Dernoncourt, Thien Nguyen
Abstract
Event coreference resolution (ECR) is a critical information extraction task in natural language processing that aims to identify and link event mentions across multiple documents. Despite recent progress, existing ECR datasets focus primarily on within-document event coreference and on English text; cross-document ECR datasets for languages other than English are lacking. To address this gap, this work presents the first multilingual dataset for cross-document ECR, called MCECR (Multilingual Cross-Document Event Coreference Resolution), which manually annotates a diverse collection of documents for event mentions and coreference in five languages: English, Spanish, Hindi, Turkish, and Ukrainian. Using articles sampled from Wikinews across various topics as seeds, we retrieve related news articles through the Google search engine to increase the number of non-singleton event clusters. In total, we annotate 5,802 news articles, providing a substantial and varied dataset for multilingual ECR in both within-document and cross-document scenarios. Extensive analysis of the proposed dataset reveals the challenging nature of multilingual event coreference resolution, positioning MCECR as a strong benchmark for future research in this area.
- Anthology ID: 2024.findings-naacl.245
- Volume: Findings of the Association for Computational Linguistics: NAACL 2024
- Month: June
- Year: 2024
- Address: Mexico City, Mexico
- Editors: Kevin Duh, Helena Gomez, Steven Bethard
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 3869–3880
- URL: https://aclanthology.org/2024.findings-naacl.245
- DOI: 10.18653/v1/2024.findings-naacl.245
- Cite (ACL): Amir Pouran Ben Veyseh, Viet Lai, Chien Nguyen, Franck Dernoncourt, and Thien Nguyen. 2024. MCECR: A Novel Dataset for Multilingual Cross-Document Event Coreference Resolution. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3869–3880, Mexico City, Mexico. Association for Computational Linguistics.
- Cite (Informal): MCECR: A Novel Dataset for Multilingual Cross-Document Event Coreference Resolution (Pouran Ben Veyseh et al., Findings 2024)
- PDF: https://preview.aclanthology.org/ingest-bitext-workshop/2024.findings-naacl.245.pdf