CEREC: A Corpus for Entity Resolution in Email Conversations

Parag Pravin Dakle, Dan Moldovan


Abstract
We present the first large-scale corpus for entity resolution in email conversations (CEREC). The corpus consists of 6,001 email threads from the Enron Email Corpus, containing 36,448 email messages and 38,996 entity coreference chains. The annotation is carried out as a two-step process with minimal manual effort. Experiments evaluate different features and the performance of four baselines on the created corpus. For the task of mention identification and coreference resolution, a best performance of 54.1 F1 is reported, highlighting the room for improvement. An in-depth qualitative and quantitative error analysis is presented to understand the limitations of the baselines considered.
Anthology ID:
2020.coling-main.30
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
339–349
URL:
https://aclanthology.org/2020.coling-main.30
DOI:
10.18653/v1/2020.coling-main.30
Cite (ACL):
Parag Pravin Dakle and Dan Moldovan. 2020. CEREC: A Corpus for Entity Resolution in Email Conversations. In Proceedings of the 28th International Conference on Computational Linguistics, pages 339–349, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
CEREC: A Corpus for Entity Resolution in Email Conversations (Dakle & Moldovan, COLING 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.coling-main.30.pdf
Code:
paragdakle/emailcoref
Data:
CEREC, CoNLL-2012