Annotating the Enron Email Corpus with Number Senses

Stuart Moore, Sabine Buchholz, Anna Korhonen


Abstract
The Enron Email Corpus provides ``Real World'' text in the business email domain, which is a target domain for many speech and language applications. We present a section of this corpus annotated with number senses - labelling each number as a date, time, year, telephone number etc. We show that sense categories and their frequencies are very different in this domain than in newswire text. The annotated corpus can provide valuable material for the development of number sense disambiguation techniques. We have released the annotations into the public domain, to allow other researchers to perform comparisons.
Anthology ID:
L10-1444
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/653_Paper.pdf
DOI:
Bibkey:
Cite (ACL):
Stuart Moore, Sabine Buchholz, and Anna Korhonen. 2010. Annotating the Enron Email Corpus with Number Senses. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
Annotating the Enron Email Corpus with Number Senses (Moore et al., LREC 2010)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/653_Paper.pdf