An Empirical Study of the Occurrence and Co-Occurrence of Named Entities in Natural Language Corpora
K Saravanan, Monojit Choudhury, Raghavendra Udupa, A Kumaran
Abstract
Named Entities (NEs) that occur in natural language text are important, especially given the advent of social media, and they play a critical role in the development of many natural language technologies. In this paper, we systematically analyze the patterns of occurrence and co-occurrence of NEs in standard large English news corpora, providing valuable insight into the corpora and paving the way for the development of technologies that rely critically on handling NEs. We use two distinct approaches: normal statistical analysis that measures and reports the occurrence patterns of NEs in terms of frequency, growth, etc., and a complex-network-based analysis that measures the co-occurrence patterns in terms of connectivity, degree distribution, small-world phenomenon, etc. Our analysis indicates that: (i) NEs form an open set in corpora and grow linearly, (ii) NEs divide into a kernel and a large periphery, with the peripheral NEs occurring rarely, and (iii) there is strong evidence of a small-world phenomenon. Our findings may suggest effective ways to construct NE lexicons to aid efficient development of several natural language technologies.
- Anthology ID:
- L12-1139
- Volume:
- Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
- Month:
- May
- Year:
- 2012
- Address:
- Istanbul, Turkey
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 3118–3125
- URL:
- http://www.lrec-conf.org/proceedings/lrec2012/pdf/305_Paper.pdf
- Cite (ACL):
- K Saravanan, Monojit Choudhury, Raghavendra Udupa, and A Kumaran. 2012. An Empirical Study of the Occurrence and Co-Occurrence of Named Entities in Natural Language Corpora. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3118–3125, Istanbul, Turkey. European Language Resources Association (ELRA).
- Cite (Informal):
- An Empirical Study of the Occurrence and Co-Occurrence of Named Entities in Natural Language Corpora (Saravanan et al., LREC 2012)