Regular Expression Guided Entity Mention Mining from Noisy Web Data

Shanshan Zhang, Lihong He, Slobodan Vucetic, Eduard Dragut

Abstract
Many important entity types in web documents, such as dates, times, email addresses, and course numbers, follow or closely resemble patterns that can be described by Regular Expressions (REs). Due to the vast diversity of web documents and the ways in which they are generated, even seemingly straightforward tasks such as identifying mentions of dates in a document become very challenging. It is reasonable to claim that it is impossible to create an RE that is capable of identifying such entities from web documents with perfect precision and recall. Rather than abandoning REs as a go-to approach for entity detection, this paper explores ways to combine the expressive power of REs, the ability of deep learning to learn from large data, and a human-in-the-loop approach into a new integrated framework for entity identification from web data. The framework starts by creating new REs or collecting existing REs for a particular entity type. Those REs are then run over a large document corpus to collect weak labels for the entity mentions, and a neural network is trained to predict those RE-generated weak labels. Finally, a human expert is asked to label a small set of documents, and the neural network is fine-tuned on those documents. The experimental evaluation on several entity identification problems shows that the proposed framework achieves impressive accuracy while requiring very modest human effort.
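The weak-labeling step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the date pattern `DATE_RE`, the whitespace tokenization, the binary per-token labels, and the helper name `weak_label` are all hypothetical.

```python
import re

# Hypothetical, deliberately imperfect date pattern (an assumption, not the paper's RE).
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def weak_label(text):
    """Return (token, label) pairs; label is 1 if the token overlaps an RE match."""
    spans = [m.span() for m in DATE_RE.finditer(text)]
    labeled = []
    # Whitespace tokenization is an assumption made to keep the sketch short.
    for tok in re.finditer(r"\S+", text):
        hit = any(tok.start() < end and start < tok.end() for start, end in spans)
        labeled.append((tok.group(), int(hit)))
    return labeled


example = "Posted on 10/31/2018 and updated Oct 31, 2018."
labels = weak_label(example)
for token, label in labels:
    print(token, label)
```

In this sketch the pattern labels `10/31/2018` as a date but misses `Oct 31, 2018.`, which is the kind of recall gap the abstract attributes to REs on web data. In the framework as described, labels of this sort would serve as training targets for a neural network, which is then fine-tuned on a small set of documents labeled by a human expert.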
Anthology ID:
D18-1224
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Editors:
Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
1991–2000
URL:
https://aclanthology.org/D18-1224
DOI:
10.18653/v1/D18-1224
Cite (ACL):
Shanshan Zhang, Lihong He, Slobodan Vucetic, and Eduard Dragut. 2018. Regular Expression Guided Entity Mention Mining from Noisy Web Data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1991–2000, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Regular Expression Guided Entity Mention Mining from Noisy Web Data (Zhang et al., EMNLP 2018)
PDF:
https://preview.aclanthology.org/teach-a-man-to-fish/D18-1224.pdf
Attachment:
 D18-1224.Attachment.zip