Simi Johnson


Weak Supervision using Linguistic Knowledge for Information Extraction
Sachin Pawar | Girish Palshikar | Ankita Jain | Jyoti Bhat | Simi Johnson
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

In this paper, we propose using linguistic knowledge to automatically augment a small manually annotated corpus into a large annotated corpus for training Information Extraction models. We propose a powerful pattern specification language for writing linguistic rules for entity extraction. We define an Enriched Text Format (ETF) that represents rich linguistic information about a text in the form of XML-like tags. Patterns written in this language are matched against the ETF text rather than the raw text to extract entity mentions. We demonstrate how an entity extraction system can be built quickly for a domain-specific entity type for which no annotated datasets are readily available.
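
The abstract does not give the ETF syntax or the pattern language, so the following is only a minimal sketch of the general idea: text is enriched with XML-like tags carrying linguistic information, and a rule is matched against the tagged text instead of the raw text. Everything here is an assumption made for illustration: the `<w pos="...">` tag layout, the Penn Treebank style part-of-speech labels, the "worked as" rule for a hypothetical job-role entity type, and the helper names `to_enriched` and `extract_mentions`. None of it is the authors' actual format or language.

```python
import html
import re

def to_enriched(tokens):
    """Render (word, pos) pairs as XML-like tagged text.

    Hypothetical stand-in for an enriched text format; the real ETF
    is not described in the abstract.
    """
    return " ".join(
        f'<w pos="{pos}">{html.escape(word)}</w>' for word, pos in tokens
    )

# Hypothetical rule: after the cue "worked as" and an optional determiner,
# capture a run of adjective/noun tokens as a job-role mention.
ROLE_RULE = re.compile(
    r'<w pos="VB[A-Z]*">worked</w> <w pos="IN">as</w> '
    r'(?:<w pos="DT">[^<]+</w> )?'
    r'((?:<w pos="(?:JJ|NN[A-Z]*)">[^<]+</w> ?)+)'
)

def extract_mentions(enriched):
    """Return the plain-text mentions matched by ROLE_RULE."""
    mentions = []
    for match in ROLE_RULE.finditer(enriched):
        # Pull the words back out of the captured run of tagged tokens.
        words = re.findall(r">([^<]+)</w>", match.group(1))
        mentions.append(" ".join(html.unescape(w) for w in words))
    return mentions

if __name__ == "__main__":
    sentence = [
        ("She", "PRP"), ("worked", "VBD"), ("as", "IN"), ("a", "DT"),
        ("senior", "JJ"), ("software", "NN"), ("engineer", "NN"),
        ("in", "IN"), ("Pune", "NNP"), (".", "."),
    ]
    print(extract_mentions(to_enriched(sentence)))
```

Because the rule refers to part-of-speech tags rather than specific words, the same rule would also match other adjective and noun sequences after the cue. That is the kind of generalization that could let a few rules label many unannotated sentences for weak supervision.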