OpenCeres: When Open Information Extraction Meets the Semi-Structured Web

Colin Lockard, Prashant Shiralkar, Xin Luna Dong


Abstract
Open Information Extraction (OpenIE), the problem of harvesting triples from natural language text whose predicate relations are not aligned to any pre-defined ontology, has been a popular subject of research for the last decade. However, this research has largely ignored the vast quantity of facts available in semi-structured webpages. In this paper, we define the problem of OpenIE from semi-structured websites to extract such facts, and present an approach for solving it. We also introduce a labeled evaluation dataset to motivate research in this area. Given a semi-structured website and a set of seed facts for some relations existing on its pages, we employ a semi-supervised label propagation technique to automatically create training data for the relations present on the site. We then use this training data to learn a classifier for relation extraction. On our new benchmark dataset, this method achieves a precision of over 70%. A larger-scale extraction experiment on 31 websites in the movie vertical resulted in the extraction of over 2 million triples.
Anthology ID:
N19-1309
Volume:
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Month:
June
Year:
2019
Address:
Minneapolis, Minnesota
Editors:
Jill Burstein, Christy Doran, Thamar Solorio
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
3047–3056
URL:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/N19-1309/
DOI:
10.18653/v1/N19-1309
Cite (ACL):
Colin Lockard, Prashant Shiralkar, and Xin Luna Dong. 2019. OpenCeres: When Open Information Extraction Meets the Semi-Structured Web. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3047–3056, Minneapolis, Minnesota. Association for Computational Linguistics.
Cite (Informal):
OpenCeres: When Open Information Extraction Meets the Semi-Structured Web (Lockard et al., NAACL 2019)
PDF:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/N19-1309.pdf
Video:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/N19-1309.mp4