OpenCeres: When Open Information Extraction Meets the Semi-Structured Web

Colin Lockard, Prashant Shiralkar, Xin Luna Dong

[How to correct problems with metadata yourself]


Abstract
Open Information Extraction (OpenIE), the problem of harvesting triples from natural language text whose predicate relations are not aligned to any pre-defined ontology, has been a popular subject of research for the last decade. However, this research has largely ignored the vast quantity of facts available in semi-structured webpages. In this paper, we define the problem of OpenIE from semi-structured websites to extract such facts, and present an approach for solving it. We also introduce a labeled evaluation dataset to motivate research in this area. Given a semi-structured website and a set of seed facts for some relations existing on its pages, we employ a semi-supervised label propagation technique to automatically create training data for the relations present on the site. We then use this training data to learn a classifier for relation extraction. Experimental results of this method on our new benchmark dataset obtained a precision of over 70%. A larger scale extraction experiment on 31 websites in the movie vertical resulted in the extraction of over 2 million triples.
Anthology ID:
N19-1309
Volume:
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Month:
June
Year:
2019
Address:
Minneapolis, Minnesota
Editors:
Jill Burstein, Christy Doran, Thamar Solorio
Venue:
NAACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
3047–3056
Language:
URL:
https://aclanthology.org/N19-1309
DOI:
10.18653/v1/N19-1309
Bibkey:
Cite (ACL):
Colin Lockard, Prashant Shiralkar, and Xin Luna Dong. 2019. OpenCeres: When Open Information Extraction Meets the Semi-Structured Web. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3047–3056, Minneapolis, Minnesota. Association for Computational Linguistics.
Cite (Informal):
OpenCeres: When Open Information Extraction Meets the Semi-Structured Web (Lockard et al., NAACL 2019)
Copy Citation:
PDF:
https://preview.aclanthology.org/teach-a-man-to-fish/N19-1309.pdf
Video:
 https://preview.aclanthology.org/teach-a-man-to-fish/N19-1309.mp4