Abstract
Open Information Extraction (OpenIE), the problem of harvesting triples from natural language text whose predicate relations are not aligned to any pre-defined ontology, has been a popular subject of research for the last decade. However, this research has largely ignored the vast quantity of facts available in semi-structured webpages. In this paper, we define the problem of OpenIE from semi-structured websites to extract such facts, and present an approach for solving it. We also introduce a labeled evaluation dataset to motivate research in this area. Given a semi-structured website and a set of seed facts for some relations existing on its pages, we employ a semi-supervised label propagation technique to automatically create training data for the relations present on the site. We then use this training data to learn a classifier for relation extraction. Experimental results of this method on our new benchmark dataset obtained a precision of over 70%. A larger scale extraction experiment on 31 websites in the movie vertical resulted in the extraction of over 2 million triples.- Anthology ID:
- N19-1309
- Volume:
- Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
- Month:
- June
- Year:
- 2019
- Address:
- Minneapolis, Minnesota
- Editors:
- Jill Burstein, Christy Doran, Thamar Solorio
- Venue:
- NAACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 3047–3056
- Language:
- URL:
- https://aclanthology.org/N19-1309
- DOI:
- 10.18653/v1/N19-1309
- Cite (ACL):
- Colin Lockard, Prashant Shiralkar, and Xin Luna Dong. 2019. OpenCeres: When Open Information Extraction Meets the Semi-Structured Web. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3047–3056, Minneapolis, Minnesota. Association for Computational Linguistics.
- Cite (Informal):
- OpenCeres: When Open Information Extraction Meets the Semi-Structured Web (Lockard et al., NAACL 2019)
- PDF:
- https://preview.aclanthology.org/ingest-acl-2023-videos/N19-1309.pdf