Representation Learning for Information Extraction from Form-like Documents

Bodhisattwa Prasad Majumder, Navneet Potti, Sandeep Tata, James Bradley Wendt, Qi Zhao, Marc Najork


Abstract
We propose a novel approach using representation learning for tackling the problem of extracting structured information from form-like document images. Our extraction system uses knowledge of the types of the target fields to generate extraction candidates, and a neural network architecture learns a dense representation of each candidate based on neighboring words in the document. These learned representations are not only useful in solving the extraction task for unseen document templates from two different domains, but are also interpretable, as we show using loss cases.
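To make the abstract's core idea concrete, the sketch below shows one plausible way a candidate could be represented by embedding its neighboring words and their relative positions, then scored against a target field type. This is an illustrative approximation under assumed layer sizes and names (CandidateScorer, pos_proj, field_embed are all hypothetical), not the authors' exact architecture.

```python
# Minimal sketch (assumptions, not the paper's exact model): embed the words
# surrounding an extraction candidate together with their relative positions,
# pool them into a dense candidate representation, and score that
# representation against an embedding of the target field type.
import torch
import torch.nn as nn


class CandidateScorer(nn.Module):
    def __init__(self, vocab_size, num_field_types, embed_dim=64):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.field_embed = nn.Embedding(num_field_types, embed_dim)
        # Relative (x, y) offset of each neighboring word from the candidate.
        self.pos_proj = nn.Linear(2, embed_dim)
        self.candidate_proj = nn.Linear(2 * embed_dim, embed_dim)
        self.scorer = nn.Linear(2 * embed_dim, 1)

    def forward(self, neighbor_ids, neighbor_offsets, field_type):
        # neighbor_ids: (batch, num_neighbors) word ids of nearby tokens
        # neighbor_offsets: (batch, num_neighbors, 2) relative positions
        # field_type: (batch,) id of the target field (e.g. an invoice date)
        words = self.word_embed(neighbor_ids)                 # (B, N, D)
        pos = self.pos_proj(neighbor_offsets)                 # (B, N, D)
        neighbors = torch.cat([words, pos], dim=-1)           # (B, N, 2D)
        pooled = neighbors.mean(dim=1)                        # simple pooling over neighbors
        candidate_repr = torch.relu(self.candidate_proj(pooled))  # dense candidate embedding
        field = self.field_embed(field_type)                  # (B, D)
        score = self.scorer(torch.cat([candidate_repr, field], dim=-1))
        return candidate_repr, torch.sigmoid(score).squeeze(-1)


# Toy usage with random inputs.
model = CandidateScorer(vocab_size=1000, num_field_types=5)
ids = torch.randint(0, 1000, (2, 8))
offsets = torch.randn(2, 8, 2)
field = torch.tensor([0, 3])
repr_, score = model(ids, offsets, field)
print(repr_.shape, score.shape)  # torch.Size([2, 64]) torch.Size([2])
```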
Anthology ID:
2020.acl-main.580
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Editors:
Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
6495–6504
URL:
https://aclanthology.org/2020.acl-main.580
DOI:
10.18653/v1/2020.acl-main.580
Cite (ACL):
Bodhisattwa Prasad Majumder, Navneet Potti, Sandeep Tata, James Bradley Wendt, Qi Zhao, and Marc Najork. 2020. Representation Learning for Information Extraction from Form-like Documents. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6495–6504, Online. Association for Computational Linguistics.
Cite (Informal):
Representation Learning for Information Extraction from Form-like Documents (Majumder et al., ACL 2020)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2020.acl-main.580.pdf
Video:
http://slideslive.com/38929320