Field Extraction from Forms with Unlabeled Data
Mingfei Gao, Zeyuan Chen, Nikhil Naik, Kazuma Hashimoto, Caiming Xiong, Ran Xu
Abstract
We propose a novel framework to conduct field extraction from forms with unlabeled data. To bootstrap the training process, we develop a rule-based method for mining noisy pseudo-labels from unlabeled forms. Using the supervisory signal from the pseudo-labels, we extract a discriminative token representation from a transformer-based model by modeling the interaction between text in the form. To prevent the model from overfitting to label noise, we introduce a refinement module based on a progressive pseudo-label ensemble. Experimental results demonstrate the effectiveness of our framework.
- Anthology ID:
- 2022.spanlp-1.4
- Volume:
- Proceedings of the 1st Workshop on Semiparametric Methods in NLP: Decoupling Logic from Knowledge
- Month:
- May
- Year:
- 2022
- Address:
- Dublin, Ireland and Online
- Editors:
- Rajarshi Das, Patrick Lewis, Sewon Min, June Thai, Manzil Zaheer
- Venue:
- SpaNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 30–40
- URL:
- https://aclanthology.org/2022.spanlp-1.4
- DOI:
- 10.18653/v1/2022.spanlp-1.4
- Cite (ACL):
- Mingfei Gao, Zeyuan Chen, Nikhil Naik, Kazuma Hashimoto, Caiming Xiong, and Ran Xu. 2022. Field Extraction from Forms with Unlabeled Data. In Proceedings of the 1st Workshop on Semiparametric Methods in NLP: Decoupling Logic from Knowledge, pages 30–40, Dublin, Ireland and Online. Association for Computational Linguistics.
- Cite (Informal):
- Field Extraction from Forms with Unlabeled Data (Gao et al., SpaNLP 2022)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-1/2022.spanlp-1.4.pdf
- Code:
- salesforce/inv-cdip
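
The abstract mentions a rule-based method for mining noisy pseudo-labels from unlabeled forms but does not spell out the rules. The sketch below is only one plausible illustration of the idea, not the authors' method: it assumes OCR tokens with box coordinates, a known key phrase for the field, and a regular expression describing the value's data type, and it picks the matching token closest to the key. The names `Token`, `mine_pseudo_label`, `key_phrase`, and `value_pattern` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """One OCR token; x and y are the top-left corner of its box (assumed layout)."""
    text: str
    x: float
    y: float


def mine_pseudo_label(tokens: List[Token], key_phrase: str,
                      value_pattern: str) -> Optional[Token]:
    """Return the token nearest to the key whose text matches value_pattern.

    Hypothetical rule: locate the first token equal to key_phrase
    (case-insensitive), then choose the closest token, by squared Euclidean
    distance between box corners, whose whole text matches the regex.
    Returns None when the key or a matching value is missing.
    """
    keys = [t for t in tokens if t.text.lower() == key_phrase.lower()]
    if not keys:
        return None
    key = keys[0]
    pattern = re.compile(value_pattern)
    candidates = [t for t in tokens
                  if t is not key and pattern.fullmatch(t.text)]
    if not candidates:
        return None
    return min(candidates,
               key=lambda t: (t.x - key.x) ** 2 + (t.y - key.y) ** 2)


# Toy form: two amounts, only one of which sits next to the "Total" key.
form = [
    Token("Date", 10, 20),
    Token("19.99", 60, 20),
    Token("Total", 10, 100),
    Token("42.50", 60, 100),
]
label = mine_pseudo_label(form, "total", r"\d+\.\d{2}")
```

Labels mined this way are noisy by construction (the nearest matching token is not always the right one), which is the situation the refinement module in the abstract is meant to handle.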