Selective Labeling: How to Radically Lower Data-Labeling Costs for Document Extraction Models
Yichao Zhou, James Bradley Wendt, Navneet Potti, Jing Xie, Sandeep Tata
Abstract
Building automatic extraction models for visually rich documents like invoices, receipts, bills, tax forms, etc. has received significant attention lately. A key bottleneck in developing extraction models for new document types is the cost of acquiring the several thousand high-quality labeled documents that are needed to train a model with acceptable accuracy. In this paper, we propose selective labeling as a solution to this problem. The key insight is to simplify the labeling task to provide “yes/no” labels for candidate extractions predicted by a model trained on partially labeled documents. We combine this with a custom active learning strategy to find the predictions that the model is most uncertain about. We show through experiments on document types drawn from 3 different domains that selective labeling can reduce the cost of acquiring labeled data by 10× with a negligible loss in accuracy.
- Anthology ID: 2023.emnlp-main.233
- Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
- Month: December
- Year: 2023
- Address: Singapore
- Editors: Houda Bouamor, Juan Pino, Kalika Bali
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 3847–3860
- URL: https://aclanthology.org/2023.emnlp-main.233
- DOI: 10.18653/v1/2023.emnlp-main.233
- Cite (ACL): Yichao Zhou, James Bradley Wendt, Navneet Potti, Jing Xie, and Sandeep Tata. 2023. Selective Labeling: How to Radically Lower Data-Labeling Costs for Document Extraction Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3847–3860, Singapore. Association for Computational Linguistics.
- Cite (Informal): Selective Labeling: How to Radically Lower Data-Labeling Costs for Document Extraction Models (Zhou et al., EMNLP 2023)
- PDF: https://preview.aclanthology.org/fix-volume-bibkeys/2023.emnlp-main.233.pdf