James Wendt


2023

pdf
Selective Labeling: How to Radically Lower Data-Labeling Costs for Document Extraction Models
Yichao Zhou | James Wendt | Navneet Potti | Jing Xie | Sandeep Tata
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Building automatic extraction models for visually rich documents like invoices, receipts, bills, tax forms, etc. has received significant attention lately. A key bottleneck in developing extraction models for new document types is the cost of acquiring the several thousand high-quality labeled documents that are needed to train a model with acceptable accuracy. In this paper, we propose selective labeling as a solution to this problem. The key insight is to simplify the labeling task to provide “yes/no” labels for candidate extractions predicted by a model trained on partially labeled documents. We combine this with a custom active learning strategy to find the predictions that the model is most uncertain about. We show through experiments on document types drawn from 3 different domains that selective labeling can reduce the cost of acquiring labeled data by 10× with a negligible loss in accuracy.