ANTS: A Framework for Retrieval of Text Segments in Unstructured Documents
Brian Chivers, Mason P. Jiang, Wonhee Lee, Amy Ng, Natalya I. Rapstine, Alex Storer
Abstract
Text segmentation and extraction from unstructured documents can provide business researchers with a wealth of new information on firms and their behaviors. However, the most valuable text is often difficult to extract consistently due to substantial variations in how content can appear from document to document. Thus, the most successful way to extract this content has been through costly crowdsourcing and training of manual workers. We propose the Assisted Neural Text Segmentation (ANTS) framework to identify pertinent text in unstructured documents from a small set of labeled examples. ANTS leverages deep learning and transfer learning architectures to empower researchers to identify relevant text with minimal manual coding. Using a real-world sample of accounting documents, we identify targeted sections 96% of the time using only 5 training examples.
- Anthology ID:
- 2022.deeplo-1.5
- Volume:
- Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing
- Month:
- July
- Year:
- 2022
- Address:
- Hybrid
- Venue:
- DeepLo
- Publisher:
- Association for Computational Linguistics
- Pages:
- 38–47
- URL:
- https://aclanthology.org/2022.deeplo-1.5
- DOI:
- 10.18653/v1/2022.deeplo-1.5
- Cite (ACL):
- Brian Chivers, Mason P. Jiang, Wonhee Lee, Amy Ng, Natalya I. Rapstine, and Alex Storer. 2022. ANTS: A Framework for Retrieval of Text Segments in Unstructured Documents. In Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pages 38–47, Hybrid. Association for Computational Linguistics.
- Cite (Informal):
- ANTS: A Framework for Retrieval of Text Segments in Unstructured Documents (Chivers et al., DeepLo 2022)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2022.deeplo-1.5.pdf