Abstract
We introduce the task of book structure labeling: segmenting and assigning a fixed category (such as Table of Contents, Preface, Index) to the document structure of printed books. We manually annotate the page-level structural categories for a large dataset totaling 294,816 pages in 1,055 books evenly sampled from 1750-1922, and present empirical results comparing the performance of several classes of models. The best-performing model, a bidirectional LSTM with rich features, achieves an overall accuracy of 95.8 and a class-balanced macro F-score of 71.4.
- Anthology ID:
- D17-1077
- Volume:
- Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
- Month:
- September
- Year:
- 2017
- Address:
- Copenhagen, Denmark
- Venue:
- EMNLP
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 737–747
- URL:
- https://aclanthology.org/D17-1077
- DOI:
- 10.18653/v1/D17-1077
- Cite (ACL):
- Lara McConnaughey, Jennifer Dai, and David Bamman. 2017. The Labeled Segmentation of Printed Books. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 737–747, Copenhagen, Denmark. Association for Computational Linguistics.
- Cite (Informal):
- The Labeled Segmentation of Printed Books (McConnaughey et al., EMNLP 2017)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/D17-1077.pdf