Abstract
Books are typically segmented into chapters and sections, representing coherent sub-narratives and topics. We investigate the task of predicting chapter boundaries, as a proxy for the general task of segmenting long texts. We build a Project Gutenberg chapter segmentation data set of 9,126 English novels, using a hybrid approach combining neural inference and rule matching to recognize chapter title headers in books, achieving an F1-score of 0.77 on this task. Using this annotated data as ground truth after removing structural cues, we present cut-based and neural methods for chapter segmentation, achieving an F1-score of 0.453 on the challenging task of exact break prediction over book-length documents. Finally, we reveal interesting historical trends in the chapter structure of novels.
- Anthology ID:
- 2020.emnlp-main.672
- Volume:
- Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
- Month:
- November
- Year:
- 2020
- Address:
- Online
- Editors:
- Bonnie Webber, Trevor Cohn, Yulan He, Yang Liu
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 8373–8383
- URL:
- https://aclanthology.org/2020.emnlp-main.672
- DOI:
- 10.18653/v1/2020.emnlp-main.672
- Cite (ACL):
- Charuta Pethe, Allen Kim, and Steve Skiena. 2020. Chapter Captor: Text Segmentation in Novels. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8373–8383, Online. Association for Computational Linguistics.
- Cite (Informal):
- Chapter Captor: Text Segmentation in Novels (Pethe et al., EMNLP 2020)
- PDF:
- https://preview.aclanthology.org/improve-issue-templates/2020.emnlp-main.672.pdf
- Code:
- cpethe/chapter-captor