SECTOR: A Neural Model for Coherent Topic Segmentation and Classification

Sebastian Arnold, Rudolf Schneider, Philippe Cudré-Mauroux, Felix A. Gers, Alexander Löser


Abstract
When searching for information, a human reader first glances over a document, spots relevant sections, and then focuses on a few sentences for resolving her intention. However, the high variance of document structure complicates the identification of the salient topic of a given section at a glance. To tackle this challenge, we present SECTOR, a model to support machine reading systems by segmenting documents into coherent sections and assigning topic labels to each section. Our deep neural network architecture learns a latent topic embedding over the course of a document. This can be leveraged to classify local topics from plain text and segment a document at topic shifts. In addition, we contribute WikiSection, a publicly available data set with 242k labeled sections in English and German from two distinct domains: diseases and cities. From our extensive evaluation of 20 architectures, we report a highest score of 71.6% F1 for the segmentation and classification of 30 topics from the English city domain, scored by our SECTOR long short-term memory model with Bloom filter embeddings and bidirectional segmentation. This is a significant improvement of 29.5 points F1 over state-of-the-art CNN classifiers with baseline segmentation.
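The idea of segmenting a document at topic shifts can be illustrated with a deliberately simplified sketch. It is not the SECTOR implementation: it assumes that some classifier has already produced one topic score vector per sentence, and it simply starts a new section whenever the top-scoring topic changes. The function name, the input layout, and the topic names are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence, Tuple


def segment_by_topic_shift(
    sentence_topic_scores: Sequence[Sequence[float]],
    topic_names: Sequence[str],
) -> List[Tuple[int, int, str]]:
    """Group consecutive sentences that share the same top-scoring topic.

    Returns (start, end_exclusive, topic) triples. This is a simplification:
    it assumes per-sentence scores are given and ignores any smoothing.
    """
    segments: List[Tuple[int, int, str]] = []
    start = 0
    current = None  # index of the topic of the segment being built
    for i, scores in enumerate(sentence_topic_scores):
        # Pick the index of the highest-scoring topic for this sentence.
        best = max(range(len(scores)), key=scores.__getitem__)
        if current is None:
            current = best
        elif best != current:
            # Topic shift: close the running segment and open a new one.
            segments.append((start, i, topic_names[current]))
            start, current = i, best
    if current is not None:
        segments.append((start, len(sentence_topic_scores), topic_names[current]))
    return segments


# Hypothetical scores for five sentences over two topics.
example_scores = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8], [0.6, 0.4]]
example_segments = segment_by_topic_shift(example_scores, ["symptoms", "treatment"])
```

Because every label change opens a new section, a sketch like this is sensitive to noisy per-sentence predictions; the abstract's latent topic embedding over the course of a document addresses the same problem with learned context rather than independent sentence decisions.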
Anthology ID:
Q19-1011
Volume:
Transactions of the Association for Computational Linguistics, Volume 7
Year:
2019
Address:
Cambridge, MA
Editors:
Lillian Lee, Mark Johnson, Brian Roark, Ani Nenkova
Venue:
TACL
Publisher:
MIT Press
Pages:
169–184
URL:
https://aclanthology.org/Q19-1011
DOI:
10.1162/tacl_a_00261
Cite (ACL):
Sebastian Arnold, Rudolf Schneider, Philippe Cudré-Mauroux, Felix A. Gers, and Alexander Löser. 2019. SECTOR: A Neural Model for Coherent Topic Segmentation and Classification. Transactions of the Association for Computational Linguistics, 7:169–184.
Cite (Informal):
SECTOR: A Neural Model for Coherent Topic Segmentation and Classification (Arnold et al., TACL 2019)
PDF:
https://preview.aclanthology.org/fix-dup-bibkey/Q19-1011.pdf
Presentation:
 Q19-1011.Presentation.pdf
Video:
 https://preview.aclanthology.org/fix-dup-bibkey/Q19-1011.mp4
Code
sebastianarnold/SECTOR
Data
WikiSection