Abstract
Document zone identification aims to automatically classify sequences of text-spans (e.g. sentences) within a document into predefined zone categories. Current approaches to document zone identification mostly rely on supervised machine learning methods, which require a large amount of annotated data, which is often difficult and expensive to obtain. In order to overcome this bottleneck, we propose graphical models based on the popular Latent Dirichlet Allocation (LDA) model. The first model, which we call zoneLDA aims to cluster the sentences into zone classes using only unlabelled data. We also study an extension of zoneLDA called zoneLDAb, which makes distinction between common words and non-common words within the different zone types. We present results on two different domains: the scientific domain and the technical domain. For the latter one we propose a new document zone classification schema, which has been annotated over a collection of 689 documents, achieving a Kappa score of 85%. Overall our experiments show promising results for both of the domains, outperforming the baseline model. Furthermore, on the technical domain the performance of the models are comparable to the supervised approach using the same feature sets. We thus believe that graphical models are a promising avenue of research for automatic document zoning.- Anthology ID:
- L12-1523
- Volume:
- Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
- Month:
- May
- Year:
- 2012
- Address:
- Istanbul, Turkey
- Venue:
- LREC
- SIG:
- Publisher:
- European Language Resources Association (ELRA)
- Note:
- Pages:
- 1610–1617
- Language:
- URL:
- http://www.lrec-conf.org/proceedings/lrec2012/pdf/881_Paper.pdf
- DOI:
- Cite (ACL):
- Andrea Varga, Daniel Preoţiuc-Pietro, and Fabio Ciravegna. 2012. Unsupervised document zone identification using probabilistic graphical models. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 1610–1617, Istanbul, Turkey. European Language Resources Association (ELRA).
- Cite (Informal):
- Unsupervised document zone identification using probabilistic graphical models (Varga et al., LREC 2012)
- PDF:
- http://www.lrec-conf.org/proceedings/lrec2012/pdf/881_Paper.pdf