Abstract
Data-driven approaches to readability analysis for languages other than English have been plagued by a scarcity of suitable corpora. Often, the relevant corpora consist only of easy-to-read texts with no rank information or empirical readability scores, so that only binary approaches, such as classification, are applicable. We propose a Bayesian latent variable approach to get the most out of these kinds of corpora. In this paper we present results on using such a model for readability ranking. The model is evaluated on a preliminary corpus of ranked student texts, with encouraging results. We also assess the model by showing that it performs readability classification on par with a state-of-the-art classifier while at the same time being transparent enough to allow more sophisticated interpretations.
- Anthology ID:
- W16-4112
- Volume:
- Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)
- Month:
- December
- Year:
- 2016
- Address:
- Osaka, Japan
- Venue:
- CL4LC
- Publisher:
- The COLING 2016 Organizing Committee
- Pages:
- 104–112
- URL:
- https://aclanthology.org/W16-4112
- Cite (ACL):
- Johan Falkenjack and Arne Jönsson. 2016. Implicit readability ranking using the latent variable of a Bayesian Probit model. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), pages 104–112, Osaka, Japan. The COLING 2016 Organizing Committee.
- Cite (Informal):
- Implicit readability ranking using the latent variable of a Bayesian Probit model (Falkenjack & Jönsson, CL4LC 2016)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/W16-4112.pdf