Implicit readability ranking using the latent variable of a Bayesian Probit model

Johan Falkenjack, Arne Jönsson


Abstract
Data-driven approaches to readability analysis for languages other than English have been plagued by a scarcity of suitable corpora. Often, relevant corpora consist only of easy-to-read texts with no rank information or empirical readability scores, making only binary approaches, such as classification, applicable. We propose a Bayesian latent variable approach to get the most out of these kinds of corpora. In this paper we present results on using such a model for readability ranking. The model is evaluated on a preliminary corpus of ranked student texts with encouraging results. We also assess the model by showing that it performs readability classification on par with a state-of-the-art classifier while at the same time being transparent enough to allow more sophisticated interpretations.
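The idea of reusing a probit model's latent variable for ranking can be illustrated with a minimal sketch. It relies only on the standard probit formulation, in which a binary label is 1 when a latent variable z = w·x + b + noise is positive, so that P(y = 1) = Φ(w·x + b). Everything specific here is an assumption made for illustration and is not taken from the paper: the two features, the fixed weight values (a Bayesian model would carry a posterior distribution over them), and the reading of y = 1 as "easy to read".

```python
import math


def latent_score(features, weights, bias=0.0):
    """Mean of the latent variable z = w.x + b, before the unit-variance noise."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def probit_probability(score):
    """P(y = 1) = Phi(score), where Phi is the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(score / math.sqrt(2.0)))


def rank_texts(texts, weights, bias=0.0):
    """Return text ids sorted from highest to lowest latent score."""
    scored = {name: latent_score(x, weights, bias) for name, x in texts.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical point estimates and feature vectors (assumed, not from the paper),
# e.g. scaled sentence length and scaled word length.
WEIGHTS = [-1.0, -0.5]
TEXTS = {"a": [0.2, 0.1], "b": [1.5, 1.0], "c": [0.8, 0.4]}

# Ranking uses the continuous latent score; classification thresholds it at 0.
RANKING = rank_texts(TEXTS, WEIGHTS, bias=1.0)
LABELS = {name: latent_score(x, WEIGHTS, 1.0) > 0 for name, x in TEXTS.items()}
```

The same score serves both purposes: thresholding it gives a binary classification, while its ordering gives an implicit ranking even though the training labels would only have been binary.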
Anthology ID:
W16-4112
Volume:
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)
Month:
December
Year:
2016
Address:
Osaka, Japan
Venue:
CL4LC
Publisher:
The COLING 2016 Organizing Committee
Pages:
104–112
URL:
https://aclanthology.org/W16-4112
Cite (ACL):
Johan Falkenjack and Arne Jönsson. 2016. Implicit readability ranking using the latent variable of a Bayesian Probit model. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), pages 104–112, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
Implicit readability ranking using the latent variable of a Bayesian Probit model (Falkenjack & Jönsson, CL4LC 2016)
PDF:
https://preview.aclanthology.org/ingestion-script-update/W16-4112.pdf