Out in the Open: Finding and Categorising Errors in the Lexical Simplification Pipeline

Matthew Shardlow


Abstract
Lexical simplification is the task of automatically reducing the complexity of a text by identifying difficult words and replacing them with simpler alternatives. Whilst this is a valuable application of natural language generation, rudimentary lexical simplification systems suffer from a high error rate, which often results in nonsensical, non-simple text. This paper seeks to characterise and quantify the errors which occur in a typical baseline lexical simplification system. We identify six distinct categories of error and propose a classification scheme for them. We also quantify these errors on a moderately sized corpus, showing the magnitude of each error type. We find that of 183 identified simplification instances, only 19 (10.38%) result in a valid simplification, with the rest causing errors of varying severity.
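As an illustration of the task described in the abstract, the following is a minimal sketch of an identify-and-replace loop, together with the arithmetic behind the reported success rate. It is not the system evaluated in the paper: the frequency list, synonym table, threshold, and function names are all hypothetical and chosen only for illustration.

```python
# Toy resources: every value below is an illustrative assumption, not data from the paper.
WORD_FREQUENCY = {"use": 900, "utilise": 12, "help": 800, "facilitate": 9}
SYNONYMS = {"utilise": ["use"], "facilitate": ["help"]}
FREQUENCY_THRESHOLD = 50  # hypothetical cut-off below which a word counts as difficult


def is_complex(word):
    """Treat a word as difficult if it is rare (or unseen) in the toy frequency list."""
    return WORD_FREQUENCY.get(word, 0) < FREQUENCY_THRESHOLD


def simplify(sentence):
    """Replace each difficult word with its most frequent known synonym, if one exists."""
    output = []
    for token in sentence.split():
        word = token.lower()
        candidates = SYNONYMS.get(word, [])
        if is_complex(word) and candidates:
            # Pick the candidate with the highest toy frequency.
            output.append(max(candidates, key=lambda w: WORD_FREQUENCY.get(w, 0)))
        else:
            # Leave the token alone when it is common or has no known substitute.
            output.append(token)
    return " ".join(output)


def valid_rate(valid, total):
    """Percentage of simplification instances judged valid, to two decimal places."""
    return round(100 * valid / total, 2)


print(simplify("we utilise tools"))
print(valid_rate(19, 183))
```

Each stage of such a loop (flagging a word as difficult, finding candidates, choosing a replacement) is a point where an error can enter, which is why even a small sketch like this one can produce wrong substitutions on real text.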
Anthology ID:
L14-1403
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1583–1590
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/479_Paper.pdf
Cite (ACL):
Matthew Shardlow. 2014. Out in the Open: Finding and Categorising Errors in the Lexical Simplification Pipeline. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1583–1590, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Out in the Open: Finding and Categorising Errors in the Lexical Simplification Pipeline (Shardlow, LREC 2014)