Developing a Spell and Grammar Checker for Icelandic using an Error Corpus

Hulda Óladóttir, Þórunn Arnardóttir, Anton Ingason, Vilhjálmur Þorsteinsson


Abstract
A lack of datasets for spelling and grammatical error correction in Icelandic, along with language-specific issues, has caused a dearth of spell and grammar checking systems for the language. We present the first open-source spell and grammar checking tool for Icelandic, using an error corpus at all stages. This error corpus was in part created to aid in the development of the tool. The system is built with a rule-based tool stack comprising a tokenizer, a morphological tagger, and a parser. For token-level error annotation, tokenization rules, word lists, and a trigram model are used in error detection and correction. For sentence-level error annotation, we use specific error grammar rules in the parser as well as regex-like patterns to search syntax trees. The error corpus gives valuable insight into the errors typically made when Icelandic text is written, and guided each development phase in a test-driven manner. We assess the system’s performance with both automatic and human evaluation, using the test set in the error corpus as a reference in the automatic evaluation. The data in the error corpus development set proved useful in various ways for error detection and correction.
Anthology ID:
2022.lrec-1.496
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4644–4653
URL:
https://aclanthology.org/2022.lrec-1.496
Cite (ACL):
Hulda Óladóttir, Þórunn Arnardóttir, Anton Ingason, and Vilhjálmur Þorsteinsson. 2022. Developing a Spell and Grammar Checker for Icelandic using an Error Corpus. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4644–4653, Marseille, France. European Language Resources Association.
Cite (Informal):
Developing a Spell and Grammar Checker for Icelandic using an Error Corpus (Óladóttir et al., LREC 2022)
PDF:
https://preview.aclanthology.org/emnlp22-frontmatter/2022.lrec-1.496.pdf