Abstract
Context-aware historical text normalisation is a severely under-researched area. To fill the gap, we propose a context-aware normalisation approach that builds on state-of-the-art methods in neural machine translation and transfer learning. We propose a multidialect normaliser with context-aware reranking of the candidates. The reranker relies on a word-level n-gram language model that is applied to the five best normalisation candidates. The results are evaluated on historical multidialect datasets of German, Spanish, Portuguese and Slovene. We show that incorporating dialectal information into the training leads to an accuracy improvement on all the datasets. The context-aware reranking gives a further improvement over the baseline. For three out of six datasets, we reach a significantly higher accuracy than reported in previous studies. The other three results are comparable with the current state of the art. The code for the reranker is published as open source.
- Anthology ID:
- 2020.coling-main.89
- Volume:
- Proceedings of the 28th International Conference on Computational Linguistics
- Month:
- December
- Year:
- 2020
- Address:
- Barcelona, Spain (Online)
- Venue:
- COLING
- Publisher:
- International Committee on Computational Linguistics
- Pages:
- 1023–1036
- URL:
- https://aclanthology.org/2020.coling-main.89
- DOI:
- 10.18653/v1/2020.coling-main.89
- Cite (ACL):
- Maria Sukhareva. 2020. Context-Aware Text Normalisation for Historical Dialects. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1023–1036, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- Cite (Informal):
- Context-Aware Text Normalisation for Historical Dialects (Sukhareva, COLING 2020)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2020.coling-main.89.pdf
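
The abstract describes reranking the five best normalisation candidates with a word-level n-gram language model. The following is a minimal sketch of that idea, not the published reranker: the bigram order, add-one smoothing, greedy left-to-right decoding, the `lm_weight` parameter, the toy corpus and the candidate scores are all assumptions made for illustration.

```python
import math
from collections import Counter


class BigramLM:
    """Word-level bigram language model with add-one smoothing (illustrative assumption)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent  # sentence-start marker gives the first word a context
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def logprob(self, prev, word):
        """Smoothed log P(word | prev)."""
        num = self.bigrams[(prev, word)] + 1
        den = self.unigrams[prev] + self.vocab_size
        return math.log(num / den)


def rerank(candidates_per_token, lm, lm_weight=1.0, n_best=5):
    """Greedily pick, for each token, the candidate with the best combined score.

    candidates_per_token: one list per input token of
    (candidate, normaliser_logprob) pairs, best first.
    """
    output = []
    prev = "<s>"
    for candidates in candidates_per_token:
        best, _ = max(
            candidates[:n_best],
            key=lambda c: c[1] + lm_weight * lm.logprob(prev, c[0]),
        )
        output.append(best)
        prev = best  # the chosen word becomes the context for the next token
    return output


# Toy normalised training text for the language model (hypothetical data).
corpus = [
    ["the", "old", "king"],
    ["the", "old", "house"],
    ["an", "old", "king"],
]
lm = BigramLM(corpus)

# Hypothetical normaliser output for the historical input "the olde kyng".
candidates = [
    [("the", -0.1), ("thee", -2.5)],
    [("odd", -0.9), ("old", -1.0)],
    [("kin", -0.6), ("king", -0.7)],
]

print(rerank(candidates, lm, lm_weight=0.0))  # ['the', 'odd', 'kin']
print(rerank(candidates, lm))                 # ['the', 'old', 'king']
```

With `lm_weight=0.0` the sketch falls back to the normaliser's own top candidate for each token; with the language model enabled, the left context moves "old" and "king" above the slightly higher-scored but contextually unlikely "odd" and "kin".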