Context-Aware Text Normalisation for Historical Dialects

Maria Sukhareva


Abstract
Context-aware historical text normalisation is a severely under-researched area. To fill this gap, we propose a context-aware normalisation approach that relies on state-of-the-art methods in neural machine translation and transfer learning: a multidialect normaliser with context-aware reranking of the candidates. The reranker relies on a word-level n-gram language model that is applied to the five best normalisation candidates. The results are evaluated on historical multidialect datasets of German, Spanish, Portuguese and Slovene. We show that incorporating dialectal information into training leads to an accuracy improvement on all datasets, and that context-aware reranking yields a further improvement over the baseline. For three out of six datasets, we reach a significantly higher accuracy than reported in previous studies; the other three results are comparable with the current state of the art. The code for the reranker is released as open source.
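The reranking step described in the abstract — scoring the five best normalisation candidates for a historical token with a word-level n-gram language model in context — can be illustrated with a minimal sketch. This is not the paper's implementation: the toy add-one-smoothed bigram model, the interpolation weight, and all function names below are assumptions made for illustration only.

```python
# Minimal sketch of context-aware n-best reranking (illustrative, not the
# paper's code): a toy word-level bigram LM rescoring the 5-best candidates
# produced by some normaliser, interpolated with the normaliser's own score.
import math
from collections import Counter


class BigramLM:
    """Word-level bigram language model with add-one smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent.split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def logprob(self, prev, word):
        # Add-one smoothed conditional log-probability log P(word | prev).
        num = self.bigrams[(prev, word)] + 1
        den = self.unigrams[prev] + self.vocab_size
        return math.log(num / den)

    def score(self, words):
        tokens = ["<s>"] + list(words) + ["</s>"]
        return sum(self.logprob(p, w) for p, w in zip(tokens, tokens[1:]))


def rerank(left_context, nbest, lm, alpha=0.5):
    """Pick the candidate maximising an interpolation of the normaliser's
    score and the LM score of the candidate placed in its left context.

    nbest: list of (candidate, normaliser_score) pairs, e.g. the five best
    normalisations of one historical word form."""
    def combined(item):
        cand, norm_score = item
        return alpha * norm_score + (1 - alpha) * lm.score(left_context + [cand])
    return max(nbest, key=combined)[0]


if __name__ == "__main__":
    # Toy target-side corpus standing in for modern-language training text.
    lm = BigramLM(["the old town hall", "the town was old", "he came to the town"])
    # Five hypothetical candidates for one historical spelling, with mock scores.
    nbest = [("town", -0.4), ("tone", -0.3), ("ton", -0.5),
             ("twin", -0.9), ("tome", -1.1)]
    print(rerank(["the", "old"], nbest, lm))  # context favours "town"
```

In this sketch, the normaliser's own score slightly prefers "tone", but the language-model score of the candidate in its left context overrides it, which is the intuition behind the context-aware reranking reported in the paper.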
Anthology ID: 2020.coling-main.89
Volume: Proceedings of the 28th International Conference on Computational Linguistics
Month: December
Year: 2020
Address: Barcelona, Spain (Online)
Editors: Donia Scott, Nuria Bel, Chengqing Zong
Venue: COLING
Publisher: International Committee on Computational Linguistics
Pages: 1023–1036
URL: https://aclanthology.org/2020.coling-main.89
DOI: 10.18653/v1/2020.coling-main.89
Cite (ACL): Maria Sukhareva. 2020. Context-Aware Text Normalisation for Historical Dialects. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1023–1036, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal): Context-Aware Text Normalisation for Historical Dialects (Sukhareva, COLING 2020)
PDF: https://preview.aclanthology.org/emnlp-22-attachments/2020.coling-main.89.pdf