Abstract
In computer-mediated communication, users of Latin-based scripts often omit diacritics when writing. Such text is typically easy for humans to understand but very difficult to process computationally, because many words become ambiguous or unknown. Letter-level approaches to diacritic restoration generalise better and do not require much training data, but word-level approaches tend to yield better results. However, word-level approaches typically rely on a lexicon, which is an expensive resource, does not cover non-standard forms, and is often unavailable for less-resourced languages. In this paper we present diacritic restoration models that are trained on easy-to-acquire corpora. We test three different types of corpora (Wikipedia, general web, Twitter) for three South Slavic languages (Croatian, Serbian and Slovene) and evaluate them on two types of text: standard (Wikipedia) and non-standard (Twitter). The proposed approach considerably outperforms charlifter, so far the only open source tool available for this task. We make the best performing systems freely available.
 - Anthology ID:
 - L16-1573
 - Volume:
 - Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
 - Month:
 - May
 - Year:
 - 2016
 - Address:
 - Portorož, Slovenia
 - Editors:
 - Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
 - Venue:
 - LREC
 - Publisher:
 - European Language Resources Association (ELRA)
 - Pages:
 - 3612–3616
 - URL:
 - https://aclanthology.org/L16-1573
 - Cite (ACL):
 - Nikola Ljubešić, Tomaž Erjavec, and Darja Fišer. 2016. Corpus-Based Diacritic Restoration for South Slavic Languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3612–3616, Portorož, Slovenia. European Language Resources Association (ELRA).
 - Cite (Informal):
 - Corpus-Based Diacritic Restoration for South Slavic Languages (Ljubešić et al., LREC 2016)
 - PDF:
 - https://preview.aclanthology.org/ingest-acl-2023-videos/L16-1573.pdf
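
The word-level, corpus-based idea summarised in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it only assumes that a diacritised corpus is available, strips diacritics from each token to obtain a lookup key, and restores each undiacritised word to its most frequent diacritised form seen in the corpus. The function names (`strip_diacritics`, `train`, `restore`), the toy corpus, and the mapping of `đ` to `d` are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(word):
    """Remove diacritics; 'đ' has no Unicode decomposition, so map it by hand."""
    # Assumption: writers who drop diacritics type 'd' for 'đ'.
    table = str.maketrans({"đ": "d", "Đ": "D"})
    decomposed = unicodedata.normalize("NFD", word.translate(table))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train(corpus_tokens):
    """Map each stripped form to its most frequent diacritised form."""
    counts = defaultdict(Counter)
    for token in corpus_tokens:
        counts[strip_diacritics(token)][token] += 1
    return {plain: forms.most_common(1)[0][0] for plain, forms in counts.items()}


def restore(tokens, lexicon):
    """Replace known tokens; leave unknown tokens unchanged."""
    return [lexicon.get(token, token) for token in tokens]


# Toy corpus, invented for this example.
corpus = "želim čaj i šećer želim kuću".split()
lexicon = train(corpus)
restored = restore("zelim caj i secer".split(), lexicon)
print(" ".join(restored))
```

A lookup of this kind ignores context, so it cannot separate words that are both valid with and without diacritics; it is meant only to show how a restoration lexicon can be derived from a corpus rather than from a hand-built resource.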