Extensions to Brahmic script processing within the Nisaba library: new scripts, languages and utilities
Alexander Gutkin, Cibu Johny, Raiomond Doctor, Lawrence Wolf-Sonkin, Brian Roark
Abstract
The Brahmic family of scripts is used to record some of the most spoken languages in the world and is arguably the most diverse family of writing systems. In this work, we present several substantial extensions to Brahmic script functionality within the open-source Nisaba library of finite-state script normalization and processing utilities (Johny et al., 2021). First, we extend coverage from the original ten scripts to an additional ten scripts of South Asia and beyond, including some used to record endangered languages such as Dogri. Second, we augment the language layer so that scripts used by multiple languages in distinct ways can be processed correctly for more languages, such as the Bengali script when used for the low-resource language Santali. We document key changes to the finite-state engine required to support these new languages and scripts. Finally, we add new script processing utilities, including lightweight script-level reading normalization that (unlike existing visual normalization) does not preserve visual invariance, and a fixed-input transliteration mechanism specifically tailored to Brahmic text entry with ASCII characters.
- Anthology ID:
- 2022.lrec-1.692
- Volume:
- Proceedings of the Thirteenth Language Resources and Evaluation Conference
- Month:
- June
- Year:
- 2022
- Address:
- Marseille, France
- Editors:
- Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 6450–6460
- URL:
- https://preview.aclanthology.org/add_missing_videos/2022.lrec-1.692/
- Cite (ACL):
- Alexander Gutkin, Cibu Johny, Raiomond Doctor, Lawrence Wolf-Sonkin, and Brian Roark. 2022. Extensions to Brahmic script processing within the Nisaba library: new scripts, languages and utilities. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6450–6460, Marseille, France. European Language Resources Association.
- Cite (Informal):
- Extensions to Brahmic script processing within the Nisaba library: new scripts, languages and utilities (Gutkin et al., LREC 2022)
- PDF:
- https://preview.aclanthology.org/add_missing_videos/2022.lrec-1.692.pdf