The Effect of Model Capacity and Script Diversity on Subword Tokenization for Sorani Kurdish

Ali Salehi, Cassandra L. Jacobs


Abstract
Tokenization and morphological segmentation continue to pose challenges for text processing and studies of human language. Here, we focus on written Soranî Kurdish, which uses a modified script based on Persian and Arabic, and its transliterations into the Kurdish Latin script. Importantly, Perso-Arabic and Latin-based writing systems demonstrate different statistical and structural properties, which may have significant effects on subword vocabulary learning. These differences have major consequences for frequency- or probability-based models of morphological induction. We explore whether jointly training subword vocabularies on a source script along with its transliteration improves morphological segmentation and subword tokenization, and whether gains are observed for one writing system over the other. We find that joint training has a similar effect to increasing vocabulary size while keeping subwords shorter in length, which produces higher-quality subwords that map onto morphemes.
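To make the idea of joint training concrete, the following is a minimal sketch, not the authors' implementation. It assumes byte-pair encoding (BPE) as a stand-in subword learner, since the abstract does not name the algorithm, and it uses a handful of toy word forms ("kitêb", "dar" and suffixed variants) in both scripts as hypothetical illustrative data, not data from the paper. The function names `learn_bpe`, `merge_pair`, and `segment` are likewise invented for this illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpora: the same word forms in Latin and Perso-Arabic script.
latin = ["kitêb", "kitêbeke", "kitêbekan", "dar", "dareke", "darekan"]
arabic = ["کتێب", "کتێبەکە", "کتێبەکان", "دار", "دارەکە", "دارەکان"]

# Single-script vocabulary versus a vocabulary trained jointly on both scripts.
latin_merges = learn_bpe(latin, 10)
joint_merges = learn_bpe(latin + arabic, 20)

print(segment("kitêbeke", latin_merges))
print(segment("kitêbeke", joint_merges))
print(segment(arabic[1], joint_merges))
```

In this sketch the joint model shares one merge budget across both scripts, so each script effectively receives fewer merges than a single-script model with the same budget. This is only one way to set up the comparison; the merge counts of 10 and 20 are arbitrary choices for the toy data.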
Anthology ID:
2024.sigmorphon-1.6
Volume:
Proceedings of the 21st SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Garrett Nicolai, Eleanor Chodroff, Frederic Mailhot, Çağrı Çöltekin
Venue:
SIGMORPHON
SIG:
SIGMORPHON
Publisher:
Association for Computational Linguistics
Pages:
51–56
URL:
https://aclanthology.org/2024.sigmorphon-1.6
Cite (ACL):
Ali Salehi and Cassandra L. Jacobs. 2024. The Effect of Model Capacity and Script Diversity on Subword Tokenization for Sorani Kurdish. In Proceedings of the 21st SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 51–56, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
The Effect of Model Capacity and Script Diversity on Subword Tokenization for Sorani Kurdish (Salehi & Jacobs, SIGMORPHON 2024)
PDF:
https://preview.aclanthology.org/fix-volume-bibkeys/2024.sigmorphon-1.6.pdf