Semitic Root Encoding: Tokenization Based on the Templatic Morphology of Semitic Languages in NMT

Brendan T. Hatch; Stephen D. Richardson


Abstract

The morphological structure of Semitic languages, such as Arabic, is based on non-concatenative roots and templates. This complex word structure, which human speakers rely on, is obscured from neural models that employ traditional tokenization algorithms, such as byte-pair encoding (BPE) (Sennrich et al., 2016; Gage, 1994). In this work, we present and evaluate Semitic Root Encoding (SRE), a tokenization method that represents both concatenative and non-concatenative structures in Semitic words as sequences of root, template stem, and BPE tokens. We apply the method to neural machine translation (NMT) and find that SRE tokenization yields an average increase of 1.15 BLEU over the baseline. SRE tokenization is also robust against generating unattested combinations of roots and template stems. Finally, we compare the performance of SRE with tokenization based on non-linguistic root and template structures and with tokenization based on stems, providing evidence that NMT models are capable of leveraging tokens based on non-concatenative Semitic morphology.
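To make the idea of non-concatenative root and template structure concrete, the following is a minimal sketch of separating a word into a consonantal root and a template with empty slots, and of merging them back. It is not the authors' implementation: the abstract does not specify SRE's token format, so the function names, the `_` slot marker, the Latin transliteration, and the greedy left-to-right matching are all assumptions made for illustration.

```python
# Toy illustration only: hypothetical helpers, not the SRE algorithm itself.
# Words are given in a simplified Latin transliteration (an assumption).

def split_root_template(word: str, root: str, slot: str = "_") -> tuple[str, str]:
    """Replace the root consonants in `word` with `slot`, matching greedily
    from left to right, and return (root, template)."""
    template = []
    matched = 0  # number of root consonants consumed so far
    for ch in word:
        if matched < len(root) and ch == root[matched]:
            template.append(slot)
            matched += 1
        else:
            template.append(ch)
    if matched != len(root):
        raise ValueError(f"root {root!r} not found in order inside {word!r}")
    return root, "".join(template)


def merge_root_template(root: str, template: str, slot: str = "_") -> str:
    """Inverse of split_root_template: fill each slot with the next root consonant."""
    if template.count(slot) != len(root):
        raise ValueError("number of slots does not match root length")
    out = []
    next_consonant = 0
    for ch in template:
        if ch == slot:
            out.append(root[next_consonant])
            next_consonant += 1
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    for word in ("kataba", "maktab"):
        root, template = split_root_template(word, "ktb")
        print(word, "->", root, template, "->", merge_root_template(root, template))
```

In this sketch the same root `ktb` pairs with different templates, which is the kind of shared structure that a purely concatenative subword segmentation may not expose. Greedy matching is a simplification and can mis-assign slots when a root consonant also appears in an affix.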

Anthology ID: 2025.arabicnlp-main.3
Volume: Proceedings of The Third Arabic Natural Language Processing Conference
Month: November
Year: 2025
Address: Suzhou, China
Editors: Kareem Darwish, Ahmed Ali, Ibrahim Abu Farha, Samia Touileb, Imed Zitouni, Ahmed Abdelali, Sharefah Al-Ghamdi, Sakhar Alkhereyf, Wajdi Zaghouani, Salam Khalifa, Badr AlKhamissi, Rawan Almatham, Injy Hamed, Zaid Alyafeai, Areeb Alowisheq, Go Inoue, Khalil Mrini, Waad Alshammari
Venue: ArabicNLP
Publisher: Association for Computational Linguistics
Pages: 26–41
URL: https://preview.aclanthology.org/ingest-emnlp/2025.arabicnlp-main.3/
Cite (ACL): Brendan T. Hatch and Stephen D. Richardson. 2025. Semitic Root Encoding: Tokenization Based on the Templatic Morphology of Semitic Languages in NMT. In Proceedings of The Third Arabic Natural Language Processing Conference, pages 26–41, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Semitic Root Encoding: Tokenization Based on the Templatic Morphology of Semitic Languages in NMT (Hatch & Richardson, ArabicNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.arabicnlp-main.3.pdf
