MoVoC: Morphology-Aware Subword Construction for Ge’ez Script Languages

Hailay Kidu Teklehaymanot, Dren Fazlija, Wolfgang Nejdl


Abstract
Subword-based tokenization methods often fail to preserve morphological boundaries, a limitation especially pronounced in low-resource, morphologically complex languages such as those written in the Ge‘ez script. To address this, we present MoVoC (Morpheme-aware Subword Vocabulary Construction) and train MoVoC-Tok, a tokenizer that integrates supervised morphological analysis into the subword vocabulary. This hybrid segmentation approach combines morpheme-based and Byte Pair Encoding (BPE) tokens to preserve morphological integrity while maintaining lexical meaning. To tackle resource scarcity, we curate and release manually annotated morpheme data for four Ge‘ez script languages and a morpheme-aware vocabulary for two of them. While the proposed tokenization method does not lead to significant gains in automatic translation quality, we observe consistent improvements in the intrinsic metrics MorphoScore and Boundary Precision, highlighting the value of morphology-aware segmentation in enhancing linguistic fidelity and token efficiency. Our morpheme-annotated datasets and tokenizer dataset will be made publicly available under Open Data licenses to support further research in low-resource, morphologically rich languages.
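As a rough illustration of the idea described in the abstract, the following minimal sketch segments a word with an annotated morpheme lexicon when an entry exists and falls back to a generic subword splitter otherwise, then scores the result with a boundary precision measure. This is not the MoVoC implementation: the names `hybrid_segment`, `chunk_fallback`, `boundaries`, and `boundary_precision` are hypothetical, fixed-size chunking merely stands in for a trained BPE model, the Latin-script toy word replaces real Ge‘ez script data, and the precision formula is one common definition that may differ from the one used in the paper.

```python
from typing import Callable, Dict, List, Set


def chunk_fallback(word: str, size: int = 3) -> List[str]:
    """Stand-in for a BPE segmenter (assumption): split into fixed-size chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def hybrid_segment(
    word: str,
    morpheme_lexicon: Dict[str, List[str]],
    fallback: Callable[[str], List[str]],
) -> List[str]:
    """Prefer the annotated morpheme split; otherwise use the fallback splitter."""
    if word in morpheme_lexicon:
        return list(morpheme_lexicon[word])
    return fallback(word)


def boundaries(segments: List[str]) -> Set[int]:
    """Return the character offsets of the internal boundaries of a segmentation."""
    offsets: Set[int] = set()
    position = 0
    for segment in segments[:-1]:  # the end of the word is not a boundary
        position += len(segment)
        offsets.add(position)
    return offsets


def boundary_precision(predicted: List[str], gold: List[str]) -> float:
    """Fraction of predicted boundaries that are also gold morpheme boundaries.

    Returns 0.0 when no boundary is predicted (a convention assumed here).
    """
    predicted_offsets = boundaries(predicted)
    if not predicted_offsets:
        return 0.0
    return len(predicted_offsets & boundaries(gold)) / len(predicted_offsets)


if __name__ == "__main__":
    # Toy lexicon with a hypothetical gold morpheme annotation.
    lexicon = {"unhappiness": ["un", "happi", "ness"]}
    gold = lexicon["unhappiness"]

    # Morpheme-aware path: the annotated split is returned unchanged.
    print(hybrid_segment("unhappiness", lexicon, chunk_fallback))

    # Fallback-only path: chunk boundaries ignore the morpheme boundaries.
    print(boundary_precision(chunk_fallback("unhappiness"), gold))  # 0.0
```

In this toy setting the morpheme-aware path reproduces the gold boundaries exactly, while the fixed-size fallback places none of its boundaries at a morpheme edge, which is the kind of contrast a boundary-based intrinsic metric is meant to expose.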
Anthology ID:
2025.findings-emnlp.706
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
13131–13144
URL:
https://preview.aclanthology.org/ingest-luhme/2025.findings-emnlp.706/
DOI:
10.18653/v1/2025.findings-emnlp.706
Cite (ACL):
Hailay Kidu Teklehaymanot, Dren Fazlija, and Wolfgang Nejdl. 2025. MoVoC: Morphology-Aware Subword Construction for Ge’ez Script Languages. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 13131–13144, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
MoVoC: Morphology-Aware Subword Construction for Ge’ez Script Languages (Teklehaymanot et al., Findings 2025)
PDF:
https://preview.aclanthology.org/ingest-luhme/2025.findings-emnlp.706.pdf
Checklist:
 2025.findings-emnlp.706.checklist.pdf