Abstract
Morphological word segmentation splits a given word into its morphemes (roots and affixes), the smallest meaning-bearing units of language. We introduce a novel approach, called LLMSegm, to surface-level morphological segmentation leveraging large language models (LLMs). The proposed approach is applicable in low-data settings as well as for low-resourced languages. We show how to transform the surface-level morphological segmentation task into a binary classification problem and train LLMs to solve it efficiently. For input, we leverage the information from the default LLM subword tokenisation and a custom morphological segmentation using a novel encoding. The evaluation of LLMSegm across seven morphologically diverse languages demonstrates substantial gains in minimally-supervised settings as well as for low-resourced languages, compared to several existing competitive approaches. In terms of F1-scores and accuracy, we achieve improved results compared to the competing methods on six out of seven datasets.
Keywords: morphological segmentation, surface-level segmentation, large language models, low-resource settings
- Anthology ID:
- 2024.lrec-main.933
- Volume:
- Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
- Month:
- May
- Year:
- 2024
- Address:
- Torino, Italia
- Editors:
- Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
- Venues:
- LREC | COLING
- Publisher:
- ELRA and ICCL
- Pages:
- 10665–10674
- URL:
- https://aclanthology.org/2024.lrec-main.933
- Cite (ACL):
- Marko Pranjić, Marko Robnik-Šikonja, and Senja Pollak. 2024. LLMSegm: Surface-level Morphological Segmentation Using Large Language Model. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 10665–10674, Torino, Italia. ELRA and ICCL.
- Cite (Informal):
- LLMSegm: Surface-level Morphological Segmentation Using Large Language Model (Pranjić et al., LREC-COLING 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/2024.lrec-main.933.pdf
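
The abstract describes recasting surface-level segmentation as a binary classification problem. One common way to frame this, shown here only as an illustrative sketch, is to decide for every position between two adjacent characters whether a morpheme boundary falls there. The abstract does not specify the encoding LLMSegm actually uses, so the function names `boundary_labels` and `segment` and the per-gap labelling scheme below are assumptions made for illustration, not the authors' implementation.

```python
def boundary_labels(morphemes):
    """Turn a gold segmentation into per-gap binary labels.

    Hypothetical helper: labels[i] is 1 if a morpheme boundary
    follows character i of the word, otherwise 0.
    """
    word = "".join(morphemes)
    labels = [0] * max(len(word) - 1, 0)
    pos = 0
    for morpheme in morphemes[:-1]:
        pos += len(morpheme)
        labels[pos - 1] = 1  # boundary right after this morpheme
    return word, labels


def segment(word, labels):
    """Rebuild the surface segmentation from per-gap binary decisions."""
    parts, start = [], 0
    for i, is_boundary in enumerate(labels):
        if is_boundary:
            parts.append(word[start:i + 1])
            start = i + 1
    parts.append(word[start:])
    return parts


word, labels = boundary_labels(["un", "break", "able"])
print(word, labels)
print(segment(word, labels))
```

Under this framing, a classifier only has to answer a yes/no question per candidate boundary, and the predicted labels can be mapped back to morphemes with `segment`.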