Surface-Level Morphological Segmentation of Low-resource Inuktitut Using Pre-trained Large Language Models

Mathias Stenlund, Hemanadhan Myneni, Morris Riedel


Abstract
Segmenting languages based on morpheme boundaries instead of relying on language-independent segmentation algorithms such as Byte-Pair Encoding (BPE) has been shown to benefit downstream Natural Language Processing (NLP) task performance. This is, however, challenging for polysynthetic languages like Inuktitut due to their high morpheme-to-word ratio and the lack of appropriately sized annotated datasets. In this work, we demonstrate the potential of pre-trained Large Language Models (LLMs) for surface-level morphological segmentation of Inuktitut by treating it as a binary classification task. We fine-tune the models on tasks derived from automatically annotated Inuktitut words written in Inuktitut syllabics. Our approach shows good potential compared to previous neural approaches. We share our best model to encourage further studies on downstream NLP tasks for Inuktitut written in syllabics.
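The abstract frames surface-level segmentation as a per-character binary decision: for each character of a word, predict whether a morpheme boundary follows it. Below is a minimal sketch of that framing using the Hugging Face transformers token-classification API; it is not the authors' released model, and the checkpoint name, the 0/1 label scheme, and the example word are illustrative assumptions.

```python
# Sketch: surface-level morphological segmentation as binary classification.
# For each syllabic character, predict whether a morpheme boundary follows it.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "bert-base-multilingual-cased"  # placeholder pre-trained encoder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=2)
model.eval()

word = "ᐃᒡᓗᒥ"  # Inuktitut syllabics: iglu- "house" + -mi locative, "in the house"
# Feed each syllabic character as its own "word" so predictions align with
# surface positions of the input string.
enc = tokenizer(list(word), is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    preds = model(**enc).logits.argmax(-1)[0]  # one 0/1 label per sub-token

# Keep one prediction per character (first sub-token of each input position).
boundary_after, seen = [], set()
for i, wid in enumerate(enc.word_ids()):
    if wid is not None and wid not in seen:
        seen.add(wid)
        boundary_after.append(preds[i].item())

# Label 1 = "a morpheme boundary follows this character" (assumed scheme).
print("".join(ch + ("-" if b else "") for ch, b in zip(word, boundary_after)))
```

With an untrained classification head the outputs are of course arbitrary; the paper's contribution is fine-tuning such a model on boundary labels derived from automatically annotated Inuktitut words, after which the per-character predictions yield the surface segmentation.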
Anthology ID:
2025.nodalida-1.69
Volume:
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
Month:
March
Year:
2025
Address:
Tallinn, Estonia
Editors:
Richard Johansson, Sara Stymne
Venue:
NoDaLiDa
Publisher:
University of Tartu Library
Pages:
688–696
URL:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.nodalida-1.69/
Cite (ACL):
Mathias Stenlund, Hemanadhan Myneni, and Morris Riedel. 2025. Surface-Level Morphological Segmentation of Low-resource Inuktitut Using Pre-trained Large Language Models. In Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), pages 688–696, Tallinn, Estonia. University of Tartu Library.
Cite (Informal):
Surface-Level Morphological Segmentation of Low-resource Inuktitut Using Pre-trained Large Language Models (Stenlund et al., NoDaLiDa 2025)
PDF:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.nodalida-1.69.pdf