Mathias Stenlund


2026

Pretrained large language models (LLMs) gain instruction-following abilities through instruction-tuning, a method which relies on datasets of instruction–response pairs. However, for low-resource languages, collecting human-authored instructions is costly, raising the question of whether synthetic instructions can substitute for human-authored ones in non-English languages. We compare instruction-tuning of a smaller pretrained LLM in four Nordic languages using (a) human-authored instructions paired with synthetic responses and (b) fully synthetic instruction–response pairs generated with a minimal-effort pipeline. Native-speaker evaluations show that models instruction-tuned on synthetic instructions perform on par with those trained on human-authored instructions for the largest Nordic languages, suggesting that minimal-effort synthetic instructions can serve as a practical alternative. In contrast, response quality deteriorates sharply for Icelandic, underscoring the limitations of current synthetic data generation pipelines when the LLM's competence in the target language is weak. Overall, our results highlight that while synthetic instructions can enable cost-efficient instruction-tuning for the largest Nordic languages, they remain insufficient for Icelandic, clarifying when minimal-effort synthetic approaches suffice and when they fall short.
We present a methodology for creating high-quality instruction prompts for low-resource Germanic languages that addresses a critical challenge: small annotator pools risk producing datasets reflecting narrow individual interests rather than diverse user needs. In this work, native speakers reformulate existing English prompts from OpenAssistant or create entirely original prompts, adapting them to reflect local contexts and natural language patterns while preserving broad task and topic diversity. This approach produced high-quality prompt datasets totaling 6,950 prompts across seven Germanic languages (German, Dutch, Swedish, Norwegian Bokmål/Nynorsk, Danish, Icelandic, and Faroese) with validated coverage of diverse tasks and topics. Blind evaluation demonstrates that human-reformulated prompts significantly outperform synthetically generated prompts in naturalness and comprehensibility, particularly for low-resource languages like Icelandic and Faroese. For the larger Scandinavian language, Danish, the difference was less pronounced. The prompt dataset is released under an open-source license at https://huggingface.co/datasets/AnnikaSimonsen/TrustLLM-reformulation-prompts.

2025

Segmenting languages based on morpheme boundaries, instead of relying on language-independent segmentation algorithms like Byte-Pair Encoding (BPE), has been shown to benefit downstream Natural Language Processing (NLP) task performance. This is, however, difficult for polysynthetic languages like Inuktitut due to a high morpheme-to-word ratio and the lack of appropriately sized annotated datasets. In this work, we demonstrate the potential of using pre-trained Large Language Models (LLMs) for surface-level morphological segmentation of Inuktitut by treating it as a binary classification task. We fine-tune on tasks derived from automatically annotated Inuktitut words written in Inuktitut syllabics. Our approach shows good potential when compared to previous neural approaches. We share our best model to encourage further studies on downstream NLP tasks for Inuktitut written in syllabics.
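As a minimal sketch of the binary-classification framing described above: surface-level segmentation can be derived from segmented words by labeling each character with 1 if a morpheme boundary follows it and 0 otherwise. The helper names and the English example word are illustrative assumptions, not taken from the paper, which works on Inuktitut syllabics.

```python
def boundary_labels(morphemes):
    """Convert a morpheme-segmented word into per-character binary labels.

    Label 1 marks the last character of a morpheme (a boundary follows it);
    label 0 marks every other position. The word-final character is never
    labeled as a boundary in this surface-level setup.
    """
    word = "".join(morphemes)
    labels = [0] * len(word)
    pos = 0
    for m in morphemes[:-1]:      # skip the last morpheme: no boundary at word end
        pos += len(m)
        labels[pos - 1] = 1       # boundary after this character
    return word, labels


def segment(word, labels):
    """Reconstruct the morpheme list from a word and its boundary labels."""
    pieces, start = [], 0
    for i, lab in enumerate(labels):
        if lab == 1:
            pieces.append(word[start:i + 1])
            start = i + 1
    pieces.append(word[start:])
    return pieces


# Illustrative example (English, for readability):
word, labels = boundary_labels(["un", "break", "able"])
# word == "unbreakable", labels mark boundaries after "un" and "break"
assert segment(word, labels) == ["un", "break", "able"]
```

A classifier fine-tuned on such labels predicts, for each character, whether a boundary follows; the predicted label sequence is then decoded back into morphemes as in `segment`.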