Synthesising a Corpus of Gaelic Traditional Narrative with Cross-Lingual Text Expansion
William Lamb, Dongge Han, Ondrej Klejch, Beatrice Alex, Peter Bell
Abstract
Advances in large language modelling have disproportionately benefited high-resource languages due to their vastly greater training data reserves. This paper proposes a novel cross-lingual text expansion (XLTE) technique using multilingual large language models (MLLMs) to mitigate data sparsity in low-resource languages. We apply XLTE to the domain of traditional Scottish Gaelic storytelling to generate a training corpus suitable for language modelling, for example as part of an automatic speech recognition system. The effectiveness of this technique is demonstrated using OpenAI’s GPT-4o, with supervised fine-tuning (SFT) providing decreased neologism rates and a 57.2% reduction in perplexity over the baseline model. Despite these promising results, qualitative analyses reveal important stylistic divergences between synthesised and genuine data. Nevertheless, XLTE offers a promising, scalable method for synthesising training sets in other languages and domains, opening avenues for further improvements in low-resource language modelling.
- Anthology ID:
- 2025.cltw-1.2
- Volume:
- Proceedings of the 5th Celtic Language Technology Workshop
- Month:
- January
- Year:
- 2025
- Address:
- Abu Dhabi [Virtual Workshop]
- Editors:
- Brian Davis, Theodorus Fransen, Elaine Uí Dhonnchadha, Abigail Walsh
- Venues:
- CLTW | WS
- Publisher:
- International Committee on Computational Linguistics
- Pages:
- 12–26
- URL:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/2025.cltw-1.2/
- Cite (ACL):
- William Lamb, Dongge Han, Ondrej Klejch, Beatrice Alex, and Peter Bell. 2025. Synthesising a Corpus of Gaelic Traditional Narrative with Cross-Lingual Text Expansion. In Proceedings of the 5th Celtic Language Technology Workshop, pages 12–26, Abu Dhabi [Virtual Workshop]. International Committee on Computational Linguistics.
- Cite (Informal):
- Synthesising a Corpus of Gaelic Traditional Narrative with Cross-Lingual Text Expansion (Lamb et al., CLTW 2025)
- PDF:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/2025.cltw-1.2.pdf