Don’t stop pretraining! Efficiently building specialised language models in resource-constrained settings.

Sven Najem-Meyer, Frédéric Kaplan, Matteo Romanello


Abstract
Developing specialised language models for low-resource domains typically involves a trade-off between two specialisation strategies: adapting a general-purpose model through continued pretraining or retraining a model from scratch. While adapting preserves the model’s linguistic knowledge, retraining benefits from the flexibility of an in-domain tokeniser – a potentially significant advantage when handling rare languages. This study investigates the impact of tokenisation, specialisation strategy, and pretraining data availability using classical scholarship – a multilingual, code-switching and highly domain-specific field – as a case study. Through extensive experiments, we assess whether domain-specific tokenisation improves model performance, whether character-based models provide a viable alternative to subword-based models, and which specialisation strategy is optimal given the constraints of limited pretraining data. Contrary to prior findings, our results show that in-domain tokenisation does not necessarily enhance performance. Most notably, adaptation consistently outperforms retraining, even with limited data, confirming its efficiency as the preferred strategy for resource-constrained domains. These insights provide valuable guidelines for developing specialised models in fields with limited textual resources.
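The abstract contrasts two specialisation strategies: adaptation (continued pretraining of an existing model, keeping its tokeniser) and retraining from scratch with an in-domain tokeniser. The paper does not prescribe a particular toolkit; the sketch below is only an illustration of the two strategies using the Hugging Face Transformers API as an assumed implementation. The base checkpoint, the tiny in-domain corpus, the vocabulary size and the training hyperparameters are all placeholder values, not those used in the paper.

```python
# Minimal sketch of the two specialisation strategies discussed in the abstract.
# Assumptions: Hugging Face Transformers as the toolkit, a multilingual BERT base
# checkpoint, and a toy in-domain corpus standing in for classical scholarship text.
from transformers import (
    AutoConfig,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical in-domain corpus: an iterable of raw text lines (placeholder data).
domain_corpus = [
    "Hom. Il. 1.1: μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος",
    "cf. schol. ad loc.; see also Verg. Aen. 1.1.",
]

base_checkpoint = "bert-base-multilingual-cased"  # placeholder base model
base_tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)

# --- Strategy 1: adaptation (continued pretraining) --------------------------
# Keep the original tokeniser and pretrained weights; simply continue
# masked-language-model training on the in-domain corpus.
adapted_model = AutoModelForMaskedLM.from_pretrained(base_checkpoint)

# --- Strategy 2: retraining with an in-domain tokeniser -----------------------
# Learn a new subword vocabulary on the domain corpus, then initialise a model
# of the same architecture from scratch with that vocabulary size.
domain_tokenizer = base_tokenizer.train_new_from_iterator(
    iter(domain_corpus), vocab_size=32_000
)
config = AutoConfig.from_pretrained(base_checkpoint)
config.vocab_size = len(domain_tokenizer)
retrained_model = AutoModelForMaskedLM.from_config(config)


def pretrain(model, tokenizer, texts, output_dir):
    """Run a toy-sized masked-language-modelling pass over the corpus."""
    encodings = tokenizer(texts, truncation=True, padding=True, max_length=128)
    dataset = [{k: v[i] for k, v in encodings.items()} for i in range(len(texts))]
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm_probability=0.15
    )
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=2,
        report_to=[],
    )
    Trainer(
        model=model, args=args, train_dataset=dataset, data_collator=collator
    ).train()


pretrain(adapted_model, base_tokenizer, domain_corpus, "adapted-model")
pretrain(retrained_model, domain_tokenizer, domain_corpus, "retrained-model")
```

Under these assumptions, adaptation reuses both the tokeniser and the pretrained weights, whereas retraining trades the pretrained linguistic knowledge for a vocabulary tailored to the domain; the paper reports that the former consistently wins under limited pretraining data.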
Anthology ID:
2025.latechclfl-1.22
Volume:
Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Anna Kazantseva, Stan Szpakowicz, Stefania Degaetano-Ortlieb, Yuri Bizzoni, Janis Pagel
Venues:
LaTeCHCLfL | WS
Publisher:
Association for Computational Linguistics
Pages:
252–260
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.latechclfl-1.22/
Cite (ACL):
Sven Najem-Meyer, Frédéric Kaplan, and Matteo Romanello. 2025. Don’t stop pretraining! Efficiently building specialised language models in resource-constrained settings.. In Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025), pages 252–260, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Don’t stop pretraining! Efficiently building specialised language models in resource-constrained settings. (Najem-Meyer et al., LaTeCHCLfL 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.latechclfl-1.22.pdf