Don’t stop pretraining! Efficiently building specialised language models in resource-constrained settings.

Sven Najem-Meyer, Frédéric Kaplan, Matteo Romanello


Abstract
Developing specialised language models for low-resource domains typically involves a trade-off between two specialisation strategies: adapting a general-purpose model through continued pretraining or retraining a model from scratch. While adapting preserves the model’s linguistic knowledge, retraining benefits from the flexibility of an in-domain tokeniser – a potentially significant advantage when handling rare languages. This study investigates the impact of tokenisation, specialisation strategy, and pretraining data availability using classical scholarship – a multilingual, code-switching and highly domain-specific field – as a case study. Through extensive experiments, we assess whether domain-specific tokenisation improves model performance, whether character-based models provide a viable alternative to subword-based models, and which specialisation strategy is optimal given the constraints of limited pretraining data. Contrary to prior findings, our results show that in-domain tokenisation does not necessarily enhance performance. Most notably, adaptation consistently outperforms retraining, even with limited data, confirming its efficiency as the preferred strategy for resource-constrained domains. These insights provide valuable guidelines for developing specialised models in fields with limited textual resources.
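The abstract contrasts two specialisation strategies: adaptation (continued pretraining of an existing model, keeping its tokeniser) and retraining from scratch with an in-domain tokeniser. The paper does not prescribe a particular toolkit; the sketch below is only an illustration of the two strategies using the Hugging Face Transformers API as an assumed implementation. The base checkpoint, the tiny in-domain corpus, the vocabulary size and the training hyperparameters are all placeholder values, not those used in the paper.

```python
# Minimal sketch of the two specialisation strategies discussed in the abstract.
# Assumptions: Hugging Face Transformers as the toolkit, a multilingual BERT base
# checkpoint, and a toy in-domain corpus standing in for classical scholarship text.
from transformers import (
    AutoConfig,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical in-domain corpus: an iterable of raw text lines (placeholder data).
domain_corpus = [
    "Hom. Il. 1.1: μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος",
    "cf. schol. ad loc.; see also Verg. Aen. 1.1.",
]

base_checkpoint = "bert-base-multilingual-cased"  # placeholder base model
base_tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)

# --- Strategy 1: adaptation (continued pretraining) --------------------------
# Keep the original tokeniser and pretrained weights; simply continue
# masked-language-model training on the in-domain corpus.
adapted_model = AutoModelForMaskedLM.from_pretrained(base_checkpoint)

# --- Strategy 2: retraining with an in-domain tokeniser -----------------------
# Learn a new subword vocabulary on the domain corpus, then initialise a model
# of the same architecture from scratch with that vocabulary size.
domain_tokenizer = base_tokenizer.train_new_from_iterator(
    iter(domain_corpus), vocab_size=32_000
)
config = AutoConfig.from_pretrained(base_checkpoint)
config.vocab_size = len(domain_tokenizer)
retrained_model = AutoModelForMaskedLM.from_config(config)


def pretrain(model, tokenizer, texts, output_dir):
    """Run a toy-sized masked-language-modelling pass over the corpus."""
    encodings = tokenizer(texts, truncation=True, padding=True, max_length=128)
    dataset = [{k: v[i] for k, v in encodings.items()} for i in range(len(texts))]
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm_probability=0.15
    )
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=2,
        report_to=[],
    )
    Trainer(
        model=model, args=args, train_dataset=dataset, data_collator=collator
    ).train()


pretrain(adapted_model, base_tokenizer, domain_corpus, "adapted-model")
pretrain(retrained_model, domain_tokenizer, domain_corpus, "retrained-model")
```

Under these assumptions, adaptation reuses both the tokeniser and the pretrained weights, whereas retraining trades the pretrained linguistic knowledge for a vocabulary tailored to the domain; the paper reports that the former consistently wins under limited pretraining data.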
Anthology ID:
2025.latechclfl-1.22
Volume:
Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Anna Kazantseva, Stan Szpakowicz, Stefania Degaetano-Ortlieb, Yuri Bizzoni, Janis Pagel
Venues:
LaTeCHCLfL | WS
Publisher:
Association for Computational Linguistics
Pages:
252–260
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.latechclfl-1.22/
Cite (ACL):
Sven Najem-Meyer, Frédéric Kaplan, and Matteo Romanello. 2025. Don’t stop pretraining! Efficiently building specialised language models in resource-constrained settings.. In Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025), pages 252–260, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Don’t stop pretraining! Efficiently building specialised language models in resource-constrained settings. (Najem-Meyer et al., LaTeCHCLfL 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.latechclfl-1.22.pdf