ManufactuBERT: Efficient Continual Pretraining for Manufacturing

Robin Armingaud, Romaric Besancon


Abstract
While large general-purpose Transformer-based encoders excel at general language understanding, their performance diminishes in specialized domains like manufacturing due to a lack of exposure to domain-specific terminology and semantics. In this paper, we address this gap by introducing ManufactuBERT, a RoBERTa model continually pretrained on a large-scale corpus curated for the manufacturing domain. We present a comprehensive data processing pipeline to create this corpus from web data, involving an initial domain-specific filtering step followed by a multi-stage deduplication process that removes redundancies. Our experiments show that ManufactuBERT establishes a new state of the art on a range of manufacturing-related NLP tasks, outperforming strong specialized baselines. More importantly, we demonstrate that training on our carefully deduplicated corpus significantly accelerates convergence, leading to a 33% reduction in training time and computational cost compared to training on the non-deduplicated dataset. The proposed pipeline offers a reproducible example for developing high-performing encoders in other specialized domains. Our model, code and curated corpus will be made publicly available.
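The abstract describes the corpus pipeline only at a high level (domain filtering, then multi-stage deduplication). As a rough illustration of that shape, and not the authors' implementation, the Python sketch below chains a keyword-based domain filter, an exact-duplicate pass over normalized text, and a near-duplicate pass using Jaccard similarity over word shingles. The term list, thresholds, and all function names are hypothetical choices made for this example; the paper's actual filtering and deduplication methods are not specified here.

```python
import hashlib
import re

# Hypothetical domain vocabulary; the paper's real filter is not described in the abstract.
DOMAIN_TERMS = {"machining", "welding", "cnc", "tolerance", "assembly"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_in_domain(text, min_hits=2):
    """Keep a document if it mentions at least `min_hits` distinct domain terms."""
    return len(set(tokenize(text)) & DOMAIN_TERMS) >= min_hits


def exact_key(text):
    """Hash of the normalized text, used to drop exact duplicates."""
    normalized = " ".join(tokenize(text))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text, n=3):
    """Set of word n-grams used to compare documents for near-duplication."""
    tokens = tokenize(text)
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_corpus(docs, near_dup_threshold=0.7):
    """Filter by domain, then remove exact and near duplicates, preserving order."""
    seen_keys = set()
    kept, kept_shingles = [], []
    for doc in docs:
        if not is_in_domain(doc):
            continue  # stage 1: domain filter
        key = exact_key(doc)
        if key in seen_keys:
            continue  # stage 2: exact duplicate
        seen_keys.add(key)
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, s) >= near_dup_threshold for s in kept_shingles):
            continue  # stage 3: near duplicate (quadratic scan, fine for a toy example)
        kept.append(doc)
        kept_shingles.append(doc_shingles)
    return kept
```

The pairwise near-duplicate scan is quadratic in the number of kept documents, so a web-scale corpus would need an approximate method such as MinHash with locality-sensitive hashing; the sketch only shows how the stages compose.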
Anthology ID:
2026.lrec-main.827
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
10545–10555
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.827/
Cite (ACL):
Robin Armingaud and Romaric Besancon. 2026. ManufactuBERT: Efficient Continual Pretraining for Manufacturing. International Conference on Language Resources and Evaluation, main:10545–10555.
Cite (Informal):
ManufactuBERT: Efficient Continual Pretraining for Manufacturing (Armingaud & Besancon, LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.827.pdf