Integrating Quasi-symbolic Conceptual Knowledge into Language Model Pre-training

Gábor Berend


Abstract
In this paper, we investigate the integration of latent conceptual knowledge into the pre-training of masked language models. Our solution is based on the use of an auxiliary model, from which we extract training signals for training a student model. We determine the training signals from the hidden representations of the auxiliary model in an unsupervised way, using sparse coding. Models trained on latent concepts alone show improved fine-tunability on downstream tasks; however, they perform worse on traditional language modeling, i.e., when the goal is to output missing tokens rather than latent semantic classes of words. In order to preserve the improved fine-tuning capability of the models while making them better at language modeling, we propose a final stage of pre-training during which we perform traditional masked language modeling. This final stage starts from a model that has already been pre-trained on the task of modeling latent semantic properties, with the weights of the backbone model kept frozen. During the final training phase, we only train a lightweight linear classifier layer on top of the logits that the model produces for the latent semantic properties. With this modification, we obtain the benefits of both the traditional training paradigm and the one based on latent semantic properties. We release our source code at github.com/SzegedAI/MLSM.
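The final pre-training stage described in the abstract can be illustrated with a minimal sketch (this is not the released MLSM implementation): assuming a backbone that has been pre-trained to emit per-token logits over a set of latent semantic concepts, the backbone is frozen and only a linear classifier mapping those concept logits to the vocabulary is trained with the standard masked language modeling loss. All module names, shapes, and dimensions below are illustrative assumptions.

```python
# Hypothetical sketch of the frozen-backbone final stage: only the linear layer
# mapping latent-concept logits to vocabulary logits is trained with the MLM loss.
# `latent_backbone` and all sizes are assumptions, not the authors' released code.
import torch
import torch.nn as nn


class FrozenLatentMLMHead(nn.Module):
    def __init__(self, latent_backbone: nn.Module, num_latent_concepts: int, vocab_size: int):
        super().__init__()
        self.backbone = latent_backbone          # outputs (batch, seq, num_latent_concepts)
        for p in self.backbone.parameters():     # freeze the pre-trained backbone
            p.requires_grad = False
        # the only trainable component: latent-concept logits -> vocabulary logits
        self.vocab_classifier = nn.Linear(num_latent_concepts, vocab_size)

    def forward(self, input_ids, attention_mask=None):
        with torch.no_grad():                    # backbone weights stay fixed
            latent_logits = self.backbone(input_ids, attention_mask)
        return self.vocab_classifier(latent_logits)


# Usage sketch: standard masked-token cross-entropy, optimizing only the linear head.
# model = FrozenLatentMLMHead(latent_backbone, num_latent_concepts=3000, vocab_size=32000)
# optimizer = torch.optim.AdamW(model.vocab_classifier.parameters(), lr=1e-3)
# logits = model(input_ids, attention_mask)      # (batch, seq, vocab_size)
# loss = nn.functional.cross_entropy(
#     logits.view(-1, logits.size(-1)), mlm_labels.view(-1), ignore_index=-100
# )
# loss.backward(); optimizer.step()
```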
Anthology ID:
2024.conll-babylm.13
Volume:
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning
Month:
November
Year:
2024
Address:
Miami, FL, USA
Editors:
Michael Y. Hu, Aaron Mueller, Candace Ross, Adina Williams, Tal Linzen, Chengxu Zhuang, Leshem Choshen, Ryan Cotterell, Alex Warstadt, Ethan Gotlieb Wilcox
Venues:
CoNLL | BabyLM | WS
Publisher:
Association for Computational Linguistics
Pages:
159–165
URL:
https://preview.aclanthology.org/Author-Pages-WenzhengZhang-ZhengyanShi-ShuYang/2024.conll-babylm.13/
Cite (ACL):
Gábor Berend. 2024. Integrating Quasi-symbolic Conceptual Knowledge into Language Model Pre-training. In The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning, pages 159–165, Miami, FL, USA. Association for Computational Linguistics.
Cite (Informal):
Integrating Quasi-symbolic Conceptual Knowledge into Language Model Pre-training (Berend, CoNLL-BabyLM 2024)
PDF:
https://preview.aclanthology.org/Author-Pages-WenzhengZhang-ZhengyanShi-ShuYang/2024.conll-babylm.13.pdf