VLG-BERT: Towards Better Interpretability in LLMs through Visual and Linguistic Grounding

Toufik Mechouma, Ismail Biskri, Serge Robert


Abstract
We present VLG-BERT, a novel language model designed to improve how meaning is encoded. VLG-BERT offers deeper insight into meaning encoding in Large Language Models (LLMs) by focusing on linguistic and real-world semantics. It uses syntactic dependencies as a form of ground truth to supervise the learning of word representations, and it incorporates visual latent representations from pre-trained vision models together with their corresponding labels. A vocabulary of 10k tokens covering so-called concrete words is built by extending the set of ImageNet labels with synonyms, hyponyms, and hypernyms from WordNet. A lookup table over this vocabulary is then used to initialize the embedding matrix during training, rather than random initialization. This multimodal grounding provides a stronger semantic foundation for encoding the meaning of words, and the integration of visual and linguistic grounding makes the architecture consistent with foundational theories from across the cognitive sciences. Our approach contributes to the ongoing effort to build models that bridge the gap between language and vision, aligning them more closely with how humans understand and interpret the world. Experiments on text classification show excellent results compared to BERT Base.
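To make the grounding pipeline concrete, here is a minimal sketch of the two steps the abstract describes: expanding ImageNet labels into a concrete-word vocabulary via WordNet relations, and initializing an embedding matrix from a precomputed lookup table rather than random weights. This is an illustrative reconstruction, not the authors' code: the helper names `expand_label` and `build_grounded_embeddings`, the NLTK/PyTorch choice, the 768-dimension default, and the BERT-style fallback initialization are all assumptions.

```python
# Illustrative sketch (not the paper's released code): expand ImageNet labels
# with WordNet relations, then seed an embedding matrix from a lookup table.
import torch
import torch.nn as nn
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def expand_label(label: str) -> set[str]:
    """Hypothetical helper: collect synonyms, hyponyms, and hypernyms
    of an ImageNet label from WordNet (noun synsets only)."""
    tokens = set()
    for synset in wn.synsets(label.replace(" ", "_"), pos=wn.NOUN):
        # The synset's own lemmas give synonyms; related synsets give
        # hyponyms (more specific) and hypernyms (more general) terms.
        for s in [synset] + synset.hyponyms() + synset.hypernyms():
            tokens.update(lemma.name().lower() for lemma in s.lemmas())
    return tokens

def build_grounded_embeddings(vocab: list[str],
                              lookup: dict[str, torch.Tensor],
                              dim: int = 768) -> nn.Embedding:
    """Hypothetical helper: use a grounded vector where the lookup table
    has one, and BERT-style random init (std 0.02) everywhere else."""
    weight = torch.empty(len(vocab), dim).normal_(mean=0.0, std=0.02)
    for idx, token in enumerate(vocab):
        if token in lookup:  # concrete word with a visual-linguistic vector
            weight[idx] = lookup[token]
    return nn.Embedding.from_pretrained(weight, freeze=False)
```

Under these assumptions, `expand_label("tiger cat")` would return the set of concrete tokens contributed by that label, and the union over all ImageNet labels would form the 10k-token vocabulary whose rows in the embedding matrix are seeded from the lookup table.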
Anthology ID:
2025.nlp4dh-1.47
Volume:
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
Month:
May
Year:
2025
Address:
Albuquerque, USA
Editors:
Mika Hämäläinen, Emily Öhman, Yuri Bizzoni, So Miyagawa, Khalid Alnajjar
Venues:
NLP4DH | WS
Publisher:
Association for Computational Linguistics
Pages:
550–558
URL:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.nlp4dh-1.47/
Cite (ACL):
Toufik Mechouma, Ismail Biskri, and Serge Robert. 2025. VLG-BERT: Towards Better Interpretability in LLMs through Visual and Linguistic Grounding. In Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities, pages 550–558, Albuquerque, USA. Association for Computational Linguistics.
Cite (Informal):
VLG-BERT: Towards Better Interpretability in LLMs through Visual and Linguistic Grounding (Mechouma et al., NLP4DH 2025)
PDF:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.nlp4dh-1.47.pdf