Bridging Linguistic Structure and Mechanistic Interpretability for Conceptual Interpretation in Language Models

Nura Aljaafari, Danilo Carvalho, Andre Freitas


Abstract
Understanding how language models compose meaning from linguistic input remains a central problem in interpretability research. Mechanistic studies have attributed functional roles to core transformer components; however, these findings derive largely from factual retrieval settings. Whether the same mechanisms support conceptual interpretation, the compositional mapping from definitional expressions to abstract meaning, remains insufficiently characterised. We introduce DSRA (Definitional Semantic Role Analysis), a methodology that applies causal tracing within the reverse dictionary task and augments restoration traces with definitional semantic roles (DSRs) grounded in Argument Structure Theory. This linguistic overlay identifies which compositional functions (e.g., genus, differentia quality) are associated with high-recovery states, extending activation patching beyond token-level localisation. Applied to GPT-J-6B (English) and BERTIN GPT-J-6B (Spanish), the results show that MLP layers associate content-bearing tokens with high-specificity DSR categories in early layers, MHA layers distribute integration across middle-to-upper layers with concentration at the final token, and hidden states aggregate information in upper layers. Alignment between restored states and DSR categories indicates systematic correspondence between internal activations and definitional structure, with consistent localisation patterns across both languages.
Anthology ID:
2026.conll-main.44
Volume:
Proceedings of the 30th Conference on Computational Natural Language Learning
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Claire Bonial, Yevgeni Berzak
Venues:
CoNLL | WS
Publisher:
Association for Computational Linguistics
Pages:
722–741
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.conll-main.44/
Cite (ACL):
Nura Aljaafari, Danilo Carvalho, and Andre Freitas. 2026. Bridging Linguistic Structure and Mechanistic Interpretability for Conceptual Interpretation in Language Models. In Proceedings of the 30th Conference on Computational Natural Language Learning, pages 722–741, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Bridging Linguistic Structure and Mechanistic Interpretability for Conceptual Interpretation in Language Models (Aljaafari et al., CoNLL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.conll-main.44.pdf