Lucia Domenichelli
2026
Linguistic Profiling of Transformer Embedding Geometry
Lucia Domenichelli | Dominique Brunato | Felice Dell’Orletta
Proceedings of the 30th Conference on Computational Natural Language Learning
Transformer language models embed tokens in high-dimensional spaces, but whether geometry reflects linguistic structure remains unclear. We analyse token representations in BERT and GPT-2, selected as canonical encoder-only and decoder-only Transformer architectures, through a linguistically grounded geometric lens. We partition tokens from the UD English-EWT treebank by surface and syntactic features (position, length, POS, head distance and arity) and examine how their representational geometry evolves across layers. We employ complementary diagnostic metrics, including isotropy, linear and nonlinear intrinsic dimensionality, to capture distinct aspects of embedding structure. Our findings reveal that BERT maintains more isotropic and higher-dimensional subspaces, whereas GPT-2 exhibits stronger anisotropy driven by a compact cluster of sentence-initial tokens. Across models, open-class words, longer tokens, and high-arity predicates occupy more isotropic, higher-dimensional manifolds than short function words and pre-head modifiers, indicating that semantic richness and syntactic centrality play a key role in structuring embedding space. Our analysis provides a reusable framework for profiling how linguistic abstractions organize the geometry of Transformer embeddings.
2025
The Role of Eye-Tracking Data in Encoder-Based Models: An In-depth Linguistic Analysis
Lucia Domenichelli | Luca Dini | Dominique Brunato | Felice Dell’Orletta
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)
From Human Reading to NLM Understanding: Evaluating the Role of Eye-Tracking Data in Encoder-Based Models
Luca Dini | Lucia Domenichelli | Dominique Brunato | Felice Dell’Orletta
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Cognitive signals, particularly eye-tracking data, offer valuable insights into human language processing. Leveraging eye-gaze data from the Ghent Eye-Tracking Corpus, we conducted a series of experiments to examine how integrating knowledge of human reading behavior impacts Neural Language Models (NLMs) across multiple dimensions: task performance, attention mechanisms, and the geometry of their embedding space. We explored several fine-tuning methodologies to inject eye-tracking features into the models. Our results reveal that incorporating these features does not degrade downstream task performance, enhances alignment between model attention and human attention patterns, and compresses the geometry of the embedding space.