TRACE: Training and Inference-Time Interpretability Analysis for Language Models

Nura Aljaafari, Danilo Carvalho, Andre Freitas


Abstract
Understanding when and how linguistic knowledge emerges during language model training remains a central challenge for interpretability. Most existing tools are post hoc, rely on scalar metrics, or require nontrivial integration effort, making comprehensive interpretability analysis difficult to deploy and maintain. We introduce TRACE, a modular toolkit for training and inference-time interpretability analysis of transformer models. It enables lightweight, in-training analysis of linguistic and representational signals, including feature probing, intrinsic dimensionality, Hessian curvature, and output diagnostics. It integrates with ABSynth, a controllable synthetic corpus generator that provides structured annotations for precise evaluation of linguistic feature acquisition. Experiments with autoregressive transformers demonstrate that TRACE reveals developmental phenomena such as early syntactic emergence, delayed semantic acquisition, and representational compression, signals that traditional scalar metrics such as loss or accuracy overlook. With minimal integration effort, the tool enables layer-wise diagnostics, convergence-based early stopping, and detection of structural errors, making transformer analysis interpretable, actionable, and reproducible.
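TRACE's own API is not reproduced on this page, so the sketch below is only a minimal, hypothetical illustration in PyTorch of one signal the abstract names: per-layer intrinsic dimensionality collected during training via forward hooks, using a TwoNN-style estimator (Facco et al., 2017). All names here (two_nn_id, LayerTracer) are illustrative assumptions, not TRACE's interface.

import torch
import torch.nn as nn


def two_nn_id(x: torch.Tensor) -> float:
    """TwoNN intrinsic-dimensionality estimate over rows of x (n_points, dim)."""
    d = torch.cdist(x, x)                        # pairwise Euclidean distances
    d.fill_diagonal_(float("inf"))               # exclude self-distances
    r, _ = d.topk(2, dim=1, largest=False)       # 1st and 2nd nearest neighbours
    mu = r[:, 1] / r[:, 0].clamp_min(1e-12)      # ratio mu_i = r2 / r1
    mu = mu[torch.isfinite(mu) & (mu > 1)]       # keep well-defined ratios
    if mu.numel() < 2:
        return float("nan")
    return (mu.numel() / torch.log(mu).sum()).item()  # MLE: d = N / sum(log mu)


class LayerTracer:
    """Hypothetical tracer (not TRACE's class): hooks a stack of modules
    and records one intrinsic-dimensionality reading per forward pass."""

    def __init__(self, layers):
        self.records = {i: [] for i in range(len(layers))}
        for i, layer in enumerate(layers):
            layer.register_forward_hook(self._make_hook(i))

    def _make_hook(self, i):
        def hook(module, inputs, output):
            h = output.detach().flatten(0, -2).float()   # (tokens, hidden_dim)
            if h.shape[0] > 2:
                self.records[i].append(two_nn_id(h))
        return hook


# Toy stand-in for a transformer's layer stack.
layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(4)])
tracer = LayerTracer(layers)

x = torch.randn(8, 16, 64)   # (batch, seq, hidden)
for layer in layers:
    x = torch.relu(layer(x))

for i, ids in tracer.records.items():
    print(f"layer {i}: intrinsic dimension ~ {ids[-1]:.2f}")

In a real training loop, such per-layer readings would be logged every N steps alongside the loss; a sustained plateau in intrinsic dimension is one plausible trigger for the convergence-based early stopping the abstract mentions.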
Anthology ID:
2025.emnlp-demos.62
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Ivan Habernal, Peter Schulam, Jörg Tiedemann
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
806–820
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-demos.62/
Cite (ACL):
Nura Aljaafari, Danilo Carvalho, and Andre Freitas. 2025. TRACE: Training and Inference-Time Interpretability Analysis for Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 806–820, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
TRACE: Training and Inference-Time Interpretability Analysis for Language Models (Aljaafari et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-demos.62.pdf