Language Directions in Multilingual LLMs: A Layer-wise Diagnostic Study of Token Alignment and Pretraining Imprint

JaeSeong Kim; Suan Lee

Abstract

We investigate how multilingual representations emerge across depth in large language models. Using a unified probing framework, we analyze six multilingual LLMs across five languages (EN/ES/ZH/FR/DE), decomposing behavior into (i) early-layer dynamics, (ii) linear vs. MLP separability, and (iii) token–language alignment that tracks where vocabulary sharing peaks. Across models, we observe a consistent and substantial early jump: accuracy rises by +73.5 to +80.7 points from L0 to L1 on average, indicating that language-relevant signals become accessible immediately after the embedding layer. Moreover, representations are largely linearly separable: for 5/6 models, the mean gap between MLP and linear probes remains within ±0.5 points. Token–language alignment further reveals systematic structure, with peak vocabulary mass exceeding 48% in some models and substantial variation in the depth of peak sharing. These findings provide a compact, cross-model characterization of how multilingual information is organized across depth and introduce simple alignment metrics that complement accuracy-based evaluation.

Anthology ID: 2026.acl-srw.3
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 30–35
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.acl-srw.3/
Cite (ACL): JaeSeong Kim and Suan Lee. 2026. Language Directions in Multilingual LLMs: A Layer-wise Diagnostic Study of Token Alignment and Pretraining Imprint. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 30–35, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Language Directions in Multilingual LLMs: A Layer-wise Diagnostic Study of Token Alignment and Pretraining Imprint (Kim & Lee, ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.acl-srw.3.pdf
