Word-level Cross-lingual Structure in Large Language Models

Zihao Feng; Hailong Cao; Wang Xu; Tiejun Zhao (赵铁军)

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance across a broad spectrum of cross-lingual Natural Language Processing (NLP) tasks. However, previous methods predominantly focus on leveraging parallel corpora to construct instruction data for continued pre-training or fine-tuning, and they ignore how parallel data is represented in the hidden layers of LLMs. In this paper, we demonstrate the Word-level Cross-lingual Structure (WCS) of LLMs: word-level embeddings in the hidden layers are isomorphic across languages. Specifically, we find that the hidden states of inputs in different languages can be aligned at the word level with an orthogonal matrix. We support this conclusion both mathematically and through downstream tasks on two representative foundation LLMs, LLaMA2 and BLOOM. In addition, we propose an Isomorphism-based Data Augmentation (IDA) method that applies WCS to a downstream cross-lingual task, Bilingual Lexicon Induction (BLI), in both supervised and unsupervised settings. Experiments show that the proposed method significantly improves over all baselines, especially on low-resource languages.
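
To illustrate what aligning two embedding spaces with an orthogonal matrix means, the following is a minimal sketch of the standard orthogonal Procrustes solution. It is not taken from the paper: the function name `fit_orthogonal_map`, the dimensions, and the synthetic data are assumptions made for illustration, with random vectors standing in for word-level hidden states extracted from an LLM.

```python
import numpy as np


def fit_orthogonal_map(src, tgt):
    """Return the orthogonal W minimizing ||src @ W - tgt||_F.

    src, tgt: arrays of shape (n_pairs, dim); row i of each is assumed
    to hold the embedding of a translation pair (hypothetical input).
    """
    # Orthogonal Procrustes: take the SVD of the cross-covariance
    # matrix and drop the singular values.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


# Synthetic stand-in for word-level hidden states (assumption): the
# target space is a rotated, slightly noisy copy of the source space.
rng = np.random.default_rng(0)
dim, n_pairs = 16, 200
src = rng.standard_normal((n_pairs, dim))
true_rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
tgt = src @ true_rotation + 0.01 * rng.standard_normal((n_pairs, dim))

w = fit_orthogonal_map(src, tgt)

# W should be orthogonal, and mapping src through W should land near tgt.
orthogonality_error = np.abs(w.T @ w - np.eye(dim)).max()
relative_residual = np.linalg.norm(src @ w - tgt) / np.linalg.norm(tgt)
```

If two spaces are approximately isomorphic in this sense, `relative_residual` stays small even though `w` is restricted to rotations and reflections. On real hidden states, the residual would be one rough way to gauge how well the orthogonality assumption holds.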

Anthology ID: 2025.coling-main.138
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 2026–2037
URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/2025.coling-main.138/
Cite (ACL): Zihao Feng, Hailong Cao, Wang Xu, and Tiejun Zhao. 2025. Word-level Cross-lingual Structure in Large Language Models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 2026–2037, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Word-level Cross-lingual Structure in Large Language Models (Feng et al., COLING 2025)
PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/2025.coling-main.138.pdf