Data Drives Unstable Hierarchical Generalization in LMs

Tian Qin, Naomi Saphra, David Alvarez-Melis


Abstract
Early in training, LMs can behave like n-gram models, but eventually they often learn tree-based syntactic rules and generalize hierarchically out of distribution (OOD). We study this shift using controlled grammar-learning tasks: question formation and tense inflection. We find that a model learns to generalize hierarchically if its training data is *complex*: in particular, if it includes center-embedded clauses, a special syntactic structure. Under this definition, complex data drives hierarchical rules, while less complex data encourages shortcut learning in the form of n-gram-like linear rules. Furthermore, we find that a model generalizes by rule, whether hierarchical or linear, if its training data is *diverse*: in particular, if it includes many distinct syntax trees. Under this definition, diverse data promotes stable rule learning, whereas less diverse data promotes memorization of individual syntactic sequences. Finally, intermediate diversity and intermediate complexity form an *unstable regime*, characterized by oscillatory learning dynamics and inconsistent behavior across random seeds. These results highlight the central role of training data in shaping generalization and explain why competing strategies can lead to unstable outcomes.
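The hierarchical-versus-linear contrast on question formation can be made concrete with a small sketch (illustrative, not the paper's code): the hierarchical rule fronts the main-clause auxiliary, whereas the linear, n-gram-like rule fronts the first auxiliary in the string, and the two diverge exactly when an embedded clause containing an auxiliary precedes the main auxiliary. The example sentence, auxiliary set, and the `main_aux_index` argument below are assumptions for illustration only.

```python
# Toy contrast between the two question-formation rules (illustrative sketch).
# The auxiliary set and example sentence are hypothetical, not from the paper.

AUXILIARIES = {"does", "doesn't", "can", "can't", "is", "isn't"}

def linear_rule(tokens):
    """Linear (n-gram-like) shortcut: front the FIRST auxiliary in the string."""
    idx = next(i for i, t in enumerate(tokens) if t in AUXILIARIES)
    return [tokens[idx]] + tokens[:idx] + tokens[idx + 1:]

def hierarchical_rule(tokens, main_aux_index):
    """Hierarchical rule: front the MAIN-clause auxiliary.

    `main_aux_index` stands in for a syntactic analysis; in a controlled
    grammar the main auxiliary is determined by the parse tree.
    """
    return [tokens[main_aux_index]] + tokens[:main_aux_index] + tokens[main_aux_index + 1:]

# Declarative whose subject contains an embedded clause, so the rules diverge.
sent = "my walrus that can swim does eat fish".split()
print(" ".join(linear_rule(sent)))            # can my walrus that swim does eat fish  (wrong)
print(" ".join(hierarchical_rule(sent, 5)))   # does my walrus that can swim eat fish  (correct)
```

On sentences without an embedded auxiliary before the main one, both rules produce the same output, which is why only structurally complex (center-embedded) training examples disambiguate the two generalizations.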
Anthology ID:
2025.emnlp-main.593
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
11733–11751
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.emnlp-main.593/
DOI:
10.18653/v1/2025.emnlp-main.593
Cite (ACL):
Tian Qin, Naomi Saphra, and David Alvarez-Melis. 2025. Data Drives Unstable Hierarchical Generalization in LMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 11733–11751, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Data Drives Unstable Hierarchical Generalization in LMs (Qin et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.emnlp-main.593.pdf
Checklist:
2025.emnlp-main.593.checklist.pdf