Diversification Catalyzes Language Models’ Instruction Generalization To Unseen Semantics

Dylan Zhang, Justin Wang, Francois Charton


Abstract
Instruction-tuned language models excel in knowledge, reasoning, and instruction-following. While knowledge and reasoning are well-explored, the factors enabling generalization to unseen instructions remain underexplored due to challenges in isolating instruction-following dynamics. In this work, we model instruction-following as a computational process and design controlled experiments inspired by the Turing-complete Markov algorithm to disentangle its dynamics. Our findings reveal that the ability to generalize to instructions with unseen semantics emerges only when training data is strategically diversified across rich semantics. This finding gives us the hammer that breaks down the wall separating training instructions from the unseen ones encountered in the wild. For specialist models, a balanced mix of in-domain and diverse out-of-domain tasks enhances performance more effectively than simply increasing in-domain data. For generalist models, domain diversification consistently outweighs the costs of reduced task-specific data, regardless of data budget. Furthermore, we show that proper diversification with a lower data budget can outperform simply scaling up data volume. These findings highlight strategic data diversification as key to optimizing instruction-following and improving model performance across applications.
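The abstract models instruction following as executing a computational process and references the Markov algorithm, a Turing-complete formalism in which an ordered list of string-rewriting rules is applied to an input string. For readers unfamiliar with the formalism, the sketch below is a minimal, hypothetical Markov-algorithm interpreter in Python; the function name, rule encoding, and toy example are illustrative assumptions and are not drawn from the paper's experimental setup.

```python
# Minimal Markov algorithm interpreter (illustrative sketch, not the paper's code).
# Rules are scanned in priority order; the first rule whose pattern occurs in the
# string rewrites the leftmost occurrence, then scanning restarts from the first
# rule. A rule marked terminal halts the computation after it fires.

def run_markov(rules, string, max_steps=1000):
    """rules: list of (pattern, replacement, is_terminal) triples, in priority order."""
    for _ in range(max_steps):
        for pattern, replacement, is_terminal in rules:
            if pattern in string:
                string = string.replace(pattern, replacement, 1)  # leftmost match only
                if is_terminal:
                    return string
                break  # restart scanning from the highest-priority rule
        else:
            return string  # no rule applies: the algorithm halts
    return string  # step budget exhausted

# Toy "instruction": rewrite every 'a' to 'b'.
print(run_markov([("a", "b", False)], "banana"))  # -> "bbnbnb"
```

As we read the abstract, an instruction in the paper's controlled setting plays the role of such a rule set, and generalization means executing rule sets whose semantics never appeared during training; the interpreter above only illustrates the underlying rewriting formalism.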
Anthology ID:
2025.findings-acl.1193
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venues:
Findings | WS
Publisher:
Association for Computational Linguistics
Pages:
23236–23249
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.1193/
Cite (ACL):
Dylan Zhang, Justin Wang, and Francois Charton. 2025. Diversification Catalyzes Language Models’ Instruction Generalization To Unseen Semantics. In Findings of the Association for Computational Linguistics: ACL 2025, pages 23236–23249, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Diversification Catalyzes Language Models’ Instruction Generalization To Unseen Semantics (Zhang et al., Findings 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.1193.pdf