VersaTune: An Efficient Data Composition Framework for Training Multi-Capability LLMs

Keer Lu, Keshi Zhao, Zhuoran Zhang, Zheng Liang, Bin Cui, Tengjiao Wang, Wentao Zhang


Abstract
As demonstrated by proprietary Large Language Models (LLMs) such as the GPT and Claude series, a single model can achieve remarkable proficiency across a wide range of domains, including law, medicine, finance, science, and code. These capabilities are further strengthened during the Supervised Fine-Tuning (SFT) phase. However, existing work mainly focuses on domain-specific enhancement during fine-tuning, which risks catastrophic forgetting of knowledge in other domains. In this study, we introduce VersaTune, a novel data composition framework designed to enhance LLMs' overall multi-domain capabilities during training. We first detect the distribution of domain-specific knowledge within the base model, then compose the training data to align with this existing knowledge distribution. During the subsequent training process, domain weights are dynamically adjusted according to each domain's learnable potential and degree of forgetting. Experimental results show that VersaTune is effective at fostering multi-domain ability, improving overall multi-domain performance by 29.77% compared to uniform domain weights. Furthermore, Qwen-2.5-32B + VersaTune surpasses frontier models including GPT-4o, Claude-3.5-Sonnet, and DeepSeek-V3 by 0.86%, 4.76%, and 4.60%, respectively. Additionally, in scenarios that require flexible expansion of a specific domain, VersaTune reduces performance degradation in other domains by 38.77% while preserving training efficacy on the target domain.
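The abstract describes a dynamic re-weighting of training domains driven by each domain's learnable potential and degree of forgetting. The sketch below is an illustrative interpretation of that idea only: the function name, the additive score, and the softmax-style renormalization are assumptions for exposition, not the paper's actual update rule.

```python
import numpy as np

def update_domain_weights(weights, potential, forgetting, temperature=1.0):
    """Re-weight training domains from learnable potential and forgetting degree.

    Illustrative sketch (not the paper's exact rule): each domain's score adds
    its learnable potential (estimated remaining headroom) and its forgetting
    degree (observed performance drop), then the current weights are updated
    multiplicatively and renormalized so they remain a valid distribution.
    """
    scores = np.asarray(potential) + np.asarray(forgetting)
    logits = np.log(np.asarray(weights) + 1e-12) + scores / temperature
    new_weights = np.exp(logits - logits.max())
    return new_weights / new_weights.sum()

# Example: three domains (law, medicine, code) starting from uniform weights.
w = update_domain_weights(
    weights=[1/3, 1/3, 1/3],
    potential=[0.20, 0.05, 0.10],   # assumed headroom estimates per domain
    forgetting=[0.00, 0.12, 0.02],  # assumed performance drops per domain
)
print(w)  # domains with more headroom or more forgetting receive larger weight
```

Under this reading, domains that are still improving or visibly regressing get more data in the next training round, while saturated domains are down-weighted.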
Anthology ID:
2025.emnlp-main.337
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
6645–6669
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.337/
Cite (ACL):
Keer Lu, Keshi Zhao, Zhuoran Zhang, Zheng Liang, Bin Cui, Tengjiao Wang, and Wentao Zhang. 2025. VersaTune: An Efficient Data Composition Framework for Training Multi-Capability LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6645–6669, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
VersaTune: An Efficient Data Composition Framework for Training Multi-Capability LLMs (Lu et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.337.pdf
Checklist:
2025.emnlp-main.337.checklist.pdf