3DS: Medical Domain Adaptation of LLMs via Decomposed Difficulty-based Data Selection
Hongxin Ding, Yue Fang, Runchuan Zhu, Xinke Jiang, Jinyang Zhang, Yongxin Xu, Weibin Liao, Xu Chu, Junfeng Zhao, Yasha Wang
Abstract
Large Language Models (LLMs) excel in general language tasks, motivating their adaptation to specialized domains such as healthcare. Effective domain adaptation typically involves supervised fine-tuning (SFT) on carefully selected instruction-tuning data. Current data selection methods adopt a data-centric approach, relying on external annotations and heuristics to identify externally defined high-quality or challenging data. Our exploratory experiments highlight this approach fails to improve the model’s domain performance, due to misalignment between selected data and the model’s knowledge distribution. To tackle this, we propose Decomposed Difficulty-based Data Selection (3DS), a two-stage model-centric data selection framework that aligns data selection with the model’s distribution. 3DS employs Prompt-Driven Data Selection to filter out noise based on the model’s knowledge via explicit alignment in Stage#1, then adopts Decomposed Difficulty-based Data Selection to guide selection via three novel data difficulty metrics, including Instruction Understanding, Response Confidence, and Response Correctness in Stage#2, enhanced by an attention-based importance weighting mechanism for accurate calibration.Extensive experiments in the healthcare domain show 3DS outperforms existing methods by up to 2.97% accuracy, with additional validation in law and general domains, confirming its generalization ability. Our dataset and code are open-sourced at https://github.com/PuppyKnightUniversity/3DS.- Anthology ID:
- 2025.emnlp-main.983
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 19473–19495
- Language:
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.983/
- DOI:
- Cite (ACL):
- Hongxin Ding, Yue Fang, Runchuan Zhu, Xinke Jiang, Jinyang Zhang, Yongxin Xu, Weibin Liao, Xu Chu, Junfeng Zhao, and Yasha Wang. 2025. 3DS: Medical Domain Adaptation of LLMs via Decomposed Difficulty-based Data Selection. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19473–19495, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- 3DS: Medical Domain Adaptation of LLMs via Decomposed Difficulty-based Data Selection (Ding et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.983.pdf