Hao Chen


2025

DIDS: Domain Impact-aware Data Sampling for Large Language Model Training
Weijie Shi | Jipeng Zhang | Yaguang Wu | Jingzhi Fang | Shibo Zhang | Yao Zhao | Hao Chen | Ruiyuan Zhang | Yue Cui | Jia Zhu | Sirui Han | Jiajie Xu | Xiaofang Zhou
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) are commonly trained on multi-domain datasets, where domain sampling strategies significantly impact model performance due to varying domain importance across downstream tasks. Existing approaches for optimizing domain-level sampling strategies struggle with maintaining intra-domain consistency and accurately measuring domain impact. In this paper, we present Domain Impact-aware Data Sampling (DIDS). To ensure intra-domain consistency, a gradient clustering algorithm is proposed to group training data based on their learning effects, where a proxy language model and dimensionality reduction are employed to reduce computational overhead. To accurately measure domain impact, we develop a Fisher Information Matrix (FIM) guided metric that quantifies how domain-specific parameter updates affect the model’s output distributions on downstream tasks, with theoretical guarantees. Furthermore, to determine optimal sampling ratios, DIDS combines both the FIM-guided domain impact assessment and loss learning trajectories that indicate domain-specific potential, while accounting for diminishing marginal returns. Extensive experiments demonstrate that DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency. The code is available at https://github.com/shiweijiezero/DIDS.
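The gradient clustering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes per-example gradients from a proxy model are already available as a NumPy array, and it swaps in a Gaussian random projection for the dimensionality reduction and plain k-means for the clustering. All function names and parameters here are hypothetical.

```python
import numpy as np

def random_project(grads, out_dim, seed=0):
    """Reduce gradient dimensionality with a Gaussian random projection (assumed stand-in)."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((grads.shape[1], out_dim)) / np.sqrt(out_dim)
    return grads @ proj

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-first initialisation."""
    # Start from the first point, then repeatedly add the point that is
    # farthest from every center chosen so far.
    center_idx = [0]
    for _ in range(1, k):
        chosen = points[center_idx]
        dists = np.linalg.norm(points[:, None, :] - chosen[None, :, :], axis=2)
        center_idx.append(int(dists.min(axis=1).argmax()))
    centers = points[center_idx].copy()

    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # Assign every point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members (skip empty clusters).
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels

def cluster_by_gradient(grads, k, out_dim=16, seed=0):
    """Group examples whose (projected) gradients point in similar directions."""
    reduced = random_project(grads, out_dim, seed)
    # Normalise so that clustering reflects gradient direction, not magnitude.
    reduced = reduced / (np.linalg.norm(reduced, axis=1, keepdims=True) + 1e-12)
    return kmeans(reduced, k)
```

Examples that receive the same label would then be treated as one "domain" for sampling purposes. How the resulting clusters are weighted (the FIM-guided impact metric and the loss-trajectory term) is outside the scope of this sketch.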

Making RALM Robust to Irrelevant Contexts via Layer Knowledge Guided Attention
Weijie Shi | Hao Chen | Jiaming Li | Yao Zhao | Yazhong Zhang | Qijin Chen | Jipeng Zhang | Ruiyuan Zhang | Jia Zhu | Jiajie Xu | Xiaofang Zhou
Findings of the Association for Computational Linguistics: ACL 2025

Retrieval-augmented language models (RALMs) aim to incorporate external knowledge to address the issues of factual hallucination and knowledge obsolescence faced by large language models (LLMs). Inevitably, passages retrieved by similarity search may be irrelevant to the given question, and aggregating these passages can confuse the model and keep it from giving a correct answer. To improve the performance of RALMs under such conditions, we propose layer-knowledge guided attention for RALMs, which harnesses the layer-wise knowledge of LLMs to optimize per-layer attention on useful passages, making the model attend to the most relevant content and ignore irrelevant content. Specifically, we first systematically study LLMs' attention patterns and their relationship with the accuracy of RALM responses, where middle-focus attentions play a crucial role in selectively gathering relevant information. Based on this, a layer-wise passage estimator leverages the varied knowledge encoded across LLM layers to assess not only passage relevance scores but also the associated confidences. Finally, a relevance-aware passage fusion enables selective attention to relevant passages, mitigating the distractibility and positional bias of causal attention. Experiments show that our method outperforms existing methods on RALM benchmarks.