Jiamin Chen
2025
ALinFiK: Learning to Approximate Linearized Future Influence Kernel for Scalable Third-Party LLM Data Valuation
Yanzhou Pan | Huawei Lin | Yide Ran | Jiamin Chen | Xiaodong Yu | Weijie Zhao | Denghui Zhang | Zhaozhuo Xu
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Large Language Models (LLMs) rely heavily on high-quality training data, making data valuation crucial for optimizing model performance, especially when working within a limited budget. In this work, we aim to offer a third-party data valuation approach that benefits both data providers and model developers. We introduce a linearized future influence kernel (LinFiK), which assesses the value of individual data samples in improving LLM performance during training. We further propose ALinFiK, a learning strategy to approximate LinFiK, enabling scalable data valuation. Our comprehensive evaluations show that this approach surpasses existing baselines in effectiveness and efficiency, with significant scalability advantages as the number of LLM parameters increases.
2023
Rare Codes Count: Mining Inter-code Relations for Long-tail Clinical Text Classification
Jiamin Chen | Xuhong Li | Junting Xi | Lei Yu | Haoyi Xiong
Proceedings of the 5th Clinical Natural Language Processing Workshop
Multi-label clinical text classification, such as automatic ICD coding, has always been a challenging subject in Natural Language Processing, due to its long, domain-specific documents and long-tail distribution over a large label set. Existing methods adopt different model architectures to encode the clinical notes. However, without exploiting the useful connections between labels, these models show a large gap in predictive performance between rare and frequent codes. In this work, we propose a novel method that mines the helpful relations between different codes via a relation-enhanced code encoder to improve rare-code performance. Starting from simple code descriptions alone, the model achieves performance comparable to, or even better than, models that rely on heavy external knowledge. Our proposed method is evaluated on MIMIC-III, a common dataset in the medical domain. It outperforms the previous state-of-the-art models on both overall metrics and rare-code performance. Moreover, the interpretation results further support the effectiveness of our method. Our code is publicly available at https://github.com/jiaminchen-1031/Rare-ICD.
Co-authors
- Xuhong Li 1
- Huawei Lin 1
- Yanzhou Pan 1
- Yide Ran 1
- Junting Xi 1