Target-Aware Language Modeling via Granular Data Sampling
Ernie Chang, Pin-Jie Lin, Yang Li, Changsheng Zhao, Daeil Kim, Rastislav Rabatin, Zechun Liu, Yangyang Shi, Vikas Chandra
Abstract
Language model pretraining generally targets a broad range of use cases and incorporates data from diverse sources. However, there are instances where we desire a model that excels in specific areas without markedly compromising performance in other areas. A cost-effective and straightforward approach is sampling with low-dimensional data features, which allows selecting large-scale pretraining data for domain-specific use cases. In this work, we revisit importance sampling with n-gram features consisting of multi-granular tokens, which strikes a good balance between sentence compression and representation capabilities. We observed the sampled data to have a high correlation with the target downstream task performance *while preserving its effectiveness on other tasks*. This leads to the proposed data sampling paradigm where language models can be pretrained more efficiently on selected documents. On eight benchmarks we demonstrate with ~1% of the data, pretrained models perform on par with the full RefinedWeb data and outperform randomly selected samples for model sizes ranging from 125M to 1.5B.- Anthology ID:
- 2024.emnlp-main.719
- Volume:
- Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- EMNLP
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 12927–12935
- Language:
- URL:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/2024.emnlp-main.719/
- DOI:
- 10.18653/v1/2024.emnlp-main.719
- Cite (ACL):
- Ernie Chang, Pin-Jie Lin, Yang Li, Changsheng Zhao, Daeil Kim, Rastislav Rabatin, Zechun Liu, Yangyang Shi, and Vikas Chandra. 2024. Target-Aware Language Modeling via Granular Data Sampling. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 12927–12935, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- Target-Aware Language Modeling via Granular Data Sampling (Chang et al., EMNLP 2024)
- PDF:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/2024.emnlp-main.719.pdf