2025
Rethinking KenLM: Good and Bad Model Ensembles for Efficient Text Quality Filtering in Large Web Corpora
Yungi Kim | Hyunsoo Ha | Sukyung Lee | Jihoo Kim | Seonghoon Yang | Chanjun Park
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
With the increasing demand for substantial amounts of high-quality data to train large language models (LLMs), efficiently filtering large web corpora has become a critical challenge. For this purpose, KenLM, a lightweight n-gram-based language model that operates on CPUs, is widely used. However, the traditional method of training KenLM utilizes only high-quality data and, consequently, does not explicitly learn the linguistic patterns of low-quality data. To address this issue, we propose an ensemble approach that leverages two contrasting KenLMs: (i) Good KenLM, trained on high-quality data; and (ii) Bad KenLM, trained on low-quality data. Experimental results demonstrate that our approach significantly reduces noisy content while preserving high-quality content compared to the traditional KenLM training method. This indicates that our method can be a practical solution with minimal computational overhead for resource-constrained environments.
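The following is a minimal sketch of the Good/Bad ensemble idea described above: score each document with both KenLM models and keep documents that look closer to the high-quality model. The model paths, the perplexity-difference scoring rule, and the threshold are illustrative assumptions, not the authors' released code; KenLM's Python binding does provide `Model.perplexity`.

```python
# Hedged sketch of the Good/Bad KenLM ensemble for quality filtering.
# Assumptions: hypothetical model paths and a perplexity-difference rule;
# the paper's exact ensemble formulation may differ.
import kenlm  # CPU-only n-gram LM library (https://github.com/kpu/kenlm)

good_lm = kenlm.Model("good_kenlm.arpa.bin")  # hypothetical: trained on high-quality text
bad_lm = kenlm.Model("bad_kenlm.arpa.bin")    # hypothetical: trained on low-quality text

def ensemble_score(text: str) -> float:
    """Lower is better: high-quality text should have low perplexity under
    the Good model and high perplexity under the Bad model."""
    return good_lm.perplexity(text) - bad_lm.perplexity(text)

def keep(text: str, threshold: float = 0.0) -> bool:
    """Keep a document if the Good model fits it better than the Bad model."""
    return ensemble_score(text) < threshold

docs = [
    "A well-formed sentence about training language models.",
    "bUy ch3ap!!! click click click www spam spam",
]
kept = [d for d in docs if keep(d)]
```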
2024
SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling
Sanghoon Kim | Dahyun Kim | Chanjun Park | Wonsung Lee | Wonho Song | Yunsu Kim | Hyeonwoo Kim | Yungi Kim | Hyeonju Lee | Jihoo Kim | Changbae Ahn | Seonghoon Yang | Sukyung Lee | Hyunbyung Park | Gyoungjin Gim | Mikyoung Cha | Hwalsuk Lee | Sunghun Kim
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)
We introduce SOLAR 10.7B, a large language model (LLM) with 10.7 billion parameters, demonstrating superior performance in various natural language processing (NLP) tasks. Inspired by recent efforts to efficiently up-scale LLMs, we present a method for scaling LLMs called depth up-scaling (DUS), which encompasses depthwise scaling and continued pretraining. In contrast to other LLM up-scaling methods that use mixture-of-experts, DUS does not require complex changes to train and run inference efficiently. We show experimentally that DUS is simple yet effective in scaling up high-performance LLMs from small ones. Building on the DUS model, we additionally present SOLAR 10.7B-Instruct, a variant fine-tuned for instruction-following capabilities, surpassing Mixtral-8x7B-Instruct. SOLAR 10.7B is publicly available under the Apache 2.0 license, promoting broad access and application in the LLM field.
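Below is a minimal sketch of the depthwise-scaling step of DUS on a toy layer stack: the base stack is duplicated, an overlapping band of layers is dropped from the seam, and the two copies are spliced. The toy blocks, the overlap size, and the layer counts are illustrative assumptions; the actual recipe uses transformer decoder layers and is followed by continued pretraining.

```python
# Hedged sketch of depthwise scaling (the first stage of DUS) on a toy stack
# of blocks. The block type (nn.Linear) and the drop size are assumptions
# made for illustration, not the authors' implementation.
import copy
import torch.nn as nn

def depth_up_scale(layers: nn.ModuleList, drop: int) -> nn.ModuleList:
    """Duplicate a layer stack and splice the copies with an overlap removed.

    The first copy keeps all but its last `drop` layers, the second copy
    keeps all but its first `drop` layers; concatenating them yields
    2 * (len(layers) - drop) layers.
    """
    n = len(layers)
    head = [copy.deepcopy(layers[i]) for i in range(n - drop)]  # layers 0 .. n-drop-1
    tail = [copy.deepcopy(layers[i]) for i in range(drop, n)]   # layers drop .. n-1
    return nn.ModuleList(head + tail)

# Toy example: a 32-block stack becomes a 48-block stack with drop=8.
base = nn.ModuleList(nn.Linear(16, 16) for _ in range(32))
scaled = depth_up_scale(base, drop=8)
print(len(base), "->", len(scaled))  # 32 -> 48; continued pretraining would follow
```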