LongWanjuan: Towards Systematic Measurement for Long Text Quality
Xiaoran Liu, Kai Lv, Qipeng Guo, Hang Yan, Conghui He, Xipeng Qiu, Dahua Lin
Abstract
The quality of training data is crucial for enhancing the long-text capabilities of foundation models. Despite existing efforts to refine data quality through heuristic rules and evaluations based on data diversity and difficulty, there’s a lack of systematic approaches specifically tailored for assessing long texts. Addressing this gap, our work systematically measures the quality of long texts by evaluating three fundamental linguistic dimensions: coherence, cohesion, and complexity. Drawing inspiration from the aforementioned three dimensions, we introduce a suite of metrics designed to evaluate the quality of long texts, encompassing both statistical and pre-trained language model-based ones. Leveraging these metrics, we present LongWanjuan, a bilingual dataset specifically tailored to enhance the training of language models for long-text tasks with over 160B tokens. In LongWanjuan, we categorize long texts into holistic, aggregated, and chaotic types, enabling a detailed analysis of long-text quality. Furthermore, we devise a data mixture recipe that strategically balances different types of long texts within LongWanjuan, leading to significant improvements in model performance on long-text tasks.- Anthology ID:
- 2024.findings-emnlp.327
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2024
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- Findings
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 5709–5725
- Language:
- URL:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/2024.findings-emnlp.327/
- DOI:
- 10.18653/v1/2024.findings-emnlp.327
- Cite (ACL):
- Xiaoran Liu, Kai Lv, Qipeng Guo, Hang Yan, Conghui He, Xipeng Qiu, and Dahua Lin. 2024. LongWanjuan: Towards Systematic Measurement for Long Text Quality. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5709–5725, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- LongWanjuan: Towards Systematic Measurement for Long Text Quality (Liu et al., Findings 2024)
- PDF:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/2024.findings-emnlp.327.pdf