Rethinking the Role of Text Complexity in Language Model Pretraining

Dan John Velasco, Matthew Theodore Roque


Abstract
Improving pretraining data quality and size is known to boost downstream performance, but the role of text complexity (how hard a text is to read) remains less explored. We reduce surface-level complexity (shorter sentences, simpler words, simpler structure) while keeping core content approximately constant and ask: (i) How does complexity affect language modeling across model sizes? (ii) Can useful representations be learned from simpler text alone? (iii) How does pretraining text complexity influence downstream language understanding? We simplify human-written texts using a large language model, pretrain causal models (28M–500M parameters) from scratch on original vs. simplified data, and evaluate them in fine-tuning and zero-shot setups. We find that perplexity is sensitive to the interaction between model capacity and text complexity: smaller models degrade far less on simpler texts. Text complexity has little impact on fine-tuning evaluations, but zero-shot evaluations indicate that simpler texts benefit performance on linguistic knowledge tasks, whereas more complex texts favor tasks requiring world knowledge and entity tracking. Our findings suggest that different types of data diversity affect transfer and zero-shot performance differently, providing insight into tailoring data curation to specific goals.
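To make the perplexity comparison concrete, the sketch below scores an original and a simplified version of the same passage with a small causal language model. This is a minimal illustration, not the authors' code: the model name ("gpt2") stands in for the 28M–500M models pretrained from scratch in the paper, and the example sentences are placeholders.

# Minimal sketch (assumed setup, not from the paper): perplexity of a small
# causal LM on an original vs. a simplified version of the same passage.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for a 28M-500M model trained from scratch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    """Exponentiated mean token-level negative log-likelihood."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

# Placeholder passages illustrating reduced surface-level complexity
original = "The committee postponed ratification of the treaty pending further deliberation."
simplified = "The group delayed approving the deal until they could talk more."

print(f"original:   {perplexity(original):.2f}")
print(f"simplified: {perplexity(simplified):.2f}")

The paper's finding that smaller models degrade far less on simpler texts would show up here as a smaller perplexity gap between the two passages for a small model than for a larger one.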
Anthology ID:
2025.babylm-main.1
Volume:
Proceedings of the First BabyLM Workshop
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Y. Hu, Jing Liu, Jaap Jumelet, Tal Linzen, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox, Adina Williams
Venue:
BabyLM
Publisher:
Association for Computational Linguistics
Pages:
1–28
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.1/
Cite (ACL):
Dan John Velasco and Matthew Theodore Roque. 2025. Rethinking the Role of Text Complexity in Language Model Pretraining. In Proceedings of the First BabyLM Workshop, pages 1–28, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Rethinking the Role of Text Complexity in Language Model Pretraining (Velasco & Roque, BabyLM 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.1.pdf