Sriharsh Bhyravajjula
2025
so much depends / upon / a whitespace: Why Whitespace Matters for Poets and LLMs
Sriharsh Bhyravajjula | Melanie Walsh | Anna Preus | Maria Antoniak
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Whitespace is a critical component of poetic form, reflecting both adherence to standardized forms and rebellion against those forms. Each poem’s whitespace distribution reflects the artistic choices of the poet and is an integral semantic and spatial feature of the poem. Yet, despite the popularity of poetry both as a long-standing art form and as a generation task for large language models (LLMs), whitespace has not received sufficient attention from the NLP community. Using a corpus of 19k English-language published poems from Poetry Foundation, we investigate how 4k poets have used whitespace in their works. We release a subset of 2.8k public-domain poems with preserved formatting to facilitate further research in this area. We compare whitespace usage in the published poems to (1) 51k LLM-generated poems, and (2) 12k unpublished poems posted in an online community. We also explore whitespace usage across time periods, poetic forms, and data sources. Additionally, we find that different text processing methods can result in significantly different representations of whitespace in poetry data, motivating us to use these poems and whitespace patterns to discuss implications for the processing strategies used to assemble pretraining datasets for LLMs.