Abstract
This study extends previous research on literary quality by using information theory-based methods to assess the perplexity recorded by three large language models when processing 20th-century English novels of high literary quality, recognized by experts as canonical, compared to a broader control group. We find that canonical texts appear to elicit higher perplexity in the models, and we explore which textual features might contribute to this effect. We find that a more heavily nominal style, together with a more diverse vocabulary, is one of the leading causes of the difference between the two groups. These traits could reflect “strategies” for achieving an informationally dense literary style.
- Anthology ID: 2024.latechclfl-1.16
- Volume: Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)
- Month: March
- Year: 2024
- Address: St. Julians, Malta
- Editors: Yuri Bizzoni, Stefania Degaetano-Ortlieb, Anna Kazantseva, Stan Szpakowicz
- Venues: LaTeCHCLfL | WS
- Publisher: Association for Computational Linguistics
- Pages: 172–184
- URL: https://aclanthology.org/2024.latechclfl-1.16
- Cite (ACL): Yaru Wu, Yuri Bizzoni, Pascale Moreira, and Kristoffer Nielbo. 2024. Perplexing Canon: A study on GPT-based perplexity of canonical and non-canonical literary works. In Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024), pages 172–184, St. Julians, Malta. Association for Computational Linguistics.
- Cite (Informal): Perplexing Canon: A study on GPT-based perplexity of canonical and non-canonical literary works (Wu et al., LaTeCHCLfL-WS 2024)
- PDF: https://preview.aclanthology.org/add_acl24_videos/2024.latechclfl-1.16.pdf
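The perplexity measure at the core of the study can be illustrated with a minimal sketch: perplexity is the exponential of the negative mean token log-probability assigned by a language model. The function name and the example probabilities below are illustrative only, not taken from the paper's implementation:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean per-token log-probability.

    Higher values mean the model found the text less predictable;
    the paper reports higher perplexity for canonical novels.
    """
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Illustrative example: if a model assigns each token probability 1/4,
# perplexity equals 4, i.e. the text is as unpredictable as a
# uniform choice among four tokens.
logs = [math.log(0.25)] * 10
print(perplexity(logs))  # ≈ 4.0
```

In practice the per-token log-probabilities would come from a causal language model scoring each novel, and the resulting perplexities would be compared across the canonical and control groups.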