Identifying intertextual relationships between authors is of central importance to the study of literature. We report an empirical analysis of intertextuality in classical Latin literature using word embedding models. To enable quantitative evaluation of intertextual search methods, we curate a new dataset of 945 known parallels drawn from traditional scholarship on Latin epic poetry. We train an optimized word2vec model on a large corpus of lemmatized Latin, which achieves state-of-the-art performance for synonym detection and outperforms a widely used lexical method for intertextual search. We then demonstrate that training embeddings on very small corpora can capture salient aspects of literary style and apply this approach to replicate a previous intertextual study of the Roman historian Livy, which relied on hand-crafted stylometric features. Our results advance the development of core computational resources for a major premodern language and highlight a productive avenue for cross-disciplinary collaboration between the study of literature and NLP.
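As a rough illustration of how an embedding-based intertextual search might score a candidate parallel, the sketch below averages the word vectors of two lemmatized phrases and compares them by cosine similarity. The three-dimensional toy vectors, the lemma list, and the function names are hypothetical and chosen only for illustration. They are not taken from the trained model, and the scoring used in the study itself may differ.

```python
import math

# Hypothetical toy embeddings keyed by lemma. A trained model would supply
# vectors with hundreds of dimensions; these values are made up.
TOY_VECTORS = {
    "arma": [0.9, 0.1, 0.0],
    "vir": [0.2, 0.8, 0.1],
    "cano": [0.1, 0.3, 0.9],
    "bellum": [0.8, 0.2, 0.1],
    "homo": [0.3, 0.7, 0.2],
    "dico": [0.2, 0.2, 0.8],
}


def phrase_vector(lemmas, vectors):
    """Average the vectors of all lemmas that the model knows."""
    known = [vectors[lemma] for lemma in lemmas if lemma in vectors]
    if not known:
        raise ValueError("no lemma in the phrase has an embedding")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def phrase_similarity(lemmas_a, lemmas_b, vectors=TOY_VECTORS):
    """Score a candidate parallel as the cosine of the two phrase averages."""
    return cosine(phrase_vector(lemmas_a, vectors), phrase_vector(lemmas_b, vectors))
```

Averaging is only one simple way to compose word vectors into a phrase representation. Unlike exact lemma matching, it lets two phrases that share no lemma still receive a high score when their words are close in the embedding space.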
Classification of texts by genre is an important application of natural language processing to literary corpora but remains understudied for premodern and non-English traditions. We develop a stylometric feature set for ancient Greek that enables identification of texts as prose or verse. The set contains over 20 primarily syntactic features, which are calculated according to custom, language-specific heuristics. Using these features, we classify almost all surviving classical Greek literature as prose or verse with accuracy and F1 score both above 97%, and further classify a selection of the verse texts into the traditional genres of epic and drama.
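The general shape of a feature-based classifier of this kind can be sketched as follows. This is a minimal sketch under stated assumptions: the two features (frequency of a few marker words and mean sentence length), the marker set, and the nearest-centroid rule are all hypothetical stand-ins, not the feature set or the classifier used in the study.

```python
import math
import re
import unicodedata

# Hypothetical marker words, normalized to NFC so that accent encodings match.
# The study's actual features are not reproduced here.
MARKERS = {unicodedata.normalize("NFC", w) for w in ("καί", "δέ")}


def extract_features(text, markers=MARKERS):
    """Return [marker frequency per token, mean sentence length in tokens]."""
    text = unicodedata.normalize("NFC", text)
    # Split on the period, the Greek question mark (;), and the ano teleia.
    sentences = [s.split() for s in re.split(r"[.;·]", text) if s.split()]
    tokens = [tok.strip(",") for sent in sentences for tok in sent]
    if not tokens:
        return [0.0, 0.0]
    marker_freq = sum(tok in markers for tok in tokens) / len(tokens)
    mean_length = len(tokens) / len(sentences)
    return [marker_freq, mean_length]


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def classify(vector, labeled_vectors):
    """Assign the label whose training centroid is nearest (Euclidean)."""
    centroids = {label: centroid(rows) for label, rows in labeled_vectors.items()}
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))
```

In practice the features would need scaling before distances are compared, since raw sentence length would otherwise dominate a frequency that lies between 0 and 1.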
Computational stylometry has become an increasingly important aspect of literary criticism, but many humanists lack the technical expertise or language-specific NLP resources required to exploit computational methods. We demonstrate a stylometry toolkit for analysis of Latin literary texts, which is freely available at www.qcrit.org/stylometry. Our toolkit generates data for a diverse range of literary features and has an intuitive point-and-click interface. The features included have proven effective for multiple literary studies and are calculated using custom heuristics without the need for syntactic parsing. As such, the toolkit models one approach to the user-friendly generation of stylometric data, which could be extended to other premodern and non-English languages underserved by standard NLP resources.