In Search of Lost Adventure Novels: Supervised Genre Retrieval and Corpus Refinement in Gallica

Jean Barré


Abstract
This paper addresses a practical problem in computational literary history: retrieving adventure novels from a large digitized collection of French fiction where genre metadata are sparse and unreliable. We begin with supervised genre modeling based on a historically situated seed list of 101 adventure novels drawn from literary scholarship. We compare several classifiers and representations, and validate them against 364 independently labeled adventure novels from the Chapitres corpus. The best-performing model, HistGradientBoosting on mean paragraph embeddings, achieves strong external recall (81%) despite the small training set. We then apply this model to the 12,176-novel Fictions littéraires de Gallica archive and refine the resulting candidate corpus through a graph-based post-processing step over a k-nearest-neighbor similarity graph. On the Chapitres benchmark, this graph correction produces negligible changes in retrieval performance, indicating that it is not a generally superior classifier. On Gallica, however, it yields a more cohesive and homogeneous candidate corpus and surfaces interpretable correction cases, including missed canonical adventure novels and excluded borderline texts. We therefore argue that graph-based correction is best understood not as a replacement for supervised classification, but as a heuristic for refining large, noisy archival corpora where exhaustive manual annotation is impossible.
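As a rough illustration of what a graph-based correction over a k-nearest-neighbor similarity graph might look like, the sketch below blends each novel's classifier score with the mean score of its nearest neighbors and then re-thresholds. The abstract does not specify the correction rule, so everything here is an assumption made for illustration: the function names (`knn_graph`, `graph_correct`), the cosine similarity, the blending weight `alpha`, the threshold, and the toy vectors and scores.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_graph(vectors, k):
    """For each item, return the indices of its k most cosine-similar items."""
    neighbors = []
    for i, u in enumerate(vectors):
        sims = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        sims.sort(reverse=True)
        neighbors.append([j for _, j in sims[:k]])
    return neighbors

def graph_correct(scores, neighbors, alpha=0.5, threshold=0.5):
    """Blend each score with its neighbors' mean score, then threshold.

    alpha and threshold are hypothetical parameters, not values from the paper.
    """
    smoothed = []
    for i, s in enumerate(scores):
        nbr_mean = sum(scores[j] for j in neighbors[i]) / len(neighbors[i])
        smoothed.append((1 - alpha) * s + alpha * nbr_mean)
    return [x >= threshold for x in smoothed], smoothed

# Toy stand-ins for mean paragraph embeddings: two tight clusters of three texts.
vectors = [[1.0, 0.1], [1.0, 0.2], [0.9, 0.0],
           [0.1, 1.0], [0.2, 1.0], [0.0, 0.9]]
# Toy classifier scores: item 2 is a likely miss, item 5 a likely false positive.
scores = [0.9, 0.8, 0.4, 0.1, 0.2, 0.6]

neighbors = knn_graph(vectors, k=2)
labels, smoothed = graph_correct(scores, neighbors)
```

In this toy setup the low-scoring text in the first cluster is pulled above the threshold by its neighbors, and the high-scoring text in the second cluster is pulled below it, which mirrors the two kinds of correction cases the abstract mentions (missed novels recovered, borderline texts excluded).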
Anthology ID:
2026.nlp4dh-1.24
Volume:
Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities
Month:
July
Year:
2026
Address:
San Diego, USA
Editors:
Sil Hamilton, Emily Öhman, Rebecca M. M. Hicke, Yuri Bizzoni, Axel Bax, Jacob A. Matthews, Mika Hämäläinen
Venues:
NLP4DH | WS
Publisher:
Association for Computational Linguistics
Pages:
255–263
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.nlp4dh-1.24/
Cite (ACL):
Jean Barré. 2026. In Search of Lost Adventure Novels: Supervised Genre Retrieval and Corpus Refinement in Gallica. In Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities, pages 255–263, San Diego, USA. Association for Computational Linguistics.
Cite (Informal):
In Search of Lost Adventure Novels: Supervised Genre Retrieval and Corpus Refinement in Gallica (Barré, NLP4DH 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.nlp4dh-1.24.pdf