CorpusClues: Scalable Unsupervised Similarity Search for Historical Texts Using MinHash-LSH

Paulien Lemay, Klaas Bentein, Els Lefever


Abstract
CorpusClues is a prototype web-based platform for large-scale, unsupervised clustering of textual data, designed to address the specific challenges of historical corpora. It leverages the well-established computational techniques of MinHash and Locality-Sensitive Hashing (LSH) at the character level in order to detect structural similarities between texts even when exact patterns diverge. This approach makes CorpusClues robust to orthographic variation, such as historical spelling differences, while remaining fast and language-agnostic, capable of processing large and heterogeneous corpora without relying on language-specific models or preprocessing. Researchers can explore resulting clusters through interactive visualizations and exportable data, gaining access to patterns that would otherwise require the slow and uncertain process of manual collation. Evaluation against labeled gold standards shows that the system consistently produces high-quality clustering, accurately reconstructing relationships between texts despite substantial orthographic variation. By combining computational efficiency with user-friendly design, CorpusClues provides an accessible yet rigorous means of uncovering formulaicity and textual transmission at scale, opening new possibilities for the study of historical textual traditions.
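To make the technique named in the abstract concrete, here is a minimal sketch of character-level MinHash with LSH banding, written with the Python standard library only. It is not the CorpusClues implementation: the shingle length (`k=4`), the signature length (`num_perm=64`), the number of bands (`bands=16`), the BLAKE2b-based hash family, all function names, and the toy texts are assumptions chosen purely for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def char_shingles(text, k=4):
    """Return the set of overlapping character k-grams of a text."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle; each seed acts as one hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingles, num_perm=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [min(_hash(s, seed) for s in shingles) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=16):
    """Return pairs of names whose signatures agree on every row of at least one band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


# Toy texts (invented for this sketch): two spelling variants and one unrelated line.
texts = {
    "a": "In the name of God amen I John Smith of London",
    "b": "In the name of Godde amen I Iohn Smyth of London",
    "c": "Quarterly wheat prices rose sharply",
}
sigs = {name: minhash_signature(char_shingles(t)) for name, t in texts.items()}
pairs = lsh_candidate_pairs(sigs)
est_ab = estimated_jaccard(sigs["a"], sigs["b"])
est_ac = estimated_jaccard(sigs["a"], sigs["c"])
```

Because the shingles are character k-grams rather than words, a spelling change such as "Smith" versus "Smyth" alters only the few shingles that overlap the changed letter, so the two variants still share much of their shingle set. Banding trades precision for recall: more bands with fewer rows each make it likelier that moderately similar texts land in a common bucket, at the cost of more candidate pairs to verify.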
Anthology ID:
2026.lrec-main.63
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
838–847
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.63/
Cite (ACL):
Paulien Lemay, Klaas Bentein, and Els Lefever. 2026. CorpusClues: Scalable Unsupervised Similarity Search for Historical Texts Using MinHash-LSH. International Conference on Language Resources and Evaluation, main:838–847.
Cite (Informal):
CorpusClues: Scalable Unsupervised Similarity Search for Historical Texts Using MinHash-LSH (Lemay et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.63.pdf
Optional supplementary material:
2026.lrec-main.63.OptionalSupplementaryMaterial.zip