CoMMA, a Large-scale Corpus of Multilingual Medieval Archives

Thibault Clérice, Simon Gabay, Malamatenia Vlachou-Efsthatiou, Ariane Pinche, Benoît Sagot


Abstract
We present CoMMA, a large-scale corpus of medieval manuscripts produced through automatic text recognition. The corpus contains around 2.5 billion tokens drawn from more than 23,000 digitized manuscripts in Latin and Old French, harvested via IIIF. Unlike other resources, it consists of raw, non-normalized text enriched with layout analysis and is distributed in various formats. We describe the pipeline used for large-scale acquisition and processing, and report quantitative and qualitative evaluations (average CER 9.7%). The resulting resource supports multiple use cases, from pretraining language models to corpus linguistics on historical languages and digital humanities applications.
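For readers unfamiliar with the metric quoted in the abstract, the sketch below shows how a character error rate (CER) is conventionally computed: the Levenshtein edit distance between a reference transcription and the recognized text, divided by the length of the reference. This is the standard textbook definition, not necessarily the exact evaluation protocol used for CoMMA; the function names `levenshtein` and `cer` are illustrative, and details such as text normalization or how scores are averaged across manuscripts are not specified in the abstract.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hyp` into `ref`."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one character dropped from a 7-character word.
    print(f"{cer('dominus', 'domnus'):.3f}")
```

Under this definition, a CER of 9.7% means roughly one character-level edit per ten reference characters, averaged over the evaluated material.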
Anthology ID:
2026.lrec-main.560
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resources Association
Pages:
7034–7045
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.560/
Cite (ACL):
Thibault Clérice, Simon Gabay, Malamatenia Vlachou-Efsthatiou, Ariane Pinche, and Benoît Sagot. 2026. CoMMA, a Large-scale Corpus of Multilingual Medieval Archives. International Conference on Language Resources and Evaluation, main:7034–7045.
Cite (Informal):
CoMMA, a Large-scale Corpus of Multilingual Medieval Archives (Clérice et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.560.pdf