CoMMA, a Large-scale Corpus of Multilingual Medieval Archives
Thibault Clérice, Simon Gabay, Malamatenia Vlachou-Efsthatiou, Ariane Pinche, Benoît Sagot
Abstract
We present CoMMA, a large-scale corpus of medieval manuscripts produced through automatic text recognition. The corpus contains around 2.5 billion tokens drawn from more than 23,000 digitized manuscripts in Latin and Old French, harvested via IIIF. Unlike other resources, it consists of raw, non-normalized text enriched with layout analysis in various formats. We describe the pipeline used for large-scale acquisition and processing, and report quantitative and qualitative evaluations (average CER 9.7%). The resulting resource supports multiple use cases, from pretraining language models to corpus linguistics on historical languages and digital humanities applications.
- Anthology ID:
- 2026.lrec-main.560
- Volume:
- Proceedings of the Fifteenth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2026
- Address:
- Palma de Mallorca, Spain
- Editors:
- Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
- Venue:
- LREC
- Publisher:
- ELRA Language Resource Association
- Pages:
- 7034–7045
- URL:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.560/
- Cite (ACL):
- Thibault Clérice, Simon Gabay, Malamatenia Vlachou-Efsthatiou, Ariane Pinche, and Benoît Sagot. 2026. CoMMA, a Large-scale Corpus of Multilingual Medieval Archives. International Conference on Language Resources and Evaluation, main:7034–7045.
- Cite (Informal):
- CoMMA, a Large-scale Corpus of Multilingual Medieval Archives (Clérice et al., LREC 2026)
- PDF:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.560.pdf