Phrase-Level Segmentation on Medieval Corpora for Aligning Multilingual Texts

Lucence Ing, Matthias Gille Levenson, Carolina Macedo


Abstract
This paper presents an approach to multilingual alignment for medieval languages, focusing on the prior step of "phrase" segmentation. It outlines the challenges posed by historical data and describes different strategies for segmenting texts in multiple languages. It releases a gold-standard segmentation corpus based on various literary and historical works from the late Middle Ages in Europe. This corpus consists of texts in seven medieval languages (French, Castilian, Catalan, Portuguese, Latin, Italian, English). Several architectures are tested with both in-domain and out-of-domain evaluation sets.
Anthology ID:
2026.lrec-main.72
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
936–946
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.72/
Cite (ACL):
Lucence Ing, Matthias Gille Levenson, and Carolina Macedo. 2026. Phrase-Level Segmentation on Medieval Corpora for Aligning Multilingual Texts. International Conference on Language Resources and Evaluation, main:936–946.
Cite (Informal):
Phrase-Level Segmentation on Medieval Corpora for Aligning Multilingual Texts (Ing et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.72.pdf