Lucence Ing

2026

A Parallel Corpus of the Parable of the Prodigal Son: Building a Resource for Documenting Language Varieties in Mainland France
Lucence Ing | Juliette Janès | Sven Ködel | Benoît Sagot
Proceedings of the Fifteenth Language Resources and Evaluation Conference

This paper presents a historical parallel corpus of languages spoken in metropolitan France. It consists of versions of the Parable of the Prodigal Son collected during the 19th century. The paper discusses the value of such a corpus, its construction (through XML/TEI encoding, semi-automatic alignment, and projection onto linguistic maps), and its potential uses for the study of these low-resource languages.

Phrase-Level Segmentation on Medieval Corpora for Aligning Multilingual Texts
Lucence Ing | Matthias Gille Levenson | Carolina Macedo
Proceedings of the Fifteenth Language Resources and Evaluation Conference

This paper presents an approach to multilingual alignment for medieval languages, focusing on the prior step of "phrase" segmentation. It outlines the challenges posed by historical data and describes different strategies for segmenting texts in multiple languages. It releases a gold-standard segmentation corpus based on various literary and historical works from the late Middle Ages in Europe. This corpus consists of texts in seven medieval languages (French, Castilian, Catalan, Portuguese, Latin, Italian, English). Several architectures are tested with both in-domain and out-of-domain evaluation sets.

2025

We present COLaF, a project dedicated to collecting and developing natural language processing (NLP) tools and resources for French and the other languages of France, with particular attention to less-resourced languages and varieties. The project covers text, audio, and video data, in order to provide corpora and tools for written, spoken, and signed language. It includes the collection, normalization, and documentation of existing data, including data that are currently inaccessible or unusable for research purposes, as well as the development of NLP tools suited to these languages, such as tools for linguistic annotation and machine translation. This article presents the main challenges posed by the project and its first results.