Juliette Janès

Also published as: Juliette Janes


2026

This paper presents a historical parallel corpus of languages spoken in metropolitan France. It consists of versions of the Parable of the Prodigal Son collected during the 19th century. The paper describes the value of such a corpus, its construction (XML/TEI encoding, semi-automatic alignment, and projection onto linguistic maps), and its potential uses for the study of these low-resource languages.
We introduce ForumOccitania, a new Occitan corpus of posts from an online forum, covering a range of topics and dialects. While some existing datasets for this low-resource language include labels of varieties within the dialect continuum, we go one step further by providing metadata pertaining to sociolinguistic factors of language variation (dialect, geographical location, age, proficiency), extracted from self-declared user profiles. We carry out statistical and qualitative analyses, as well as preliminary experiments on unsupervised dialect identification. Our results show that (i) most of the content is written in Occitan, follows the classical spelling conventions, and comes from young speakers, (ii) posts display a strong presence of dialectal features from four major Occitan varieties (Lemosin, Lengadocian, Gascon, Provençau), and (iii) a simple topic modelling approach introduced by Kuparinen and Scherrer (2024) effectively detects salient features of these four varieties, and also reveals finer-grained tendencies in diatopic variation.

2025

We present COLaF, a project dedicated to collecting and developing natural language processing (NLP) tools and resources for French and the other languages of France, with particular attention to less-resourced languages and varieties. The project covers text, audio, and video data, in order to provide corpora and tools for written, spoken, and signed language. It includes the collection, normalisation, and documentation of pre-existing data, including data that are currently inaccessible or unusable for research purposes, as well as the development of NLP tools adapted to these languages, such as tools for linguistic annotation and machine translation. This article presents the main challenges posed by the project and some initial results.

2024

Whether or not several Creole languages which developed during the early modern period can be considered genetic descendants of European languages has been the subject of intense debate. This is in large part due to the absence of evidence of intermediate forms. This work introduces a new open corpus, the Molyé corpus, which combines stereotypical representations of three kinds of language variation in Europe with early attestations of French-based Creole languages across a period of 400 years. It is intended to facilitate future research on the continuity between contact situations in Europe and Creolophone (former) colonies.