Mélanie Jouitteau
2026
Prerequisites for Advancing Automatic Speech Recognition in Breton
Morgan Grobol | Alice Millour | Wassim Zemouri | Yuna Drapier | Mélanie Jouitteau
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We report on the extensive preliminary work of a collaborative science project aimed at developing Automatic Speech Recognition (ASR) for a minoritized European language: Breton. Hoping to help similar initiatives for other languages and communities, we present the methodology we developed for this specific ecosystem, with an estimate of the material and immaterial resources we used. Our approach is grounded in the needs and resources of the community formed by the end-users of digital development. Our multidisciplinary scientific collaboration involves linguists and speakers embedded in the academic and linguistic community, and computer scientists.
2024
ARBRES Kenstur: A Breton-French Parallel Corpus Rooted in Field Linguistics
Loïc Grobol | Mélanie Jouitteau
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
ARBRES is an ongoing project of open science implemented as a platform (“wikigrammar”) documenting both the Breton language itself and the state of research and engineering work in linguistics and NLP. Over its nearly 15 years of operation, it has aggregated a wealth of linguistic data in the form of interlinear glosses with translations illustrating lexical items, grammatical features, dialectal variations, and more. While these glosses were primarily meant for human consumption, their volume and the regular format imposed by the wiki engine used for the website also make them suitable for machine processing. ARBRES Kenstur is a new parallel corpus derived from the glosses in ARBRES, including about 5k phrases and sentences in Breton along with translations in standard French. The nature of the original data, sourced from field linguistic inquiries meant to document the structure of Breton, leads to a resource that is by construction more concerned with the internal variations of the language and rare phenomena than typical parallel corpora. Preliminary experiments in using this corpus show that it can help improve machine translation for Breton, demonstrating that sourcing data from field linguistic documentation can be a way to help provide NLP tools for minority and low-resource languages.