Laurent Kevers


2024

Agettivu, Aggitivu o Aghjettivu? POS Tagging Corsican Dialects
Alice Millour | Lorenza Brasile | Alberto Ghia | Laurent Kevers
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this paper we present a series of experiments towards POS tagging Corsican, a less-resourced language spoken in Corsica and linguistically related to Italian. The first contribution is Corsican-POS, the first gold standard POS-tagged corpus for Corsican, composed of 500 sentences manually annotated with the Universal POS tagset. Our second contribution is a set of experiments and evaluations of POS tagging models, which starts with a baseline model for Italian and aims to find the best training configuration, namely in terms of the size and combination strategy of the existing raw and annotated resources. These experiments result in (i) the first POS tagger for Corsican, reaching an accuracy of 93.38%, and (ii) a quantification of the gain provided by the use of each available resource. We find that the optimal configuration uses Italian word embeddings further specialized with Corsican embeddings and trained on the largest gold corpus for Corsican available so far.

The ParCoLab Parallel Corpus and Its Extension to Four Regional Languages of France
Dejan Stosic | Saša Marjanović | Delphine Bernhard | Xavier Bach | Myriam Bras | Laurent Kevers | Stella Retali-Medori | Marianne Vergez-Couret | Carole Werner
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Parallel corpora are still scarce for most of the world’s language pairs. The situation is by no means different for regional languages of France. In addition, adequate web interfaces facilitate and encourage the use of parallel corpora by target users, such as language learners and teachers, as well as linguists. In this paper, we describe ParCoLab, a parallel corpus and a web platform for querying the corpus. From the outset, ParCoLab has been geared towards lower-resource languages, with an initial corpus in Serbian, along with French and English (later Spanish). We focus here on the extension of ParCoLab with a parallel corpus for four regional languages of France: Alsatian, Corsican, Occitan and Poitevin-Saintongeais. In particular, we detail criteria for choosing texts and issues related to their collection. The new parallel corpus contains more than 20k tokens per regional language.

2022

CoSwID, a Code Switching Identification Method Suitable for Under-Resourced Languages
Laurent Kevers
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages

We propose a method for identifying monolingual textual segments in multilingual documents. It requires only a minimal number of linguistic resources – word lists and monolingual corpora – and can therefore be adapted to many under-resourced languages. Taking these languages into account when processing multilingual documents in NLP tools is important as it can contribute to the creation of essential textual resources. This language identification task – code switching detection being its most complex form – can also provide added value to various existing data or tools. Our research demonstrates that a language identification module performing well on short texts can be used to efficiently analyse a document through a sliding window. The results obtained for code switching identification – between 87.29% and 97.97% accuracy – are state-of-the-art, which is confirmed by the benchmarks performed on the few available systems that have been used on our test data.
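The sliding-window idea described in this abstract can be illustrated with a minimal sketch. This is not the CoSwID implementation: the word lists, the `identify` scoring rule, the window size, and the per-token majority vote are all assumptions chosen for illustration, standing in for a real short-text language identifier.

```python
from collections import Counter
from itertools import groupby

# Toy word lists (hypothetical); a real system would use large lexicons
# and corpus-derived statistics.
WORD_LISTS = {
    "fr": {"le", "chat", "dort", "ici"},
    "en": {"the", "dog", "sleeps", "here"},
}


def identify(tokens, word_lists):
    """Return the language whose word list covers the most tokens, or None."""
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in word_lists.items()
    }
    best = max(scores, key=scores.get)  # ties go to the first language listed
    return best if scores[best] > 0 else None


def label_tokens(tokens, word_lists, window=3):
    """Slide a window over the tokens and let each window vote for a language."""
    votes = [Counter() for _ in tokens]
    last_start = max(1, len(tokens) - window + 1)
    for start in range(last_start):
        span = range(start, min(start + window, len(tokens)))
        lang = identify([tokens[i] for i in span], word_lists)
        if lang is not None:
            for i in span:
                votes[i][lang] += 1
    # Each token takes the language most often assigned to windows covering it.
    return [v.most_common(1)[0][0] if v else None for v in votes]


def segments(tokens, labels):
    """Group consecutive tokens that share a label into monolingual segments."""
    pairs = zip(labels, tokens)
    return [
        (lang, [tok for _, tok in group])
        for lang, group in groupby(pairs, key=lambda p: p[0])
    ]


tokens = "le chat dort ici the dog sleeps here".split()
labels = label_tokens(tokens, WORD_LISTS)
result = segments(tokens, labels)
```

With these toy lists, `result` holds one French segment followed by one English segment. The window size trades boundary precision against robustness: larger windows give the identifier more evidence but blur the switch point.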

2021

L’identification de langue, un outil au service du corse et de l’évaluation des ressources linguistiques [Language identification, a tool for Corsican and for the evaluation of linguistic resources]
Laurent Kevers
Traitement Automatique des Langues, Volume 62, Numéro 3 : Diversité Linguistique [Linguistic Diversity in Natural Language Processing]

2020

Towards a Corsican Basic Language Resource Kit
Laurent Kevers | Stella Retali-Medori
Proceedings of the Twelfth Language Resources and Evaluation Conference

Natural language processing (NLP) resources and tools for Corsican are currently almost non-existent. Our inventory contains only a few digital resources, lexical or corpus databases, all of which require adaptation work. Our objective is to use the Banque de Données Langue Corse project (BDLC) to improve the availability of resources and tools for the Corsican language and, in the long term, provide a complete Basic Language Resource Kit (BLARK). We have defined a roadmap setting out the actions to be undertaken: collecting corpora and setting up a consultation interface (concordancer), a language detection tool, an electronic dictionary and a part-of-speech tagger. The first results on these topics have already been achieved and are presented in this article. Some elements are also available on our project page (http://bdlc.univ-corse.fr/tal/).

2019

Outiller une langue peu dotée grâce au TALN : l’exemple du corse et de la BDLC (Tooling up a less-resourced language with NLP : the example of Corsican and BDLC)
Laurent Kevers | Florian Guéniot | Aurelia Ghjacumina Tognotti | Stella Retali-Medori
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

Our research on the Corsican language naturally leads us to consider using tools for natural language processing. After a brief introduction to Corsican and to the project that forms our working framework, we survey the application of NLP to less-resourced languages, including Corsican. We then define the actions that can be undertaken, and how they can be integrated into our project, in order to progress towards building resources and tools for Corsican NLP.

2006

L’information biographique : modélisation, extraction et organisation en base de connaissances [Biographical information: modelling, extraction and organisation in a knowledge base]
Laurent Kevers
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues

Extracting and exploiting biographical data contained in news dispatches is a complex process. To approach it properly, a complete, precise and functional definition of this information is needed. Yet the difficulty encountered in the preliminary analysis of the extraction task lies in the absence of such a definition. We propose here conventions with the aim of developing one. The main concept used to express it is the structuring of information as (subject, relation, object) triples. The preliminary definition thus built is used during the information extraction step, which relies on finite-state transducers. It also suggests an implementation solution for organising the extracted data into a knowledge base.
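The (subject, relation, object) structuring mentioned in this abstract can be sketched as follows. This is only an illustration: regular expressions are swapped in for the finite-state transducers used in the paper, and the patterns, relation names, and sample text are hypothetical.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


# A capitalised name made of one or more words (a deliberately naive assumption).
NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"

# Hypothetical surface patterns, each mapped to a relation label.
PATTERNS = [
    (re.compile(rf"(?P<s>{NAME}) was born in (?P<o>{NAME})"), "born_in"),
    (re.compile(rf"(?P<s>{NAME}) works for (?P<o>{NAME})"), "works_for"),
]


def extract(text):
    """Return every triple matched by one of the patterns, pattern by pattern."""
    triples = []
    for pattern, relation in PATTERNS:
        for match in pattern.finditer(text):
            triples.append(Triple(match.group("s"), relation, match.group("o")))
    return triples


def build_knowledge_base(triples):
    """Index the triples by subject so facts about one person sit together."""
    kb = defaultdict(list)
    for t in triples:
        kb[t.subject].append((t.relation, t.obj))
    return dict(kb)


text = "Marie Curie was born in Warsaw. Marie Curie works for Sorbonne."
triples = extract(text)
kb = build_knowledge_base(triples)
```

Indexing by subject is one simple way to organise extracted facts; a fuller knowledge base would also need to resolve name variants that refer to the same person.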