2020
Building a Universal Dependencies Treebank for Occitan
Aleksandra Miletic | Myriam Bras | Marianne Vergez-Couret | Louise Esher | Clamença Poujade | Jean Sibille
Proceedings of the 12th Language Resources and Evaluation Conference
This paper outlines the ongoing effort to create the first treebank for Occitan, a low-resourced regional language spoken mainly in the south of France. We briefly present the global context of the project and report on its current status. We adopt the Universal Dependencies framework for this project. Our methodology is based on two main principles. First, to guarantee annotation quality, we use the agile annotation approach. Second, we rely on pre-processing with existing tools (taggers and parsers) to facilitate the work of human annotators, mainly through a delexicalized cross-lingual parsing approach. We present the results available at this point (annotation guidelines and a sub-corpus annotated with PoS tags and lemmas) and give the timeline for the rest of the work.
A Four-Dialect Treebank for Occitan: Building Process and Parsing Experiments
Aleksandra Miletic | Myriam Bras | Marianne Vergez-Couret | Louise Esher | Clamença Poujade | Jean Sibille
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
Occitan is a Romance language spoken mainly in the south of France. It has no official status in the country, is not standardized, and displays substantial diatopic variation, resulting in a rich system of dialects. Recently, a first treebank for this language was created. However, this corpus is based exclusively on texts in the Lengadocian dialect. Our paper describes the work aimed at extending the existing corpus with content in three new dialects, namely Gascon, Provençau and Lemosin. We describe both the annotation of initial content in these new varieties of Occitan and experiments allowing us to identify the most efficient method for further enrichment of the corpus. We observe that parsing models trained on Occitan dialects achieve better results than a delexicalized model trained on other Romance languages, even though the latter's training corpus is much larger (900K vs 20K tokens). The results of the native Occitan models show a substantial impact of cross-dialectal lexical variation, whereas syntactic variation seems to affect the systems less. We hope that the resulting corpus, incorporating several Occitan varieties, will facilitate the training of robust NLP tools capable of processing all kinds of Occitan texts.
2019
Building a treebank for Occitan: what use for Romance UD corpora?
Aleksandra Miletic | Myriam Bras | Louise Esher | Jean Sibille | Marianne Vergez-Couret
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)
Transformation d’annotations en parties du discours et lemmes vers le format Universal Dependencies : étude de cas pour l’alsacien et l’occitan (Converting POS-tag and Lemma Annotations into the Universal Dependencies Format: A Case Study on Alsatian and Occitan)
Aleksandra Miletić | Delphine Bernhard | Myriam Bras | Anne-Laure Ligozat | Marianne Vergez-Couret
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts
This paper reports on the experience of converting annotated corpora for Alsatian and Occitan into the CoNLL-U format defined by the Universal Dependencies project. It focuses in particular on several points that require care, concerning tokenization and the definition of categories for annotation.
2018
Corpora with Part-of-Speech Annotations for Three Regional Languages of France: Alsatian, Occitan and Picard
Delphine Bernhard | Anne-Laure Ligozat | Fanny Martin | Myriam Bras | Pierre Magistry | Marianne Vergez-Couret | Lucie Steiblé | Pascale Erhart | Nabil Hathout | Dominique Huck | Christophe Rey | Philippe Reynés | Sophie Rosset | Jean Sibille | Thomas Lavergne
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2014
Pos-tagging different varieties of Occitan with single-dialect resources
Marianne Vergez-Couret | Assaf Urieli
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects
2012
Exploiting naive vs expert discourse annotations: an experiment using lexical cohesion to predict Elaboration / Entity-Elaboration confusions
Clémentine Adam | Marianne Vergez-Couret
Proceedings of the Sixth Linguistic Annotation Workshop
An empirical resource for discovering cognitive principles of discourse organisation: the ANNODIS corpus
Stergos Afantenos | Nicholas Asher | Farah Benamara | Myriam Bras | Cécile Fabre | Mai Ho-dac | Anne Le Draoulec | Philippe Muller | Marie-Paule Péry-Woodley | Laurent Prévot | Josette Rebeyrolles | Ludovic Tanguy | Marianne Vergez-Couret | Laure Vieu
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper describes the ANNODIS resource, a discourse-level annotated corpus for French. The corpus combines two perspectives on discourse: a bottom-up approach and a top-down approach. The bottom-up view incrementally builds a structure from elementary discourse units, while the top-down view focuses on the selective annotation of multi-level discourse structures. The corpus is composed of texts that are diversified with respect to genre, length and type of discursive organisation. The methodology followed here involves an iterative design of annotation guidelines in order to reach satisfactory inter-annotator agreement levels. This allows us to raise a few issues relevant for the comparison of such complex objects as discourse structures. The corpus also serves as a source of empirical evidence for discourse theories. We present here two initial analyses that take advantage of this new annotated corpus: one that tested hypotheses on constraints governing discourse structure, and another that studied the variations in composition and signalling of multi-level discourse structures.
2009
ANNODIS: une approche outillée de l’annotation de structures discursives
Marie-Paule Péry-Woodley | Nicholas Asher | Patrice Enjalbert | Farah Benamara | Myriam Bras | Cécile Fabre | Stéphane Ferrari | Lydia-Mai Ho-Dac | Anne Le Draoulec | Yann Mathet | Philippe Muller | Laurent Prévot | Josette Rebeyrolle | Ludovic Tanguy | Marianne Vergez-Couret | Laure Vieu | Antoine Widlöcher
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts
The ANNODIS project aims to build a corpus of texts annotated at the discourse level and to develop tools for annotating and exploiting corpora. The annotations adopt two complementary points of view: a bottom-up perspective starts from minimal discourse units and builds complex structures via a set of discourse relations; a top-down perspective approaches the text as a whole and relies on pre-identified cues to detect high-level discourse structures. Corpus construction is paired with the creation of two interfaces: the first assists the manual annotation of discourse relations and structures by displaying the markup produced by pre-processing; the second will be dedicated to exploiting the annotations. We present the annotation models and protocols developed to carry out the annotation campaign through the dedicated interface.