2024
pdf
abs
Emotags: Computer-Assisted Verbal Labelling of Expressive Audiovisual Utterances for Expressive Multimodal TTS
Gérard Bailly
|
Romain Legrand
|
Martin Lenglet
|
Frédéric Elisei
|
Maëva Hueber
|
Olivier Perrotin
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We developed a web app for ascribing verbal descriptions to expressive audiovisual utterances. These descriptions are limited to lists of adjectives that are either suggested via navigation in emotional latent spaces built using discriminant analysis of BERT embeddings, or entered freely by subjects. We show that such verbal descriptions, collected online via Prolific at scale (310 participants, 12,620 labelled utterances to date), provide Expressive Multimodal Text-to-Speech Synthesis with precise verbal control over the desired emotional content.
pdf
abs
Entraînement de la coordination respiration-parole en apprentissage de la lecture assistée par ordinateur (Training Speech-Breathing Coordination in Computer-Assisted Reading Tutoring) [in French]
Delphine Charuau
|
Andrea Briglia
|
Erika Godde
|
Gérard Bailly
Actes des 35èmes Journées d'Études sur la Parole
This study aims, on the one hand, to identify the respiratory cues that can be regarded as a signature of improved fluency and, on the other hand, to examine the effects of computer-assisted reading training on the development of speech-breathing coordination. 66 pupils (CE2-CM2) were divided into three groups according to the training mode they followed: control, training with word-by-word highlighting, and training with breath-group highlighting. All were recorded before (pre-test) and after three weeks of assisted reading training (post-test) while reading one trained text and one untrained text. The results indicate that respiratory planning and pause management improve on a trained text. However, these improvements do not transfer significantly to the untrained text.
2022
pdf
abs
Automatic Verbal Depiction of a Brick Assembly for a Robot Instructing Humans
Rami Younes
|
Gérard Bailly
|
Frederic Elisei
|
Damien Pellier
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue
Verbal and nonverbal communication skills are essential for human-robot interaction, in particular when the agents are involved in a shared task. We address the specific situation in which the robot is the only agent that knows the plan and the goal of the task and has to instruct its human partner. The case study is a brick assembly. We describe a multi-layered verbal depictor whose semantic, syntactic and lexical settings have been collected and evaluated via crowdsourcing. One crowdsourced experiment involves a robot-instructed pick-and-place task. We show that implicitly referring to achieved subgoals (stairs, pillows, etc.) increases the performance of human partners.
2020
pdf
abs
Predicting Multidimensional Subjective Ratings of Children's Readings from the Speech Signals for the Automatic Assessment of Fluency
Gérard Bailly
|
Erika Godde
|
Anne-Laure Piat-Marchand
|
Marie-Line Bosse
Proceedings of the Twelfth Language Resources and Evaluation Conference
The objective of this research is to estimate multidimensional subjective ratings of the reading performance of young readers from signal-based objective measures. We combine linguistic features (number of correct words, repetitions, deletions and insertions uttered per minute, etc.) with phonetic features. Expressivity is particularly difficult to predict since there is no unique gold standard. We propose a novel framework for performing such an estimation that exploits multiple reference readings performed by adults, and demonstrate its efficiency using recordings of 273 pupils.
2012
pdf
Vizart3D : Retour Articulatoire Visuel pour l’Aide à la Prononciation (Vizart3D: Visual Articulatory Feedback for Computer-Assisted Pronunciation Training) [in French]
Thomas Hueber
|
Atef Ben-Youssef
|
Pierre Badin
|
Gérard Bailly
|
Frédéric Eliséi
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 5: Software Demonstrations
2006
pdf
abs
Does a Virtual Talking Face Generate Proper Multimodal Cues to Draw User’s Attention to Points of Interest?
Stephan Raidt
|
Gérard Bailly
|
Frederic Elisei
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
We present a series of experiments investigating face-to-face interaction between an Embodied Conversational Agent (ECA) and a human interlocutor. The ECA is embodied by a video-realistic talking head with independent head and eye movements. For a beneficial application in face-to-face interaction, the ECA should be able to derive meaning from the communicative gestures of a human interlocutor and, likewise, to reproduce such gestures. By conveying its capability to interpret human behaviour, the system encourages the interlocutor to show appropriate natural activity. It is therefore important that the ECA knows how to display what would correspond to mental states in humans. This allows the machine processes of the system to be interpreted in terms of human expressiveness and to be assigned a corresponding meaning, so that the system may maintain an interaction based on human patterns. In a first experiment, we investigated the ability of our talking head to direct user attention with facial deictic cues (Raidt, Bailly et al. 2005). Users interacted with the ECA during a simple card game offering different levels of help and guidance through facial deictic cues. We analyzed the users' performance and their perception of the quality of the assistance given by the ECA. The experiment showed that users profit from its presence and its facial deictic cues. In the continuing series of experiments presented here, we investigated the effect of enhancing the multimodality of the deictic gestures by adding a spoken instruction.
pdf
abs
A joint intelligibility evaluation of French text-to-speech synthesis systems: the EvaSy SUS/ACR campaign
Philippe Boula de Mareüil
|
Christophe d’Alessandro
|
Alexander Raake
|
Gérard Bailly
|
Marie-Neige Garcia
|
Michel Morel
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
The EVALDA/EvaSy project is dedicated to the evaluation of text-to-speech synthesis systems for the French language. It is subdivided into four components: evaluation of the grapheme-to-phoneme conversion module (Boula de Mareüil et al., 2005), evaluation of prosody (Garcia et al., 2006), evaluation of intelligibility, and global evaluation of the quality of the synthesised speech. This paper reports on the key results of the intelligibility and global evaluation of the synthesised speech. It focuses on intelligibility, assessed on the basis of semantically unpredictable sentences, but a comparison with absolute category rating in terms of e.g. pleasantness and naturalness is also provided. Three diphone systems and three selection systems have been evaluated. It turns out that the most intelligible system (diphone-based) is far from being the one which obtains the best mean opinion score.
pdf
abs
A joint prosody evaluation of French text-to-speech synthesis systems
Marie-Neige Garcia
|
Christophe d’Alessandro
|
Gérard Bailly
|
Philippe Boula de Mareüil
|
Michel Morel
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper reports on prosodic evaluation in the framework of the EVALDA/EvaSy project for text-to-speech (TTS) evaluation for the French language. Prosody is evaluated using a prosodic transplantation paradigm: intonation contours generated by the synthesis systems are transplanted onto a common segmental content. Both diphone-based synthesis and natural speech are used. Five TTS systems are tested along with natural voice. The test is a paired preference test (with 19 subjects) using 7 sentences. The results indicate that natural speech consistently obtains the first rank (with an average preference rate of 80%), followed by a selection-based system (72%) and a diphone-based system (58%). However, rather large variations in judgements are observed among subjects and sentences, and in some cases synthetic speech is preferred to natural speech. These results show the remarkable improvement achieved by the best selection-based synthesis systems in terms of prosody. In this way, a new paradigm for evaluating the prosodic component of TTS systems has been successfully demonstrated.
2004
pdf
Evaluation of a Speech Cuer: From Motion Capture to a Concatenative Text-to-cued Speech System
Guillaume Gibert
|
Gérard Bailly
|
Frédéric Eliséi
|
Denis Beautemps
|
Rémi Brun
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
2000
pdf
bib
The Cost258 Signal Generation Test Array
Gérard Bailly
|
Eduardo R. Banga
|
Alex Monaghan
|
Erhard Rank
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)