Gérard Bailly
This study aims, on the one hand, to identify the respiratory cues that can be considered the signature of improved fluency and, on the other hand, to examine the effects of computer-assisted reading training on the development of breathing/speech coordination. Sixty-six pupils (CE2-CM2) were divided into three groups according to the training mode they followed: control, training with word-by-word highlighting, and training with breath-group highlighting. All were recorded before (pre-test) and after (post-test) three weeks of assisted reading training while reading one trained text and one untrained text. The results indicate that respiratory planning and pause management improve on the trained text. However, there is no significant transfer of these improvements to the untrained text.
We developed a web app for ascribing verbal descriptions to expressive audiovisual utterances. These descriptions are limited to lists of adjectives that are either suggested via navigation in emotional latent spaces built using discriminant analysis of BERT embeddings, or entered freely by subjects. We show that such verbal descriptions, collected online via Prolific on a large dataset (310 participants and 12620 labelled utterances to date), provide Expressive Multimodal Text-to-Speech Synthesis with precise verbal control over the desired emotional content.
Verbal and nonverbal communication skills are essential for human-robot interaction, in particular when the agents are involved in a shared task. We address the specific situation in which the robot is the only agent that knows the plan and the goal of the task and has to instruct the human partner. The case study is a brick assembly. We describe a multi-layered verbal depictor whose semantic, syntactic and lexical settings have been collected and evaluated via crowdsourcing. One crowdsourced experiment involves a robot-instructed pick-and-place task. We show that implicitly referring to achieved subgoals (stairs, pillows, etc.) increases the performance of human partners.
The objective of this research is to estimate multidimensional subjective ratings of the reading performance of young readers from signal-based objective measures. We combine linguistic features (number of correct words, repetitions, deletions, insertions uttered per minute, etc.) with phonetic features. Expressivity is particularly difficult to predict since there is no unique gold standard. We propose a novel framework for performing such an estimation that exploits multiple reference readings performed by adults, and demonstrate its efficiency using recordings of 273 pupils.
We present a series of experiments investigating face-to-face interaction between an Embodied Conversational Agent (ECA) and a human interlocutor. The ECA is embodied by a video-realistic talking head with independent head and eye movements. To be useful in face-to-face interaction, the ECA should be able to derive meaning from the communicative gestures of a human interlocutor, and likewise to reproduce such gestures. By conveying its capability to interpret human behaviour, the system encourages the interlocutor to show appropriate natural activity. It is therefore important that the ECA knows how to display what would correspond to mental states in humans. This makes it possible to interpret the machine processes of the system in terms of human expressiveness and to assign them a corresponding meaning. The system may thus maintain an interaction based on human patterns. In a first experiment we investigated the ability of our talking head to direct user attention with facial deictic cues (Raidt, Bailly et al. 2005). Users interacted with the ECA during a simple card game offering different levels of help and guidance through facial deictic cues. We analyzed the users' performance and their perception of the quality of assistance given by the ECA. The experiment showed that users profit from its presence and its facial deictic cues. In the follow-up series of experiments presented here, we investigated the effect of enhancing the multimodality of the deictic gestures by adding a spoken instruction.
The EVALDA/EvaSy project is dedicated to the evaluation of text-to-speech synthesis systems for the French language. It is subdivided into four components: evaluation of the grapheme-to-phoneme conversion module (Boula de Mareüil et al., 2005), evaluation of prosody (Garcia et al., 2006), evaluation of intelligibility, and global evaluation of the quality of the synthesised speech. This paper reports on the key results of the intelligibility and global evaluation of the synthesised speech. It focuses on intelligibility, assessed on the basis of semantically unpredictable sentences, but a comparison with absolute category rating in terms of e.g. pleasantness and naturalness is also provided. Three diphone systems and three selection systems have been evaluated. It turns out that the most intelligible system (diphone-based) is far from being the one which obtains the best mean opinion score.
This paper reports on prosodic evaluation in the framework of the EVALDA/EvaSy project for text-to-speech (TTS) evaluation for the French language. Prosody is evaluated using a prosodic transplantation paradigm. Intonation contours generated by the synthesis systems are transplanted onto a common segmental content. Both diphone-based synthesis and natural speech are used. Five TTS systems are tested along with natural voice. The test is a paired preference test (with 19 subjects), using 7 sentences. The results indicate that natural speech consistently obtains the first rank (with an average preference rate of 80%), followed by a selection-based system (72%) and a diphone-based system (58%). However, rather large variations in judgements are observed among subjects and sentences, and in some cases synthetic speech is preferred to natural speech. These results show the remarkable improvement achieved by the best selection-based synthesis systems in terms of prosody. In this way, a new paradigm for evaluation of the prosodic component of TTS systems has been successfully demonstrated.