Spoken language understanding research to date has generally carried a heavy text perspective. Most datasets are derived from text, which is then subsequently synthesized into speech, and most models typically rely on automatic transcriptions of speech. This is to the detriment of prosody–additional information carried by the speech signal beyond the phonetics of the words themselves and difficult to recover from text alone. In this work, we investigate the role of prosody in Spoken Question Answering. By isolating prosodic and lexical information on the SLUE-SQA-5 dataset, which consists of natural speech, we demonstrate that models trained on prosodic information alone can perform reasonably well by utilizing prosodic cues. However, we find that when lexical information is available, models tend to predominantly rely on it. Our findings suggest that while prosodic cues provide valuable supplementary information, more effective integration methods are required to ensure prosody contributes more significantly alongside lexical features.
We introduce EmphAssess, a prosodic benchmark designed to evaluate the capability of speech-to-speech models to encode and reproduce prosodic emphasis. We apply this to two tasks: speech resynthesis and speech-to-speech translation. In both cases, the benchmark evaluates the ability of the model to encode emphasis in the speech input and accurately reproduce it in the output, potentially across a change of speaker and language. As part of the evaluation pipeline, we introduce EmphaClass, a new model that classifies emphasis at the frame or word level.
Dans ce papier, nous présentons la participation de Qwant Research aux tâches 2 et 3 de l’édition 2019 du défi fouille de textes (DEFT) portant sur l’analyse de documents cliniques rédigés en français. La tâche 2 est une tâche de similarité sémantique qui demande d’apparier cas cliniques et discussions médicales. Pour résoudre cette tâche, nous proposons une approche reposant sur des modèles de langue et évaluons l’impact de différents pré-traitements et de différentes techniques d’appariement sur les résultats. Pour la tâche 3, nous avons développé un système d’extraction d’information qui produit des résultats encourageants en termes de précision. Nous avons expérimenté deux approches différentes, l’une se fondant exclusivement sur l’utilisation de réseaux de neurones pour traiter la tâche, l’autre reposant sur l’exploitation des informations linguistiques issues d’une analyse syntaxique.