Olivier Boëffard
Also published as:
Olivier Boeffard
We explore the evaluation of the automatic audio scene description task through an indirect approach based on question answering over audio documents. In the absence of robust automatic evaluation metrics for automatic audio scene description, we rely on the MMAU benchmark, a set of multiple-choice questions over varied audio clips. We introduce a cascade architecture that outperforms some reference models of comparable size. However, our results highlight limitations of the MMAU benchmark, notably a textual bias and a limited ability to evaluate the joint integration of speech and sound event information. We suggest directions for improvement to make future evaluations more faithful to the challenges of the automatic audio scene description task.
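To make the cascade idea concrete, here is a minimal Python sketch of such a pipeline; the captioner and llm callables are hypothetical placeholders, not the models used in the paper:

    # Sketch of a cascade for audio MCQ answering (hypothetical components):
    # an audio captioner produces a text description, then a text-only LLM
    # answers the multiple-choice question from that description alone.
    def cascade_answer(audio, question, choices, captioner, llm):
        """Answer an MCQ about an audio clip via a text intermediate."""
        caption = captioner(audio)  # e.g. "a dog barks while a man speaks"
        prompt = (
            f"Audio description: {caption}\n"
            f"Question: {question}\n"
            + "\n".join(f"{i}. {c}" for i, c in enumerate(choices))
            + "\nAnswer with the number of the correct choice."
        )
        return llm(prompt)

Because only the caption reaches the language model, such a cascade is exactly where a textual bias in the benchmark would go unnoticed: a strong text model can compensate for a weak audio front end.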
Set covering algorithms are efficient tools for optimal linguistic corpus reduction. The optimality of such a process is directly related to the descriptive features of the sentences of a reference corpus. This article experimentally compares the behaviour of three algorithms: a greedy approach, a Lagrangian relaxation based one that gives importance to rare events, and a third one that considers the Kullback-Leibler divergence between a reference distribution and the ongoing distribution of events. The analysis of the content of the reduced corpora shows that the first two approaches remain the most effective at compressing a corpus while guaranteeing a minimal content. As expected, the variant which minimises the Kullback-Leibler divergence guarantees a distribution of events close to the reference distribution; however, the price for this solution is a much larger corpus. In the proposed experiments, we also evaluated a mixed approach that adds a random complement to the smallest coverings.
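For illustration, a minimal Python sketch of the greedy variant (not the paper's implementation): at each step, pick the sentence that covers the most still-uncovered events per second of speech. Durations are assumed positive.

    # Greedy set covering for corpus reduction (illustrative sketch).
    def greedy_cover(sentences):
        """sentences: list of (duration, set_of_events).
        Returns the indices of the selected sentences."""
        uncovered = set().union(*(ev for _, ev in sentences))
        selected = []
        while uncovered:
            # score = newly covered events per unit of duration
            best = max(
                (i for i in range(len(sentences)) if i not in selected),
                key=lambda i: len(sentences[i][1] & uncovered) / sentences[i][0],
            )
            gain = sentences[best][1] & uncovered
            if not gain:
                break  # remaining events cannot be covered
            selected.append(best)
            uncovered -= gain
        return selected

    # Toy usage with invented sentences (duration in seconds, event sets):
    sents = [(2.0, {"a-b", "b-a"}), (1.0, {"b-a"}), (3.0, {"a-b", "c-a", "b-a"})]
    print(greedy_cover(sents))  # [0, 2]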
Building speech corpora is a first and crucial step for every text-to-speech synthesis system. Nowadays, the use of statistical models implies very large corpora that need to be recorded, transcribed, annotated and segmented to be usable. The variety of corpora required by recent applications (content, style, etc.) makes the use of existing digital audio resources very attractive. Among the available resources, audiobooks are particularly interesting given their quality. In this framework, we propose a complete acquisition, segmentation and annotation chain for audiobooks that aims to be fully automatic. The proposed process relies on a data structure, Roots, that establishes the relations between the different annotation levels, each represented as a sequence of items. This methodology has been applied successfully to 11 hours of speech extracted from an audiobook. A manual check on a part of the corpus shows the efficiency of the process.
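As a rough illustration of the idea behind such a structure (a hypothetical simplification, not the Roots library's actual API), each annotation level is a sequence of items and relations link items across levels:

    # Two annotation levels as sequences of items, plus a cross-level relation.
    words  = ["hello", "world"]
    phones = ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]

    # relation: word index -> indices of its phones
    word_to_phones = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}

    def phones_of(word_index):
        """Follow the relation from a word down to its phone items."""
        return [phones[i] for i in word_to_phones[word_index]]

    print(phones_of(0))  # ['HH', 'AH', 'L', 'OW']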
In order to improve the flexibility and the precision of an automatic phone segmentation system for a type of expressive speech, the dubbing of fiction movies into French, we developed both the phonetic labelling process and the alignment process. The automatic labelling system relies on an automatic grapheme-to-phoneme conversion that includes all the variants of the phonetic chain, and on HMM modelling. In this article, we distinguish three sets of phone models: a set of context-independent models, a set of left and right context-dependent models, and finally a mixed set that combines phone and triphone models according to the alignment precision obtained for each phonetic broad class. The three model sets are evaluated on a test corpus. On the one hand, we notice a slight decrease in the phonetic labelling score, mainly due to pause insertions; on the other hand, the mixed set of models gives the best alignment precision.
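A minimal sketch of how such a mixed model set could be assembled (hypothetical data structures; the precision figures in the usage example are invented for illustration): for each broad class, keep whichever model family aligned more precisely on a development set.

    # Choose, per phonetic broad class, between context-independent phone
    # models and triphone models, based on measured alignment precision.
    def build_mixed_set(precision_phone, precision_triphone):
        """Both args: dict broad_class -> alignment precision
        (e.g. fraction of boundaries within 20 ms).
        Returns broad_class -> chosen model type."""
        return {
            cls: "triphone" if precision_triphone[cls] > precision_phone[cls]
            else "phone"
            for cls in precision_phone
        }

    mixed = build_mixed_set(
        {"vowel": 0.91, "plosive": 0.84},
        {"vowel": 0.89, "plosive": 0.88},
    )
    print(mixed)  # {'vowel': 'phone', 'plosive': 'triphone'}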
In this article, we propose a web-based listening test system that can be used with a large range of listeners. Our main goals were to make the configuration of the tests as simple and flexible as possible, to simplify the recruitment of participants and, of course, to keep track of the results in a relational database. This first version of our system can run the listening tests most widely used in the speech processing community (AB-BA, ABX and MOS tests). It can also easily evolve and offer other tests implemented by the tester through a module interface. This scenario is explored in this article, which proposes an implementation of a module for Comparison Mean Opinion Score (CMOS) tests and the conduct of such an experiment. This test allowed us to extract from the BREF120 corpus a pair of voices with distinct supra-segmental characteristics. The system is offered to the speech synthesis and speech conversion community under a free license.
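A sketch of what such a module interface might look like in Python (hypothetical; the system's actual interface is not specified in the abstract): each test type plugs in by implementing a common contract, here shown with a CMOS module.

    from abc import ABC, abstractmethod

    class ListeningTestModule(ABC):
        """Contract every pluggable test type implements."""

        @abstractmethod
        def present_trial(self, trial):
            """Return the stimuli and the question shown to the listener."""

        @abstractmethod
        def record_answer(self, trial, answer):
            """Validate and store one listener's answer for this trial."""

    class CMOSModule(ListeningTestModule):
        SCALE = range(-3, 4)  # -3 (much worse) .. +3 (much better)

        def present_trial(self, trial):
            return {"stimuli": [trial["ref"], trial["test"]],
                    "question": "Rate the second sample against the first."}

        def record_answer(self, trial, answer):
            # trial is assumed to carry an "answers" list for accumulation
            assert int(answer) in self.SCALE
            trial["answers"].append(int(answer))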
This article addresses the problem of the linguistic content of a speech corpus. Depending on the target task, the phonological and linguistic content of the corpus is controlled by collecting a set of sentences that covers a preset description of phonological attributes under the constraint of an overall duration as small as possible. This goal is classically achieved by greedy algorithms, which however do not guarantee the optimality of the resulting cover. In recent work, a Lagrangian-based algorithm, called LamSCP, has been used to extract coverings of diphonemes from a large French corpus, giving better results than a greedy algorithm. We continue comparing both algorithms in terms of shortest duration, stability and robustness by computing multi-represented diphoneme or triphoneme coverings from a corpus in English, which correspond to very large scale optimization problems. In each experiment, LamSCP improves on the greedy results by 3.9 to 9.7 percent.
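A standard integer-programming formulation of multi-represented covering, written in our own notation for clarity (consistent with the abstract, not taken from the paper):

    \min_{x \in \{0,1\}^n} \; \sum_{j=1}^{n} d_j x_j
    \quad \text{s.t.} \quad \sum_{j=1}^{n} a_{ij} x_j \ge k_i \quad \forall i

where x_j = 1 if sentence j is kept, d_j is its duration, a_{ij} is the number of occurrences of unit i (diphoneme or triphoneme) in sentence j, and k_i is the required number of representatives of unit i (k_i = 1 gives a plain covering). A Lagrangian approach such as LamSCP relaxes the coverage constraints with multipliers, whereas a greedy algorithm selects sentences one at a time.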
This article presents an evaluation of statistical language models carried out on French. We sought to compare the performance of exotic language models against the more classical fixed-horizon n-gram models. The experiments show that variable-horizon n-gram models can reduce the perplexity of a fixed-horizon n-gram model by more than 10% on average. The n/m-multigram models require adaptation to be competitive.
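For reference, the standard definition of perplexity used in such comparisons, as a short Python sketch (not code from the paper): the geometric mean of the inverse probabilities the model assigns to the test words.

    import math

    def perplexity(word_probs):
        """word_probs: p(w_t | history) for each word of the test text."""
        log_sum = sum(math.log(p) for p in word_probs)
        return math.exp(-log_sum / len(word_probs))

    print(perplexity([0.1, 0.2, 0.05]))  # 10.0

A lower perplexity means the model is, on average, less surprised by the test text, which is why a 10% reduction is a meaningful gain.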