Emmanuel Dupoux


2022

textless-lib: a Library for Textless Spoken Language Processing
Eugene Kharitonov | Jade Copet | Kushal Lakhotia | Tu Anh Nguyen | Paden Tomasello | Ann Lee | Ali Elkahky | Wei-Ning Hsu | Abdelrahman Mohamed | Emmanuel Dupoux | Yossi Adi
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations

Textless spoken language processing is an exciting area of research that promises to extend the applicability of the standard NLP toolset to spoken language and to languages with few or no textual resources. Here, we introduce textless-lib, a PyTorch-based library aimed at facilitating research in the area. We describe the building blocks that the library provides and demonstrate its usability by discussing three different use-case examples: (i) speaker probing, (ii) speech resynthesis and compression, and (iii) speech continuation. We believe that textless-lib substantially simplifies research in the textless setting and will be handy not only for speech researchers but also for the NLP community at large.

A comparison study on patient-psychologist voice diarization
Rachid Riad | Hadrien Titeux | Laurie Lemoine | Justine Montillot | Agnes Sliwinski | Jennifer Bagnou | Xuan Cao | Anne-Catherine Bachoud-Levi | Emmanuel Dupoux
Ninth Workshop on Speech and Language Processing for Assistive Technologies (SLPAT-2022)

Conversations between a clinician and a patient, in natural conditions, are valuable sources of information for medical follow-up. The automatic analysis of these dialogues could help extract new language markers and speed up the clinicians’ reports. Yet, it is not clear which model is the most efficient to detect and identify the speaker turns, especially for individuals with speech disorders. Here, we proposed a split of the data that allows conducting a comparative evaluation of different diarization methods. We designed and trained end-to-end neural network architectures to directly tackle this task from the raw signal and evaluate each approach under the same metric. We also studied the effect of fine-tuning models to find the best performance. Experimental results are reported on naturalistic clinical conversations between Psychologists and Interviewees, at different stages of Huntington’s disease, displaying a large panel of speech disorders. We found that our best end-to-end model achieved 19.5% IER on the test set, compared to 23.6% achieved by fine-tuning the X-vector architecture. Finally, we observed that we could extract clinical markers directly from the automatic systems, highlighting the clinical relevance of our methods.

Text-Free Prosody-Aware Generative Spoken Language Modeling
Eugene Kharitonov | Ann Lee | Adam Polyak | Yossi Adi | Jade Copet | Kushal Lakhotia | Tu Anh Nguyen | Morgane Riviere | Abdelrahman Mohamed | Emmanuel Dupoux | Wei-Ning Hsu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) (CITATION) is the only prior work addressing the generative aspect of speech pre-training, which builds a text-free language model using discovered units. Unfortunately, because the units used in GSLM discard most prosodic information, GSLM fails to leverage prosody for better comprehension and does not generate expressive speech. In this work, we present a prosody-aware generative spoken language model (pGSLM). It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms. Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt. Audio samples can be found at https://speechbot.github.io/pgslm. Code and models are available at https://github.com/pytorch/fairseq/tree/main/examples/textless_nlp/pgslm.

Textless Speech Emotion Conversion using Discrete & Decomposed Representations
Felix Kreuk | Adam Polyak | Jade Copet | Eugene Kharitonov | Tu Anh Nguyen | Morgan Rivière | Wei-Ning Hsu | Abdelrahman Mohamed | Emmanuel Dupoux | Yossi Adi
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion. First, we modify the speech content by translating the phonetic-content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows us to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc. We demonstrate objectively and subjectively that the proposed method is vastly superior to current approaches and even beats text-based systems in terms of perceived emotion and audio quality. We rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method. Samples are available under the following link: https://speechbot.github.io/emotion

DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon
Robin Algayres | Tristan Ricoul | Julien Karadayi | Hugo Laurençon | Salah Zaiem | Abdelrahman Mohamed | Benoît Sagot | Emmanuel Dupoux
Transactions of the Association for Computational Linguistics, Volume 10

Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a ‘space’ delimiter between words. Popular Bayesian non-parametric models for text segmentation (Goldwater et al., 2006, 2009) use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark.

2021

On Generative Spoken Language Modeling from Raw Audio
Kushal Lakhotia | Eugene Kharitonov | Wei-Ning Hsu | Yossi Adi | Adam Polyak | Benjamin Bolte | Tu-Anh Nguyen | Jade Copet | Alexei Baevski | Abdelrahman Mohamed | Emmanuel Dupoux
Transactions of the Association for Computational Linguistics, Volume 9

We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text) all trained without supervision and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.

Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
Jasmijn Bastings | Yonatan Belinkov | Emmanuel Dupoux | Mario Giulianelli | Dieuwke Hupkes | Yuval Pinter | Hassan Sajjad
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
Changhan Wang | Morgane Riviere | Ann Lee | Anne Wu | Chaitanya Talnikar | Daniel Haziza | Mary Williamson | Juan Pino | Emmanuel Dupoux
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We introduce VoxPopuli, a large-scale multilingual corpus providing 400K hours of unlabeled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 15 languages and their aligned oral interpretations into 15 target languages totaling 17.3K hours. We provide speech recognition (ASR) baselines and validate the versatility of VoxPopuli unlabeled data in semi-supervised ASR and speech-to-text translation under challenging out-of-domain settings. The corpus is available at https://github.com/facebookresearch/voxpopuli.

2020

“LazImpa”: Lazy and Impatient neural agents learn to communicate efficiently
Mathieu Rita | Rahma Chaabouni | Emmanuel Dupoux
Proceedings of the 24th Conference on Computational Natural Language Learning

Previous work has shown that artificial neural agents naturally develop surprisingly non-efficient codes. This is illustrated by the fact that in a referential game involving a speaker and a listener neural networks optimizing accurate transmission over a discrete channel, the emergent messages fail to achieve an optimal length. Furthermore, frequent messages tend to be longer than infrequent ones, a pattern contrary to the Zipf Law of Abbreviation (ZLA) observed in all natural languages. Here, we show that near-optimal and ZLA-compatible messages can emerge, but only if both the speaker and the listener are modified. We hence introduce a new communication system, “LazImpa”, where the speaker is made increasingly lazy, i.e., avoids long messages, and the listener impatient, i.e., seeks to guess the intended content as soon as possible.

Analogies minus analogy test: measuring regularities in word embeddings
Louis Fournier | Emmanuel Dupoux | Ewan Dunbar
Proceedings of the 24th Conference on Computational Natural Language Learning

Vector space models of words have long been claimed to capture linguistic regularities as simple vector translations, but problems have been raised with this claim. We decompose and empirically analyze the classic arithmetic word analogy test, to motivate two new metrics that address the issues with the standard test, and which distinguish between class-wise offset concentration (similar directions between pairs of words drawn from different broad classes, such as France-London, China-Ottawa,...) and pairing consistency (the existence of a regular transformation between correctly-matched pairs such as France:Paris::China:Beijing). We show that, while the standard analogy test is flawed, several popular word embeddings do nevertheless encode linguistic regularities.

Compositionality and Generalization In Emergent Languages
Rahma Chaabouni | Eugene Kharitonov | Diane Bouchacourt | Emmanuel Dupoux | Marco Baroni
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Natural language allows us to refer to novel composite concepts by combining expressions denoting their parts according to systematic rules, a property known as compositionality. In this paper, we study whether the language emerging in deep multi-agent simulations possesses a similar ability to refer to novel primitive combinations, and whether it accomplishes this feat by strategies akin to human-language compositionality. Equipped with new ways to measure compositionality in emergent languages inspired by disentanglement in representation learning, we establish three main results: First, given sufficiently large input spaces, the emergent language will naturally develop the ability to refer to novel composite concepts. Second, there is no correlation between the degree of compositionality of an emergent language and its ability to generalize. Third, while compositionality is not necessary for generalization, it provides an advantage in terms of language transmission: The more compositional a language is, the more easily it will be picked up by new learners, even when the latter differ in architecture from the original agents. We conclude that compositionality does not arise from simple generalization pressure, but if an emergent language does chance upon it, it will be more likely to survive and thrive.

Identification of Primary and Collateral Tracks in Stuttered Speech
Rachid Riad | Anne-Catherine Bachoud-Lévi | Frank Rudzicz | Emmanuel Dupoux
Proceedings of the Twelfth Language Resources and Evaluation Conference

Disfluent speech has previously been addressed from two main perspectives: the clinical perspective, focusing on diagnosis, and the Natural Language Processing (NLP) perspective, aiming at modeling these events and detecting them for downstream tasks. In addition, previous works often used different metrics depending on whether the input features are text or speech, making it difficult to compare the different contributions. Here, we introduce a new evaluation framework for disfluency detection inspired by the clinical and NLP perspectives together with the theory of performance from Clark (1996), which distinguishes between primary and collateral tracks. We introduce a novel forced-aligned disfluency dataset from a corpus of semi-directed interviews, and present baseline results directly comparing the performance of text-based features (word and span information) and speech-based features (acoustic-prosodic information). Finally, we introduce new audio features inspired by the word-based span features. We show experimentally that using these features outperformed the baselines for speech-based predictions on the present dataset.

Seshat: a Tool for Managing and Verifying Annotation Campaigns of Audio Data
Hadrien Titeux | Rachid Riad | Xuan-Nga Cao | Nicolas Hamilakis | Kris Madden | Alejandrina Cristia | Anne-Catherine Bachoud-Lévi | Emmanuel Dupoux
Proceedings of the Twelfth Language Resources and Evaluation Conference

We introduce Seshat, a new, simple and open-source software tool to efficiently manage annotations of speech corpora. The Seshat software allows users to easily customise and manage annotations of large audio corpora while ensuring compliance with the formatting and naming conventions of the annotated output files. In addition, it includes procedures for checking the content of annotations following specific rules that can be implemented in personalised parsers. Finally, we propose a double-annotation mode, for which Seshat automatically computes an associated inter-annotator agreement with the gamma measure, taking into account the categorisation and segmentation discrepancies.

2019

SyntaxFest 2019 Invited talk - Inductive biases and language emergence in communicative agents
Emmanuel Dupoux
Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019)

Word-order Biases in Deep-agent Emergent Communication
Rahma Chaabouni | Eugene Kharitonov | Alessandro Lazaric | Emmanuel Dupoux | Marco Baroni
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Sequence-processing neural networks led to remarkable progress on many NLP tasks. As a consequence, there has been increasing interest in understanding to what extent they process language as humans do. We aim here to uncover which biases such models display with respect to “natural” word-order constraints. We train models to communicate about paths in a simple gridworld, using miniature languages that reflect or violate various natural language trends, such as the tendency to avoid redundancy or to minimize long-distance dependencies. We study how the controlled characteristics of our miniature languages affect individual learning and their stability across multiple network generations. The results draw a mixed picture. On the one hand, neural networks show a strong tendency to avoid long-distance dependencies. On the other hand, there is no clear preference for the efficient, non-redundant encoding of information that is widely attested in natural language. We thus suggest inoculating a notion of “effort” into neural networks, as a possible way to make their linguistic behavior more human-like.

2018

BabyCloud, a Technological Platform for Parents and Researchers
Xuân-Nga Cao | Cyrille Dakhlia | Patricia Del Carmen | Mohamed-Amine Jaouani | Malik Ould-Arbi | Emmanuel Dupoux
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

The Role of Prosody and Speech Register in Word Segmentation: A Computational Modelling Perspective
Bogdan Ludusan | Reiko Mazuka | Mathieu Bernard | Alejandrina Cristia | Emmanuel Dupoux
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This study explores the role of speech register and prosody for the task of word segmentation. Since these two factors are thought to play an important role in early language acquisition, we aim to quantify their contribution for this task. We study a Japanese corpus containing both infant- and adult-directed speech and we apply four different word segmentation models, with and without knowledge of prosodic boundaries. The results showed that the difference between registers is smaller than previously reported and that prosodic boundary information helps more adult- than infant-directed speech.

Blind Phoneme Segmentation With Temporal Prediction Errors
Paul Michel | Okko Rasanen | Roland Thiollière | Emmanuel Dupoux
Proceedings of ACL 2017, Student Research Workshop

Comparing Character-level Neural Language Models Using a Lexical Decision Task
Gaël Le Godais | Tal Linzen | Emmanuel Dupoux
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

What is the information captured by neural network models of language? We address this question in the case of character-level recurrent neural language models. These models do not have explicit word representations; do they acquire implicit ones? We assess the lexical capacity of a network using the lexical decision task common in psycholinguistics: the system is required to decide whether or not a string of characters forms a word. We explore how accuracy on this task is affected by the architecture of the network, focusing on cell type (LSTM vs. SRN), depth and width. We also compare these architectural properties to a simple count of the parameters of the network. The overall number of parameters in the network turns out to be the most important predictor of accuracy; in particular, there is little evidence that deeper networks are beneficial for this task.

2016

Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies
Tal Linzen | Emmanuel Dupoux | Yoav Goldberg
Transactions of the Association for Computational Linguistics, Volume 4

The success of long short-term memory (LSTM) neural networks in language processing is typically attributed to their ability to capture long-distance statistical regularities. Linguistic regularities are often sensitive to syntactic structure; can such dependencies be captured by LSTMs, which do not have explicit structural representations? We begin addressing this question using number agreement in English subject-verb dependencies. We probe the architecture’s grammatical competence both using training objectives with an explicit grammatical target (number prediction, grammaticality judgments) and using language models. In the strongly supervised settings, the LSTM achieved very high overall accuracy (less than 1% errors), but errors increased when sequential and structural information conflicted. The frequency of such errors rose sharply in the language-modeling setting. We conclude that LSTMs can capture a non-trivial amount of grammatical structure given targeted supervision, but stronger architectures may be required to further reduce errors; furthermore, the language modeling signal is insufficient for capturing syntax-sensitive dependencies, and should be supplemented with more direct supervision if such dependencies need to be captured.

Quantificational features in distributional word representations
Tal Linzen | Emmanuel Dupoux | Benjamin Spector
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

2015

Sign constraints on feature weights improve a joint model of word segmentation and phonology
Mark Johnson | Joe Pater | Robert Staubs | Emmanuel Dupoux
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Prosodic boundary information helps unsupervised word segmentation
Bogdan Ludusan | Gabriel Synnaeve | Emmanuel Dupoux
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Motif discovery in infant- and adult-directed speech
Bogdan Ludusan | Amanda Seidl | Emmanuel Dupoux | Alex Cristia
Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning

2014

Unsupervised Word Segmentation in Context
Gabriel Synnaeve | Isabelle Dautriche | Benjamin Börschinger | Mark Johnson | Emmanuel Dupoux
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Modelling function words improves unsupervised word segmentation
Mark Johnson | Anne Christophe | Emmanuel Dupoux | Katherine Demuth
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Exploring the Relative Role of Bottom-up and Top-down Information in Phoneme Learning
Abdellah Fourtassi | Thomas Schatz | Balakrishnan Varadarajan | Emmanuel Dupoux
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

A Rudimentary Lexicon and Semantics Help Bootstrap Phoneme Acquisition
Abdellah Fourtassi | Emmanuel Dupoux
Proceedings of the Eighteenth Conference on Computational Natural Language Learning

Bridging the gap between speech technology and natural language processing: an evaluation toolbox for term discovery systems
Bogdan Ludusan | Maarten Versteegh | Aren Jansen | Guillaume Gravier | Xuan-Nga Cao | Mark Johnson | Emmanuel Dupoux
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The unsupervised discovery of linguistic terms from either continuous phoneme transcriptions or from raw speech has seen increasing interest in the past years from both a theoretical and a practical standpoint. Yet, there exists no commonly accepted evaluation method for the systems performing term discovery. Here, we propose such an evaluation toolbox, drawing ideas from both speech technology and natural language processing. We first transform the speech-based output into a symbolic representation and compute five types of evaluation metrics on this representation: the quality of acoustic matching, the quality of the clusters found, and the quality of the alignment with real words (type, token, and boundary scores). We tested our approach on two term discovery systems taking speech as input, and one using symbolic input. The latter was run using both the gold transcription and a transcription obtained from an automatic speech recognizer, in order to simulate the case when only imperfect symbolic information is available. The results obtained are analysed through the use of the proposed evaluation metrics and the implications of these metrics are discussed.

2013

Why is English so easy to segment?
Abdellah Fourtassi | Benjamin Börschinger | Mark Johnson | Emmanuel Dupoux
Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics (CMCL)

A corpus-based evaluation method for Distributional Semantic Models
Abdellah Fourtassi | Emmanuel Dupoux
51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop

2011

Testing the Robustness of Online Word Segmentation: Effects of Linguistic Diversity and Phonetic Variation
Luc Boruta | Sharon Peperkamp | Benoît Crabbé | Emmanuel Dupoux
Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics

2008

Unsupervised Learning of Acoustic Sub-word Units
Balakrishnan Varadarajan | Sanjeev Khudanpur | Emmanuel Dupoux
Proceedings of ACL-08: HLT, Short Papers
