Emmanuel Dupoux


2021

pdf bib
On Generative Spoken Language Modeling from Raw Audio
Kushal Lakhotia | Eugene Kharitonov | Wei-Ning Hsu | Yossi Adi | Adam Polyak | Benjamin Bolte | Tu-Anh Nguyen | Jade Copet | Alexei Baevski | Abdelrahman Mohamed | Emmanuel Dupoux
Transactions of the Association for Computational Linguistics, Volume 9

Abstract We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo- text), and a speech decoder (generating a waveform from pseudo-text) all trained without supervision and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder- dependent way, and that some combinations approach text-based systems.1

pdf bib
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
Jasmijn Bastings | Yonatan Belinkov | Emmanuel Dupoux | Mario Giulianelli | Dieuwke Hupkes | Yuval Pinter | Hassan Sajjad
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

pdf bib
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
Changhan Wang | Morgane Riviere | Ann Lee | Anne Wu | Chaitanya Talnikar | Daniel Haziza | Mary Williamson | Juan Pino | Emmanuel Dupoux
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We introduce VoxPopuli, a large-scale multilingual corpus providing 400K hours of unlabeled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 15 languages and their aligned oral interpretations into 15 target languages totaling 17.3K hours. We provide speech recognition (ASR) baselines and validate the versatility of VoxPopuli unlabeled data in semi-supervised ASR and speech-to-text translation under challenging out-of-domain settings. The corpus is available at https://github.com/facebookresearch/voxpopuli.

2020

pdf bib
Identification of Primary and Collateral Tracks in Stuttered Speech
Rachid Riad | Anne-Catherine Bachoud-Lévi | Frank Rudzicz | Emmanuel Dupoux
Proceedings of the 12th Language Resources and Evaluation Conference

Disfluent speech has been previously addressed from two main perspectives: the clinical perspective focusing on diagnostic, and the Natural Language Processing (NLP) perspective aiming at modeling these events and detect them for downstream tasks. In addition, previous works often used different metrics depending on whether the input features are text or speech, making it difficult to compare the different contributions. Here, we introduce a new evaluation framework for disfluency detection inspired by the clinical and NLP perspective together with the theory of performance from (Clark, 1996) which distinguishes between primary and collateral tracks. We introduce a novel forced-aligned disfluency dataset from a corpus of semi-directed interviews, and present baseline results directly comparing the performance of text-based features (word and span information) and speech-based (acoustic-prosodic information). Finally, we introduce new audio features inspired by the word-based span features. We show experimentally that using these features outperformed the baselines for speech-based predictions on the present dataset.

pdf bib
Seshat: a Tool for Managing and Verifying Annotation Campaigns of Audio Data
Hadrien Titeux | Rachid Riad | Xuan-Nga Cao | Nicolas Hamilakis | Kris Madden | Alejandrina Cristia | Anne-Catherine Bachoud-Lévi | Emmanuel Dupoux
Proceedings of the 12th Language Resources and Evaluation Conference

We introduce Seshat, a new, simple and open-source software to efficiently manage annotations of speech corpora. The Seshat software allows users to easily customise and manage annotations of large audio corpora while ensuring compliance with the formatting and naming conventions of the annotated output files. In addition, it includes procedures for checking the content of annotations following specific rules that can be implemented in personalised parsers. Finally, we propose a double-annotation mode, for which Seshat computes automatically an associated inter-annotator agreement with the gamma measure taking into account the categorisation and segmentation discrepancies.

pdf bib
LazImpa”: Lazy and Impatient neural agents learn to communicate efficiently
Mathieu Rita | Rahma Chaabouni | Emmanuel Dupoux
Proceedings of the 24th Conference on Computational Natural Language Learning

Previous work has shown that artificial neural agents naturally develop surprisingly non-efficient codes. This is illustrated by the fact that in a referential game involving a speaker and a listener neural networks optimizing accurate transmission over a discrete channel, the emergent messages fail to achieve an optimal length. Furthermore, frequent messages tend to be longer than infrequent ones, a pattern contrary to the Zipf Law of Abbreviation (ZLA) observed in all natural languages. Here, we show that near-optimal and ZLA-compatible messages can emerge, but only if both the speaker and the listener are modified. We hence introduce a new communication system, “LazImpa”, where the speaker is made increasingly lazy, i.e., avoids long messages, and the listener impatient, i.e., seeks to guess the intended content as soon as possible.

pdf bib
Analogies minus analogy test: measuring regularities in word embeddings
Louis Fournier | Emmanuel Dupoux | Ewan Dunbar
Proceedings of the 24th Conference on Computational Natural Language Learning

Vector space models of words have long been claimed to capture linguistic regularities as simple vector translations, but problems have been raised with this claim. We decompose and empirically analyze the classic arithmetic word analogy test, to motivate two new metrics that address the issues with the standard test, and which distinguish between class-wise offset concentration (similar directions between pairs of words drawn from different broad classes, such as France-London, China-Ottawa,...) and pairing consistency (the existence of a regular transformation between correctly-matched pairs such as France:Paris::China:Beijing). We show that, while the standard analogy test is flawed, several popular word embeddings do nevertheless encode linguistic regularities.

pdf bib
Compositionality and Generalization In Emergent Languages
Rahma Chaabouni | Eugene Kharitonov | Diane Bouchacourt | Emmanuel Dupoux | Marco Baroni
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Natural language allows us to refer to novel composite concepts by combining expressions denoting their parts according to systematic rules, a property known as compositionality. In this paper, we study whether the language emerging in deep multi-agent simulations possesses a similar ability to refer to novel primitive combinations, and whether it accomplishes this feat by strategies akin to human-language compositionality. Equipped with new ways to measure compositionality in emergent languages inspired by disentanglement in representation learning, we establish three main results: First, given sufficiently large input spaces, the emergent language will naturally develop the ability to refer to novel composite concepts. Second, there is no correlation between the degree of compositionality of an emergent language and its ability to generalize. Third, while compositionality is not necessary for generalization, it provides an advantage in terms of language transmission: The more compositional a language is, the more easily it will be picked up by new learners, even when the latter differ in architecture from the original agents. We conclude that compositionality does not arise from simple generalization pressure, but if an emergent language does chance upon it, it will be more likely to survive and thrive.

2019

pdf bib
Word-order Biases in Deep-agent Emergent Communication
Rahma Chaabouni | Eugene Kharitonov | Alessandro Lazaric | Emmanuel Dupoux | Marco Baroni
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Sequence-processing neural networks led to remarkable progress on many NLP tasks. As a consequence, there has been increasing interest in understanding to what extent they process language as humans do. We aim here to uncover which biases such models display with respect to “natural” word-order constraints. We train models to communicate about paths in a simple gridworld, using miniature languages that reflect or violate various natural language trends, such as the tendency to avoid redundancy or to minimize long-distance dependencies. We study how the controlled characteristics of our miniature languages affect individual learning and their stability across multiple network generations. The results draw a mixed picture. On the one hand, neural networks show a strong tendency to avoid long-distance dependencies. On the other hand, there is no clear preference for the efficient, non-redundant encoding of information that is widely attested in natural language. We thus suggest inoculating a notion of “effort” into neural networks, as a possible way to make their linguistic behavior more human-like.

pdf bib
SyntaxFest 2019 Invited talk - Inductive biases and language emergence in communicative agents
Emmanuel Dupoux
Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019)

2018

pdf bib
BabyCloud, a Technological Platform for Parents and Researchers
Xuân-Nga Cao | Cyrille Dakhlia | Patricia Del Carmen | Mohamed-Amine Jaouani | Malik Ould-Arbi | Emmanuel Dupoux
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
The Role of Prosody and Speech Register in Word Segmentation: A Computational Modelling Perspective
Bogdan Ludusan | Reiko Mazuka | Mathieu Bernard | Alejandrina Cristia | Emmanuel Dupoux
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This study explores the role of speech register and prosody for the task of word segmentation. Since these two factors are thought to play an important role in early language acquisition, we aim to quantify their contribution for this task. We study a Japanese corpus containing both infant- and adult-directed speech and we apply four different word segmentation models, with and without knowledge of prosodic boundaries. The results showed that the difference between registers is smaller than previously reported and that prosodic boundary information helps more adult- than infant-directed speech.

pdf bib
Blind Phoneme Segmentation With Temporal Prediction Errors
Paul Michel | Okko Rasanen | Roland Thiollière | Emmanuel Dupoux
Proceedings of ACL 2017, Student Research Workshop

pdf bib
Comparing Character-level Neural Language Models Using a Lexical Decision Task
Gaël Le Godais | Tal Linzen | Emmanuel Dupoux
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

What is the information captured by neural network models of language? We address this question in the case of character-level recurrent neural language models. These models do not have explicit word representations; do they acquire implicit ones? We assess the lexical capacity of a network using the lexical decision task common in psycholinguistics: the system is required to decide whether or not a string of characters forms a word. We explore how accuracy on this task is affected by the architecture of the network, focusing on cell type (LSTM vs. SRN), depth and width. We also compare these architectural properties to a simple count of the parameters of the network. The overall number of parameters in the network turns out to be the most important predictor of accuracy; in particular, there is little evidence that deeper networks are beneficial for this task.

2016

pdf bib
Quantificational features in distributional word representations
Tal Linzen | Emmanuel Dupoux | Benjamin Spector
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

pdf bib
Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies
Tal Linzen | Emmanuel Dupoux | Yoav Goldberg
Transactions of the Association for Computational Linguistics, Volume 4

The success of long short-term memory (LSTM) neural networks in language processing is typically attributed to their ability to capture long-distance statistical regularities. Linguistic regularities are often sensitive to syntactic structure; can such dependencies be captured by LSTMs, which do not have explicit structural representations? We begin addressing this question using number agreement in English subject-verb dependencies. We probe the architecture’s grammatical competence both using training objectives with an explicit grammatical target (number prediction, grammaticality judgments) and using language models. In the strongly supervised settings, the LSTM achieved very high overall accuracy (less than 1% errors), but errors increased when sequential and structural information conflicted. The frequency of such errors rose sharply in the language-modeling setting. We conclude that LSTMs can capture a non-trivial amount of grammatical structure given targeted supervision, but stronger architectures may be required to further reduce errors; furthermore, the language modeling signal is insufficient for capturing syntax-sensitive dependencies, and should be supplemented with more direct supervision if such dependencies need to be captured.

2015

pdf bib
Motif discovery in infant- and adult-directed speech
Bogdan Ludusan | Amanda Seidl | Emmanuel Dupoux | Alex Cristia
Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning

pdf bib
Sign constraints on feature weights improve a joint model of word segmentation and phonology
Mark Johnson | Joe Pater | Robert Staubs | Emmanuel Dupoux
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Prosodic boundary information helps unsupervised word segmentation
Bogdan Ludusan | Gabriel Synnaeve | Emmanuel Dupoux
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

pdf bib
Bridging the gap between speech technology and natural language processing: an evaluation toolbox for term discovery systems
Bogdan Ludusan | Maarten Versteegh | Aren Jansen | Guillaume Gravier | Xuan-Nga Cao | Mark Johnson | Emmanuel Dupoux
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The unsupervised discovery of linguistic terms from either continuous phoneme transcriptions or from raw speech has seen an increasing interest in the past years both from a theoretical and a practical standpoint. Yet, there exists no common accepted evaluation method for the systems performing term discovery. Here, we propose such an evaluation toolbox, drawing ideas from both speech technology and natural language processing. We first transform the speech-based output into a symbolic representation and compute five types of evaluation metrics on this representation: the quality of acoustic matching, the quality of the clusters found, and the quality of the alignment with real words (type, token, and boundary scores). We tested our approach on two term discovery systems taking speech as input, and one using symbolic input. The latter was run using both the gold transcription and a transcription obtained from an automatic speech recognizer, in order to simulate the case when only imperfect symbolic information is available. The results obtained are analysed through the use of the proposed evaluation metrics and the implications of these metrics are discussed.

pdf bib
A Rudimentary Lexicon and Semantics Help Bootstrap Phoneme Acquisition
Abdellah Fourtassi | Emmanuel Dupoux
Proceedings of the Eighteenth Conference on Computational Natural Language Learning

pdf bib
Modelling function words improves unsupervised word segmentation
Mark Johnson | Anne Christophe | Emmanuel Dupoux | Katherine Demuth
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Exploring the Relative Role of Bottom-up and Top-down Information in Phoneme Learning
Abdellah Fourtassi | Thomas Schatz | Balakrishnan Varadarajan | Emmanuel Dupoux
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Unsupervised Word Segmentation in Context
Gabriel Synnaeve | Isabelle Dautriche | Benjamin Börschinger | Mark Johnson | Emmanuel Dupoux
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

pdf bib
Why is English so easy to segment?
Abdellah Fourtassi | Benjamin Börschinger | Mark Johnson | Emmanuel Dupoux
Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics (CMCL)

pdf bib
A corpus-based evaluation method for Distributional Semantic Models
Abdellah Fourtassi | Emmanuel Dupoux
51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop

2011

pdf bib
Testing the Robustness of Online Word Segmentation: Effects of Linguistic Diversity and Phonetic Variation
Luc Boruta | Sharon Peperkamp | Benoît Crabbé | Emmanuel Dupoux
Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics

2008

pdf bib
Unsupervised Learning of Acoustic Sub-word Units
Balakrishnan Varadarajan | Sanjeev Khudanpur | Emmanuel Dupoux
Proceedings of ACL-08: HLT, Short Papers