Adam Polyak


2021

pdf bib
On Generative Spoken Language Modeling from Raw Audio
Kushal Lakhotia | Eugene Kharitonov | Wei-Ning Hsu | Yossi Adi | Adam Polyak | Benjamin Bolte | Tu-Anh Nguyen | Jade Copet | Alexei Baevski | Abdelrahman Mohamed | Emmanuel Dupoux
Transactions of the Association for Computational Linguistics, Volume 9

Abstract We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo- text), and a speech decoder (generating a waveform from pseudo-text) all trained without supervision and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder- dependent way, and that some combinations approach text-based systems.1

pdf bib
fairseq Sˆ2: A Scalable and Integrable Speech Synthesis Toolkit
Changhan Wang | Wei-Ning Hsu | Yossi Adi | Adam Polyak | Ann Lee | Peng-Jen Chen | Jiatao Gu | Juan Pino
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

This paper presents fairseq Sˆ2, a fairseq extension for speech synthesis. We implement a number of autoregressive (AR) and non-AR text-to-speech models, and their multi-speaker variants. To enable training speech synthesis models with less curated data, a number of preprocessing tools are built and their importance is shown empirically. To facilitate faster iteration of development and analysis, a suite of automatic metrics is included. Apart from the features added specifically for this extension, fairseq Sˆ2 also benefits from the scalability offered by fairseq and can be easily integrated with other state-of-the-art systems provided in this framework. The code, documentation, and pre-trained models will be made available at https://github.com/pytorch/fairseq/tree/master/examples/speech_synthesis.