On Generative Spoken Language Modeling from Raw Audio

Kushal Lakhotia; Eugene Kharitonov; Wei-Ning Hsu; Yossi Adi; Adam Polyak; Benjamin Bolte; Tu-Anh Nguyen; Jade Copet; Alexei Baevski; Abdelrahman Mohamed; Emmanuel Dupoux

doi:10.1162/tacl_a_00430

Abstract

We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text) all trained without supervision and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
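
As a rough illustration of what "pseudo-text units" could look like, the sketch below maps continuous frame features to the index of their nearest codebook entry and then merges consecutive repeats. It is an assumption-laden toy, not the authors' implementation: the abstract does not say how units are obtained, so the nearest-centroid quantization, the repeat-collapsing step, the function names `nearest_unit` and `collapse_repeats`, and the tiny 3-entry codebook are all hypothetical stand-ins (the abstract's systems use 50, 100, or 200 units on top of CPC, wav2vec 2.0, or HuBERT features).

```python
def nearest_unit(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(frame, centroids[k])),
    )


def collapse_repeats(units):
    """Merge runs of identical consecutive units into a single unit."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]


# Toy stand-ins (assumptions): a real system would use encoder outputs as
# frames and a learned codebook with 50, 100, or 200 entries.
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
frames = [(0.1, -0.1), (0.2, 0.0), (0.9, 1.1), (1.0, 0.8), (4.8, 5.2), (0.0, 0.1)]

units = [nearest_unit(f, centroids) for f in frames]  # one unit per frame
pseudo_text = collapse_repeats(units)                 # shorter unit sequence
print(units, pseudo_text)
```

In a pipeline of the kind the abstract describes, a sequence like `pseudo_text` would play the role of text: the language model would be trained on such sequences, and the speech decoder would turn them back into a waveform.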

Anthology ID: 2021.tacl-1.79
Volume: Transactions of the Association for Computational Linguistics, Volume 9
Year: 2021
Address: Cambridge, MA
Editors: Brian Roark, Ani Nenkova
Venue: TACL
Publisher: MIT Press
Pages: 1336–1354
URL: https://aclanthology.org/2021.tacl-1.79
DOI: 10.1162/tacl_a_00430
Cite (ACL): Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, and Emmanuel Dupoux. 2021. On Generative Spoken Language Modeling from Raw Audio. Transactions of the Association for Computational Linguistics, 9:1336–1354.
Cite (Informal): On Generative Spoken Language Modeling from Raw Audio (Lakhotia et al., TACL 2021)
PDF: https://preview.aclanthology.org/nschneid-patch-1/2021.tacl-1.79.pdf
Video: https://preview.aclanthology.org/nschneid-patch-1/2021.tacl-1.79.mp4