@inproceedings{millet-dunbar-2022-self,
title = "Do self-supervised speech models develop human-like perception biases?",
author = "Millet, Juliette and
Dunbar, Ewan",
editor = "Muresan, Smaranda and
Nakov, Preslav and
Villavicencio, Aline",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/fix-sig-urls/2022.acl-long.523/",
doi = "10.18653/v1/2022.acl-long.523",
pages = "7591--7605",
abstract = "Self-supervised models for speech processing form representational spaces without using any external labels. Increasingly, they appear to be a feasible way of at least partially eliminating costly manual annotations, a problem of particular concern for low-resource languages. But what kind of representational spaces do these models construct?Human perception specializes to the sounds of listeners' native languages. Does the same thing happen in self-supervised models? We examine the representational spaces of three kinds of state of the art self-supervised models: wav2vec, HuBERT and contrastive predictive coding (CPC), and compare them with the perceptual spaces of French-speaking and English-speaking human listeners, both globally and taking account of the behavioural differences between the two language groups. We show that the CPC model shows a small native language effect, but that wav2vec and HuBERT seem to develop a universal speech perception space which is not language specific. A comparison against the predictions of supervised phone recognisers suggests that all three self-supervised models capture relatively fine-grained perceptual phenomena, while supervised models are better at capturing coarser, phone-level effects, and effects of listeners' native language, on perception."
}