Abstract
Recent progress in Spoken Language Modeling has shown that learning language directly from speech is feasible. Generating speech through a pipeline that operates at the text level typically loses nuances, intonations, and non-verbal vocalizations. Modeling directly from speech opens the path to more natural and expressive systems. However, speech-only systems require up to three orders of magnitude more data to match the semantic abilities of their text-based counterparts. We show that fine-tuning speech representation models on phoneme classification leads to more context-invariant representations, and that language models trained on these units achieve lexical comprehension comparable to that of models trained on a hundred times more data.
- Anthology ID:
- 2024.emnlp-main.302
- Volume:
- Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 5284–5292
- URL:
- https://aclanthology.org/2024.emnlp-main.302
- DOI:
- 10.18653/v1/2024.emnlp-main.302
- Cite (ACL):
- Maxime Poli, Emmanuel Chemla, and Emmanuel Dupoux. 2024. Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5284–5292, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach (Poli et al., EMNLP 2024)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2024.emnlp-main.302.pdf