Zero-Shot End-to-End Spoken Language Understanding via Cross-Modal Selective Self-Training

Jianfeng He, Julian Salazar, Kaisheng Yao, Haoqi Li, Jason Cai


Abstract
End-to-end (E2E) spoken language understanding (SLU) is constrained by the cost of collecting speech-semantics pairs, especially when label domains change. Hence, we explore zero-shot E2E SLU, which learns E2E SLU without speech-semantics pairs, instead using only speech-text and text-semantics pairs. Previous work achieved zero-shot learning by pseudolabeling all speech-text transcripts with a natural language understanding (NLU) model learned on text-semantics corpora. However, this method requires the domains of the speech-text and text-semantics corpora to match, and they often mismatch because the corpora are collected separately. Furthermore, using an entire speech-text corpus collected from arbitrary domains leads to imbalance and noise issues. To address these, we propose cross-modal selective self-training (CMSST). CMSST tackles imbalance by clustering in a joint space of the three modalities (speech, text, and semantics) and handles label noise with a selection network. We also introduce two benchmarks for zero-shot E2E SLU, covering matched and found speech (mismatched) settings. Experiments show that CMSST improves performance in both settings, with significantly reduced sample sizes and training time. Our code and data are released at https://github.com/amazon-science/zero-shot-E2E-slu.
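The abstract's imbalance-handling step, clustering samples in a joint embedding space and capping how many are drawn from each cluster, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name, the plain k-means routine, and the per-cluster cap are all assumptions made here for clarity.

```python
import numpy as np

def balanced_select(embeddings, k=3, per_cluster=2, iters=20, seed=0):
    """Cluster joint-space embeddings (e.g. concatenated speech/text/semantics
    vectors) with a tiny k-means, then keep at most `per_cluster` samples per
    cluster so that over-represented regions do not dominate training.
    Illustrative sketch only; not the CMSST implementation."""
    rng = np.random.default_rng(seed)
    X = np.asarray(embeddings, dtype=float)
    # Initialize centroids from k distinct random samples.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each sample to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned samples.
        for c in range(k):
            members = X[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Balance: draw at most `per_cluster` random indices from each cluster.
    selected = []
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        selected.extend(rng.permutation(idx)[:per_cluster].tolist())
    return sorted(selected)
```

In the paper's setting, the embeddings would come from the three modalities jointly; here any fixed-dimensional vectors work, and the cap makes the selected subset roughly uniform across clusters regardless of how skewed the raw corpus is.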
Anthology ID:
2024.eacl-long.137
Volume:
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Yvette Graham, Matthew Purver
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
2239–2256
URL:
https://aclanthology.org/2024.eacl-long.137
Cite (ACL):
Jianfeng He, Julian Salazar, Kaisheng Yao, Haoqi Li, and Jason Cai. 2024. Zero-Shot End-to-End Spoken Language Understanding via Cross-Modal Selective Self-Training. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2239–2256, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
Zero-Shot End-to-End Spoken Language Understanding via Cross-Modal Selective Self-Training (He et al., EACL 2024)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2024.eacl-long.137.pdf
Software:
 2024.eacl-long.137.software.zip
Note:
 2024.eacl-long.137.note.zip
Video:
 https://preview.aclanthology.org/nschneid-patch-5/2024.eacl-long.137.mp4