Towards Noise-Tolerant Speech-Referring Video Object Segmentation: Bridging Speech and Text
Xiang Li, Jinglu Wang, Xiaohao Xu, Muqiao Yang, Fan Yang, Yizhou Zhao, Rita Singh, Bhiksha Raj
Abstract
Linguistic communication is prevalent in Human-Computer Interaction (HCI). Speech (spoken language) serves as a convenient yet potentially ambiguous form due to noise and accents, exposing a gap compared to text. In this study, we investigate the prominent HCI task, Referring Video Object Segmentation (R-VOS), which aims to segment and track objects using linguistic references. While text input is well-investigated, speech input is under-explored. Our objective is to bridge the gap between speech and text, enabling the adaptation of existing text-input R-VOS models to accommodate noisy speech input effectively. Specifically, we propose a method to align the semantic spaces between speech and text by incorporating two key modules: 1) Noise-Aware Semantic Adjustment (NSA) for clear semantics extraction from noisy speech; and 2) Semantic Jitter Suppression (SJS) enabling R-VOS models to tolerate noisy queries. Comprehensive experiments conducted on the challenging AVOS benchmarks reveal that our proposed method outperforms state-of-the-art approaches.
- Anthology ID:
- 2023.emnlp-main.140
- Volume:
- Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
- Month:
- December
- Year:
- 2023
- Address:
- Singapore
- Editors:
- Houda Bouamor, Juan Pino, Kalika Bali
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 2283–2296
- URL:
- https://aclanthology.org/2023.emnlp-main.140
- DOI:
- 10.18653/v1/2023.emnlp-main.140
- Cite (ACL):
- Xiang Li, Jinglu Wang, Xiaohao Xu, Muqiao Yang, Fan Yang, Yizhou Zhao, Rita Singh, and Bhiksha Raj. 2023. Towards Noise-Tolerant Speech-Referring Video Object Segmentation: Bridging Speech and Text. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2283–2296, Singapore. Association for Computational Linguistics.
- Cite (Informal):
- Towards Noise-Tolerant Speech-Referring Video Object Segmentation: Bridging Speech and Text (Li et al., EMNLP 2023)
- PDF:
- https://preview.aclanthology.org/ingest-acl-2023-videos/2023.emnlp-main.140.pdf