Abstract
Sounding source localization is a challenging cross-modal task due to the difficulty of aligning audio and visual modalities. Although supervised cross-modal methods achieve encouraging performance, the heavy manual annotation they require is expensive and inefficient. It is therefore valuable to develop unsupervised solutions. In this paper, we propose an **U**nsupervised **S**ounding **P**ixel **L**earning (USPL) approach, which enables pixel-level sounding source localization in an unsupervised paradigm. We first design mask-augmentation-based multi-instance contrastive learning for unsupervised cross-modal coarse localization, which aligns audio-visual features to obtain coarse sounding maps. Second, we present an *Unsupervised Sounding Map Refinement (SMR)* module, which employs visual semantic affinity learning to explore inter-pixel relations among adjacent coordinate features; this helps recover the boundaries of the coarse sounding maps and produce fine sounding maps. Finally, a *Sounding Pixel Segmentation (SPS)* module is presented to realize audio-supervised semantic segmentation. Extensive experiments on the AVSBench-S4 and VGGSound datasets show encouraging results compared with previous state-of-the-art methods.
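To make the coarse-localization step concrete, below is a minimal sketch of how audio-visual contrastive alignment can yield a coarse sounding map: per-pixel cosine similarity between an audio embedding and visual features, trained with a standard InfoNCE-style objective. This is an illustrative stand-in, not the paper's USPL implementation; the tensor shapes, function names (`coarse_sounding_map`, `contrastive_alignment_loss`), and the plain InfoNCE loss (in place of the paper's mask-augmentation multi-instance scheme) are assumptions.

```python
# Illustrative sketch only -- generic audio-visual contrastive alignment,
# NOT the authors' USPL method. Shapes and names are assumptions.
import torch
import torch.nn.functional as F


def coarse_sounding_map(visual_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
    """Per-pixel cosine similarity between the audio embedding and visual features.

    visual_feat: (B, C, H, W) per-pixel visual features
    audio_feat:  (B, C)       clip-level audio features
    returns:     (B, H, W)    coarse sounding map in [-1, 1]
    """
    v = F.normalize(visual_feat, dim=1)        # normalize along the channel dim
    a = F.normalize(audio_feat, dim=1)         # normalize the audio embedding
    return torch.einsum("bchw,bc->bhw", v, a)  # cosine similarity at every pixel


def contrastive_alignment_loss(visual_feat, audio_feat, tau: float = 0.07):
    """InfoNCE-style loss: matching audio-visual pairs in a batch should be
    more similar than mismatched pairs. Returns the loss and the coarse map."""
    smap = coarse_sounding_map(visual_feat, audio_feat)   # (B, H, W)
    v = F.normalize(visual_feat.mean(dim=(2, 3)), dim=1)  # pooled visual feature (B, C)
    a = F.normalize(audio_feat, dim=1)                    # audio feature (B, C)
    logits = v @ a.t() / tau                              # (B, B) pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets), smap


if __name__ == "__main__":
    # Random tensors stand in for real audio/visual encoder outputs.
    vis = torch.randn(4, 128, 14, 14)
    aud = torch.randn(4, 128)
    loss, smap = contrastive_alignment_loss(vis, aud)
    print(loss.item(), smap.shape)  # scalar loss, torch.Size([4, 14, 14])
```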
- Anthology ID: 2023.emnlp-main.777
- Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
- Month: December
- Year: 2023
- Address: Singapore
- Editors: Houda Bouamor, Juan Pino, Kalika Bali
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 12610–12620
- URL: https://aclanthology.org/2023.emnlp-main.777
- DOI: 10.18653/v1/2023.emnlp-main.777
- Cite (ACL): Yining Zhang, Yanli Ji, and Yang Yang. 2023. Unsupervised Sounding Pixel Learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12610–12620, Singapore. Association for Computational Linguistics.
- Cite (Informal): Unsupervised Sounding Pixel Learning (Zhang et al., EMNLP 2023)
- PDF: https://preview.aclanthology.org/nschneid-patch-5/2023.emnlp-main.777.pdf