@inproceedings{zhang-etal-2024-unsupervised,
    title = "Unsupervised Multimodal Clustering for Semantics Discovery in Multimodal Utterances",
    author = "Zhang, Hanlei  and
      Xu, Hua  and
      Long, Fei  and
      Wang, Xin  and
      Gao, Kai",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2024.acl-long.2/",
    doi = "10.18653/v1/2024.acl-long.2",
    pages = "18--35",
    abstract = "Discovering the semantics of multimodal utterances is essential for understanding human language and enhancing human-machine interactions. Existing methods manifest limitations in leveraging nonverbal information for discerning complex semantics in unsupervised scenarios. This paper introduces a novel unsupervised multimodal clustering method (UMC), making a pioneering contribution to this field. UMC introduces a unique approach to constructing augmentation views for multimodal data, which are then used to perform pre-training to establish well-initialized representations for subsequent clustering. An innovative strategy is proposed to dynamically select high-quality samples as guidance for representation learning, gauged by the density of each sample{'}s nearest neighbors. Besides, it is equipped to automatically determine the optimal value for the top-$K$ parameter in each cluster to refine sample selection. Finally, both high- and low-quality samples are used to learn representations conducive to effective clustering. We build baselines on benchmark multimodal intent and dialogue act datasets. UMC shows remarkable improvements of 2-6{\%} scores in clustering metrics over state-of-the-art methods, marking the first successful endeavor in this domain. The complete code and data are available at https://github.com/thuiar/UMC."
}