@inproceedings{tian-etal-2022-dual,
    title = "Dual Capsule Attention Mask Network with Mutual Learning for Visual Question Answering",
    author = "Tian, Weidong  and
      Li, Haodong  and
      Zhao, Zhong-Qiu",
    editor = "Calzolari, Nicoletta  and
      Huang, Chu-Ren  and
      Kim, Hansaem  and
      Pustejovsky, James  and
      Wanner, Leo  and
      Choi, Key-Sun  and
      Ryu, Pum-Mo  and
      Chen, Hsin-Hsi  and
      Donatelli, Lucia  and
      Ji, Heng  and
      Kurohashi, Sadao  and
      Paggio, Patrizia  and
      Xue, Nianwen  and
      Kim, Seokhwan  and
      Hahm, Younggyun  and
      He, Zhong  and
      Lee, Tony Kyungil  and
      Santus, Enrico  and
      Bond, Francis  and
      Na, Seung-Hoon",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2022.coling-1.500/",
    pages = "5678--5688",
    abstract = "A Visual Question Answering (VQA) model processes images and questions simultaneously with rich semantic information. The attention mechanism can highlight fine-grained features with critical information, thus ensuring that feature extraction emphasizes the objects related to the questions. However, unattended coarse-grained information is also essential for questions involving global elements. We believe that global coarse-grained information and local fine-grained information can complement each other to provide richer comprehensive information. In this paper, we propose a dual capsule attention mask network with mutual learning for VQA. Specifically, it contains two branches processing coarse-grained features and fine-grained features, respectively. We also design a novel stackable dual capsule attention module to fuse features and locate evidence. The two branches are combined to make final predictions for VQA. Experimental results show that our method outperforms the baselines in terms of VQA performance and interpretability and achieves new SOTA performance on the VQA-v2 dataset."
}