Guiding the Flowing of Semantics: Interpretable Video Captioning via POS Tag

Xinyu Xiao, Lingfeng Wang, Bin Fan, Shinming Xiang, Chunhong Pan


Abstract
In current video captioning models, the video frames are collected in one network and their semantics are mixed into a single feature, which not only increases the difficulty of caption decoding but also decreases the interpretability of the captioning model. To address these problems, we propose an Adaptive Semantic Guidance Network (ASGN), which instantiates the semantics of the whole video as distinct POS-aware semantics under the supervision of part-of-speech (POS) tags. In the encoding process, the POS tag activates the related neurons and parses the whole semantic information into corresponding encoded video representations. Furthermore, the POS-aware video features stimulate the potential of the model. In the decoding process, the video features related to nouns and verbs are used as supervision to construct a new adaptive attention model that can decide whether or not to attend to the video feature. By explicitly improving the interpretability of the network, the learning process becomes more transparent and the results more predictable. Extensive experiments demonstrate the effectiveness of our model compared with state-of-the-art models.
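The idea of an attention model that can decide whether or not to attend to the video feature can be illustrated with a generic gated-attention sketch. This is not the ASGN formulation from the paper: the function name `adaptive_attention`, the dot-product scoring, the `sentinel` language-only fallback vector, and the gate parameters `gate_w` and `gate_b` are all assumptions introduced here for illustration, and the POS-tag supervision of the gate described in the abstract is not modeled.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def adaptive_attention(hidden, frame_feats, sentinel, gate_w, gate_b=0.0):
    """Hypothetical gated attention step (illustrative, not the paper's model).

    hidden      -- decoder state, list of floats
    frame_feats -- list of per-frame feature vectors, same length as hidden
    sentinel    -- assumed language-only fallback vector
    gate_w/b    -- assumed parameters of a scalar sigmoid gate
    """
    # Soft attention over frames using dot-product scores.
    weights = softmax([dot(hidden, f) for f in frame_feats])
    dim = len(hidden)
    context = [
        sum(w * f[d] for w, f in zip(weights, frame_feats)) for d in range(dim)
    ]
    # beta near 1: rely on the video context; beta near 0: rely on the sentinel.
    beta = 1.0 / (1.0 + math.exp(-(dot(gate_w, hidden) + gate_b)))
    mixed = [beta * c + (1.0 - beta) * s for c, s in zip(context, sentinel)]
    return mixed, beta, weights
```

In a trained captioner the gate would typically be learned; in a setup like the one the abstract describes, one would expect it to open for visually grounded words such as nouns and verbs and to close for function words, but that behavior is an expectation about such models rather than something this sketch demonstrates.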
Anthology ID:
D19-1213
Volume:
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Month:
November
Year:
2019
Address:
Hong Kong, China
Editors:
Kentaro Inui, Jing Jiang, Vincent Ng, Xiaojun Wan
Venues:
EMNLP | IJCNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
2068–2077
URL:
https://aclanthology.org/D19-1213
DOI:
10.18653/v1/D19-1213
Cite (ACL):
Xinyu Xiao, Lingfeng Wang, Bin Fan, Shinming Xiang, and Chunhong Pan. 2019. Guiding the Flowing of Semantics: Interpretable Video Captioning via POS Tag. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2068–2077, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):
Guiding the Flowing of Semantics: Interpretable Video Captioning via POS Tag (Xiao et al., EMNLP-IJCNLP 2019)
PDF:
https://preview.aclanthology.org/ml4al-ingestion/D19-1213.pdf