Zhongfei Zhang


2020

pdf
Dual Low-Rank Multimodal Fusion
Tao Jin | Siyu Huang | Yingming Li | Zhongfei Zhang
Findings of the Association for Computational Linguistics: EMNLP 2020

Tensor-based fusion methods have been proven effective in multimodal fusion tasks. However, existing tensor-based methods make a poor use of the fine-grained temporal dynamics of multimodal sequential features. Motivated by this observation, this paper proposes a novel multimodal fusion method called Fine-Grained Temporal Low-Rank Multimodal Fusion (FT-LMF). FT-LMF correlates the features of individual time steps between multiple modalities, while it involves multiplications of high-order tensors in its calculation. This paper further proposes Dual Low-Rank Multimodal Fusion (Dual-LMF) to reduce the computational complexity of FT-LMF through low-rank tensor approximation along dual dimensions of input features. Dual-LMF is conceptually simple and practically effective and efficient. Empirical studies on benchmark multimodal analysis tasks show that our proposed methods outperform the state-of-the-art tensor-based fusion methods with a similar computational complexity.

pdf
BERT-enhanced Relational Sentence Ordering Network
Baiyun Cui | Yingming Li | Zhongfei Zhang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In this paper, we introduce a novel BERT-enhanced Relational Sentence Ordering Network (referred to as BRSON) by leveraging BERT for capturing better dependency relationship among sentences to enhance the coherence modeling for the entire paragraph. In particular, we develop a new Relational Pointer Decoder (referred as RPD) by incorporating the relative ordering information into the pointer network with a Deep Relational Module (referred as DRM), which utilizes BERT to exploit the deep semantic connection and relative ordering between sentences.This enables us to strengthen both local and global dependencies among sentences. Extensive evaluations are conducted on six public datasets. The experimental results demonstrate the effectiveness and promise of our BRSON, showing a significant improvement over the state-of-the-art by a wide margin.

2019

pdf
Low-Rank HOCA: Efficient High-Order Cross-Modal Attention for Video Captioning
Tao Jin | Siyu Huang | Yingming Li | Zhongfei Zhang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

This paper addresses the challenging task of video captioning which aims to generate descriptions for video data. Recently, the attention-based encoder-decoder structures have been widely used in video captioning. In existing literature, the attention weights are often built from the information of an individual modality, while, the association relationships between multiple modalities are neglected. Motivated by this, we propose a video captioning model with High-Order Cross-Modal Attention (HOCA) where the attention weights are calculated based on the high-order correlation tensor to capture the frame-level cross-modal interaction of different modalities sufficiently. Furthermore, we novelly introduce Low-Rank HOCA which adopts tensor decomposition to reduce the extremely large space requirement of HOCA, leading to a practical and efficient implementation in real-world applications. Experimental results on two benchmark datasets, MSVD and MSR-VTT, show that Low-rank HOCA establishes a new state-of-the-art.

pdf
Fine-tune BERT with Sparse Self-Attention Mechanism
Baiyun Cui | Yingming Li | Ming Chen | Zhongfei Zhang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

In this paper, we develop a novel Sparse Self-Attention Fine-tuning model (referred as SSAF) which integrates sparsity into self-attention mechanism to enhance the fine-tuning performance of BERT. In particular, sparsity is introduced into the self-attention by replacing softmax function with a controllable sparse transformation when fine-tuning with BERT. It enables us to learn a structurally sparse attention distribution, which leads to a more interpretable representation for the whole input. The proposed model is evaluated on sentiment analysis, question answering, and natural language inference tasks. The extensive experimental results across multiple datasets demonstrate its effectiveness and superiority to the baseline methods.

2018

pdf
Deep Attentive Sentence Ordering Network
Baiyun Cui | Yingming Li | Ming Chen | Zhongfei Zhang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In this paper, we propose a novel deep attentive sentence ordering network (referred as ATTOrderNet) which integrates self-attention mechanism with LSTMs in the encoding of input sentences. It enables us to capture global dependencies among sentences regardless of their input order and obtains a reliable representation of the sentence set. With this representation, a pointer network is exploited to generate an ordered sequence. The proposed model is evaluated on Sentence Ordering and Order Discrimination tasks. The extensive experimental results demonstrate its effectiveness and superiority to the state-of-the-art methods.