Rory Beard


2018

Multi-Modal Sequence Fusion via Recursive Attention for Emotion Recognition
Rory Beard | Ritwik Das | Raymond W. M. Ng | P. G. Keerthana Gopalakrishnan | Luka Eerens | Pawel Swietojanski | Ondrej Miksik
Proceedings of the 22nd Conference on Computational Natural Language Learning

Natural human communication is nuanced and inherently multi-modal. Humans possess specialised sensoria for processing vocal, visual, linguistic, and para-linguistic information, but form an intricately fused percept of the multi-modal data stream to provide a holistic representation. Analysis of emotional content in face-to-face communication is a cognitive task to which humans are particularly attuned, given its sociological importance, and poses a difficult challenge for machine emulation due to the subtlety and expressive variability of cross-modal cues. Inspired by the empirical success of recent so-called End-To-End Memory Networks and related works, we propose an approach based on recursive multi-attention with a shared external memory updated over multiple gated iterations of analysis. We evaluate our model across several large multi-modal datasets and show that global contextualised memory with gated memory update can effectively achieve emotion recognition.
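
The abstract gives no equations, so the following is only a rough sketch of what recursive attention over several modalities with a gated shared memory could look like. It is not the authors' model. The function names (`attend`, `fuse`), the dot-product scoring, the mean-based memory initialisation, the scalar sigmoid gate, and the toy feature vectors are all assumptions made for illustration, and the sketch has no learned parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(memory, sequence):
    """Dot-product attention: weight each time step by its similarity to memory."""
    weights = softmax([dot(memory, step) for step in sequence])
    dim = len(memory)
    return [sum(w * step[i] for w, step in zip(weights, sequence)) for i in range(dim)]


def fuse(modalities, hops=3):
    """Refine one shared memory vector over several gated attention hops.

    `modalities` maps a modality name to a sequence of equally sized
    feature vectors (hypothetical inputs, e.g. audio and text frames).
    """
    dim = len(next(iter(modalities.values()))[0])
    # Assumption: start the memory at the mean of all time steps.
    all_steps = [step for seq in modalities.values() for step in seq]
    memory = [sum(step[i] for step in all_steps) / len(all_steps) for i in range(dim)]
    for _ in range(hops):
        # Each modality is queried with the same shared memory.
        contexts = [attend(memory, seq) for seq in modalities.values()]
        # Assumption: fuse the per-modality contexts by simple averaging.
        candidate = [sum(c[i] for c in contexts) / len(contexts) for i in range(dim)]
        # Assumption: a scalar sigmoid gate decides how much memory to overwrite.
        gate = 1.0 / (1.0 + math.exp(-dot(memory, candidate)))
        memory = [gate * c + (1.0 - gate) * m for c, m in zip(candidate, memory)]
    return memory


# Toy usage with made-up two-dimensional features.
features = {
    "audio": [[1.0, 0.0], [0.0, 1.0]],
    "text": [[0.5, 0.5]],
}
fused = fuse(features, hops=3)
```

In a trained system the scoring function, the fusion step, and the gate would each carry learned weights, and the final memory would feed an emotion classifier; none of those details are stated in the abstract.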