Open Your Model’s Eyes: Video and Context-Aware Multimodal Backchannel Prediction

Min-Jae Kim, Jun-Yeong Moon, Mujeen Sung, Gyeong-Moon Park


Abstract
Backchannels, which signal listener states like empathy and understanding, are fundamental to natural human interaction. However, current approaches rely solely on audio and text. This omits crucial visual cues, such as facial expressions and gestures, as well as broader conversational contexts, which are necessary for accurate prediction. In this paper, we introduce Context-Aware Multimodal Alignment for Backchannel Prediction (CAMA-BC), a novel framework that leverages visual information through Multi-layer Multimodal Alignment (MMA). Our alignment process comprises two stages. First, Context Alignment (MMA-CA) utilizes unlabeled dialogues with videos to capture conversational contexts. Next, Backchannel Alignment (MMA-BA) fine-tunes the representations specifically for backchannel prediction. Experimental results show that CAMA-BC significantly outperforms both existing methods and simple multimodal baselines, with particular effectiveness in recognizing complex backchannels such as empathy.
Anthology ID:
2026.acl-long.171
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
3738–3755
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.171/
Cite (ACL):
Min-Jae Kim, Jun-Yeong Moon, Mujeen Sung, and Gyeong-Moon Park. 2026. Open Your Model’s Eyes: Video and Context-Aware Multimodal Backchannel Prediction. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3738–3755, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Open Your Model’s Eyes: Video and Context-Aware Multimodal Backchannel Prediction (Kim et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.171.pdf
Checklist:
 2026.acl-long.171.checklist.pdf