Coarse-to-Fine Multimodal Information Selection for Video Speaking Style Recognition with Large Language Models

Beibei Zhang, Yanan Lu, Lin Fen, Tongwei Ren


Abstract
Video Speaking Style Recognition (VSSR) aims to classify conversation videos into different types, which facilitates the understanding of human interaction. Recent approaches explore the potential of large language models (LLMs) for VSSR in a training-free manner. However, directly integrating all multimodal data yields suboptimal results, because the heavy redundancy in visual data can overshadow other valuable information, such as textual dialogue and critical visual clues. To address this, we propose CFMiS (Coarse-to-Fine Multimodal Information Selection), a novel framework for VSSR that dynamically obtains valuable multimodal data via coarse-to-fine selection, enhancing LLM reasoning. Specifically, the core of CFMiS consists of two cascaded modules: 1) a text-dominant modality selection module first selects the modalities required for VSSR, derived from a text-based prediction; and 2) if vision is among the selected modalities, a visual refinement module iteratively collects VSSR-relevant critical visual clues. The former resolves which modalities to use, while the latter determines which information to adopt from the selected modalities, efficiently alleviating information redundancy. Extensive experiments on multiple datasets show that CFMiS is highly effective for VSSR, outperforming all existing training-free approaches and most training-based methods.
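The two-stage cascade described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the function `coarse_to_fine`, the three callables it takes (`select_modalities`, `score_clue`, `classify`), the greedy selection rule, and the `max_rounds` and `threshold` parameters are all hypothetical stand-ins for whatever LLM prompts and stopping criteria the paper actually uses.

```python
from typing import Callable, List, Sequence, Set

# Hypothetical callable types; in practice each would wrap an LLM or vision model call.
ModalitySelector = Callable[[str], Set[str]]      # dialogue -> modalities judged necessary
ClueScorer = Callable[[str, List[str]], float]    # (candidate clue, clues kept so far) -> relevance
Classifier = Callable[[str, List[str]], str]      # (dialogue, kept visual clues) -> style label


def coarse_to_fine(
    dialogue: str,
    candidate_clues: Sequence[str],
    select_modalities: ModalitySelector,
    score_clue: ClueScorer,
    classify: Classifier,
    max_rounds: int = 3,      # assumed iteration budget
    threshold: float = 0.5,   # assumed relevance cut-off
) -> str:
    """Coarse stage picks modalities; fine stage greedily collects visual clues."""
    # Coarse stage: decide from the text which modalities are needed.
    modalities = select_modalities(dialogue)

    clues: List[str] = []
    # Fine stage: runs only when vision was selected.
    if "vision" in modalities:
        remaining = list(candidate_clues)
        for _ in range(max_rounds):
            if not remaining:
                break
            # Pick the candidate that scores highest given the clues kept so far.
            best = max(remaining, key=lambda c: score_clue(c, clues))
            if score_clue(best, clues) < threshold:
                break  # nothing left that is relevant enough
            clues.append(best)
            remaining.remove(best)

    # Final prediction uses the dialogue plus only the retained visual clues.
    return classify(dialogue, clues)
```

With toy stand-ins for the three callables, a text-only decision in the coarse stage skips visual processing entirely, while a decision that includes vision passes only the highest-scoring clues to the classifier, which is the redundancy reduction the abstract refers to.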
Anthology ID:
2026.findings-acl.1466
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
29322–29337
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1466/
Cite (ACL):
Beibei Zhang, Yanan Lu, Lin Fen, and Tongwei Ren. 2026. Coarse-to-Fine Multimodal Information Selection for Video Speaking Style Recognition with Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 29322–29337, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Coarse-to-Fine Multimodal Information Selection for Video Speaking Style Recognition with Large Language Models (Zhang et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1466.pdf
Checklist:
 2026.findings-acl.1466.checklist.pdf