RATION: Entropy-Driven Task-Adaptive Visual Attention Allocation Framework for Multimodal Reasoning
Xingle Xu, Fanheng Kong, Dexian Cai, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang
Abstract
Multimodal Large Language Models (MLLMs) integrate visual encoders with Large Language Models (LLMs) to enable multimodal reasoning. However, for tasks that rely heavily on visual information, the model’s utilization of that information remains unstable, which leads to reasoning failures. Prior work mainly strengthens multimodal reasoning by improving representation alignment or increasing computation. These methods do not explicitly characterize how visual demands differ across tasks, making it difficult for the model to decide where and how strongly to attend to visual information. Consequently, visual attention allocation becomes a key factor in multimodal reasoning. To address this, we propose RATION, an entropy-driven task-adaptive visual attention allocation framework. First, we use a task routing strategy to infer the task type of each sample and identify the key layers. Then, we use visual attention entropy as a control signal to dynamically allocate attention according to task demands. Experiments show that RATION achieves consistent performance gains across diverse reasoning tasks, datasets, and models, providing a clear direction toward more reliable multimodal reasoning.
- Anthology ID:
- 2026.findings-acl.1238
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 24726–24744
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1238/
- Cite (ACL):
- Xingle Xu, Fanheng Kong, Dexian Cai, Shi Feng, Xiaocui Yang, Daling Wang, and Yifei Zhang. 2026. RATION: Entropy-Driven Task-Adaptive Visual Attention Allocation Framework for Multimodal Reasoning. In Findings of the Association for Computational Linguistics: ACL 2026, pages 24726–24744, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- RATION: Entropy-Driven Task-Adaptive Visual Attention Allocation Framework for Multimodal Reasoning (Xu et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1238.pdf
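
The abstract names visual attention entropy as the control signal but does not define it. As a minimal sketch, assuming it is the Shannon entropy of a single attention row renormalized over the visual-token positions, the quantity could be computed as below. The function names, the renormalization step, and the division by log(n) are illustrative assumptions, not the authors' implementation; how RATION maps this value to an attention adjustment is not specified here and is not shown.

```python
import math


def visual_attention_entropy(attn_row, visual_idx):
    """Shannon entropy (in nats) of attention restricted to visual tokens.

    attn_row:   attention weights of one query over all key positions
    visual_idx: positions of the visual tokens (assumed known)
    Hypothetical helper; not taken from the paper.
    """
    mass = [attn_row[i] for i in visual_idx]
    total = sum(mass)
    if total <= 0:
        return 0.0  # no attention on visual tokens: define entropy as 0
    probs = [m / total for m in mass]  # renormalize over visual tokens only
    return -sum(p * math.log(p) for p in probs if p > 0)


def normalized_visual_entropy(attn_row, visual_idx):
    """Entropy scaled to [0, 1] by its maximum, log(number of visual tokens)."""
    n = len(visual_idx)
    if n <= 1:
        return 0.0
    return visual_attention_entropy(attn_row, visual_idx) / math.log(n)


# Four visual tokens (positions 0-3) followed by one text token.
spread = [0.2, 0.2, 0.2, 0.2, 0.2]   # attention spread evenly over the image
focused = [0.8, 0.0, 0.0, 0.0, 0.2]  # attention concentrated on one patch
visual = [0, 1, 2, 3]
print(normalized_visual_entropy(spread, visual))
print(normalized_visual_entropy(focused, visual))
```

Under this assumed definition, a value near 1 indicates attention diffused across the image and a value near 0 indicates attention concentrated on a few visual tokens, which is the kind of scalar a task-dependent controller could compare against a target.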