RATION: Entropy-Driven Task-Adaptive Visual Attention Allocation Framework for Multimodal Reasoning

Xingle Xu, Fanheng Kong, Dexian Cai, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang


Abstract
Multimodal Large Language Models (MLLMs) integrate visual encoders with Large Language Models (LLMs) and enable multimodal reasoning. However, for tasks that rely heavily on visual information, the model’s use of that information remains unstable, which leads to reasoning failures. Prior work mainly strengthens multimodal reasoning by improving representation alignment or increasing computation. These methods do not explicitly characterize how visual demands differ across tasks, making it difficult for the model to decide where and how strongly to attend to visual information. Consequently, visual attention allocation becomes a key factor in multimodal reasoning. To address this, we propose RATION, an entropy-driven task-adaptive visual attention allocation framework. First, we use a task routing strategy to infer the task type of each sample and identify the key layers. Second, we use visual attention entropy as a control signal to dynamically allocate attention according to task demands. Experiments show that RATION achieves consistent performance gains across diverse reasoning tasks, datasets, and models, providing a clear direction toward more reliable multimodal reasoning.
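The abstract names visual attention entropy as the control signal but does not give its formula. As a rough illustration only, the sketch below assumes it is the Shannon entropy of one query token's attention weights after restricting them to the visual token positions and renormalizing. The function name `visual_attention_entropy` and its arguments are hypothetical and are not taken from the paper.

```python
import math


def visual_attention_entropy(attn_row, visual_positions):
    """Shannon entropy (in nats) of one query's attention over visual tokens.

    Assumption: attn_row is a list of non-negative attention weights for a
    single query, and visual_positions lists the indices of the visual tokens.
    This is an illustrative definition, not the paper's stated formula.
    """
    visual = [attn_row[i] for i in visual_positions]
    total = sum(visual)
    if total <= 0.0:
        # No attention mass on visual tokens: treat the entropy as zero.
        return 0.0
    probs = [w / total for w in visual]
    # Zero-probability terms contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Toy rows over four visual tokens: one diffuse, one concentrated.
positions = [0, 1, 2, 3]
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

h_uniform = visual_attention_entropy(uniform, positions)
h_peaked = visual_attention_entropy(peaked, positions)
```

Under this assumed definition, attention spread evenly over four visual tokens gives the maximum value ln 4 (about 1.386 nats), while attention concentrated on a single token gives a value close to zero. A quantity of this kind is one plausible way to tell diffuse from focused visual attention at a given layer.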
Anthology ID:
2026.findings-acl.1238
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
24726–24744
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1238/
Cite (ACL):
Xingle Xu, Fanheng Kong, Dexian Cai, Shi Feng, Xiaocui Yang, Daling Wang, and Yifei Zhang. 2026. RATION: Entropy-Driven Task-Adaptive Visual Attention Allocation Framework for Multimodal Reasoning. In Findings of the Association for Computational Linguistics: ACL 2026, pages 24726–24744, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
RATION: Entropy-Driven Task-Adaptive Visual Attention Allocation Framework for Multimodal Reasoning (Xu et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1238.pdf
Checklist:
 2026.findings-acl.1238.checklist.pdf