2025
Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning
Aofei Chang | Le Huang | Alex James Boyd | Parminder Bhatia | Taha Kass-Hout | Cao Xiao | Fenglong Ma
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Medical Large Vision-Language Models (Med-LVLMs) often exhibit suboptimal attention distribution on visual inputs, leading to hallucinated or inaccurate outputs. Existing methods primarily rely on inference-time interventions, which offer limited attention adaptation or require additional supervision. To address this, we propose A3Tune, a novel fine-tuning framework for Automatic Attention Alignment Tuning. A3Tune leverages zero-shot weak labels from SAM, refines them into prompt-aware labels using BioMedCLIP, and then selectively modifies visually-critical attention heads to improve alignment while minimizing interference. Additionally, we introduce an A3MoE module, enabling adaptive parameter selection for attention tuning across diverse prompts and images. Extensive experiments on medical VQA and report generation benchmarks show that A3Tune outperforms state-of-the-art baselines, achieving improved attention distributions and performance in Med-LVLMs.
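As a rough illustration of what "attention alignment" against weak region labels can mean in practice, the sketch below defines a toy loss that rewards attention heads for concentrating on weakly labeled image regions. It is an assumption-laden sketch, not the paper's implementation: the `attention_alignment_loss` helper, the tensor shapes, and the random placeholder mask are all invented for this example.

```python
# Hypothetical sketch (not the paper's code): a loss that encourages attention
# heads to place their mass on weakly labeled, prompt-relevant image patches.
import torch


def attention_alignment_loss(attn: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """attn: (num_heads, num_patches) attention from answer tokens to image patches.
    mask: (num_patches,) binary weak label (e.g., a SAM mask refined to be
    prompt-aware; 1 = relevant patch). Both are assumptions for illustration."""
    # Fraction of each head's attention that falls inside the labeled region.
    mass = (attn * mask.float()).sum(dim=-1)
    # Negative log of that fraction: minimized when heads focus on the region.
    return -mass.clamp_min(1e-8).log().mean()


# Toy usage: 8 heads over a 24x24 patch grid with a random placeholder mask.
attn = torch.softmax(torch.randn(8, 576), dim=-1)
mask = (torch.rand(576) > 0.9).long()
loss = attention_alignment_loss(attn, mask)
```

In a fine-tuning setup such as the one the abstract describes, a term like this would be applied only to the selected visually-critical heads and combined with the usual language-modeling loss; that selection and weighting is the paper's contribution and is not reproduced here.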