Adaptive Weighted Proxy Tuning: Efficient Gray-Box Steering for Image Captioning

Nafew Azim, Fuad Rahman, Nabeel Mohammed


Abstract
Adapting Large Vision-Language Models (LVLMs) to specialized domains typically requires resource-intensive fine-tuning or access to proprietary model parameters (“white-box” access). Decoding-time strategies such as Proxy Tuning offer a parameter-efficient alternative, but they rely on rigid, static logit arithmetic that does not account for instance-specific variation in model certainty or domain shift. In this work, we introduce Adaptive Weighted Proxy Tuning (AWPT), a gray-box steering framework that dynamically modulates the logit contributions of a large base model, a fine-tuned expert, and an untuned anti-expert. Unlike static approaches, AWPT introduces two instance-aware mechanisms: (1) a lightweight ViT-based Weight Predictor that performs amortized inference to estimate optimal mixing coefficients in real time with negligible added latency (0.03 s overhead), and (2) a Per-Sample Optimization objective that establishes theoretical performance bounds through gradient-based logit steering. Extensive evaluation across medical (ROCOv2, IU-Xray) and general domains (Flickr30k, MS COCO, TextCaps) demonstrates that AWPT achieves performance parity with fully fine-tuned models without modifying any of the generator's parameters. Crucially, our dynamic weighting acts as an effective regularizer, significantly reducing object hallucinations and establishing AWPT as a robust solution for deploying general-purpose LVLMs in safety-critical contexts.
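The abstract contrasts static proxy-tuning arithmetic with instance-aware weighting. The sketch below illustrates both for a single decoding step. It assumes the three models share one vocabulary and uses a hypothetical mixing rule z = α·s_base + β·s_expert − γ·s_anti, where α = β = γ = 1 recovers standard proxy tuning (base + expert − anti-expert). The function names `combine` and `per_sample_optimize`, the three-weight parameterization, and the choice to maximize the log-probability of a reference token are illustrative assumptions, not the paper's published formulation. In AWPT the inference-time weights would come from the ViT-based Weight Predictor, which is not modeled here; a fixed tuple stands in for its output.

```python
import numpy as np


def log_softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over a 1-D logit vector."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())


def combine(s_base, s_expert, s_anti, w):
    """Hypothetical weighted proxy-tuning rule (assumption, not the paper's exact form):
    alpha * base + beta * expert - gamma * anti-expert.
    w = (1, 1, 1) recovers static proxy tuning: base + (expert - anti)."""
    alpha, beta, gamma = w
    return alpha * s_base + beta * s_expert - gamma * s_anti


def per_sample_optimize(s_base, s_expert, s_anti, target, w0=(1.0, 1.0, 1.0), lr=0.1, steps=50):
    """Oracle-style sketch (assumption): gradient ascent on log p(target) w.r.t. the
    three weights. It needs a reference token, so it bounds performance rather than
    being usable at inference time."""
    w = np.array(w0, dtype=float)
    # Row k holds dz/dw_k; the combined logits are linear in w.
    dz = np.stack([s_base, s_expert, -s_anti])
    for _ in range(steps):
        p = np.exp(log_softmax(combine(s_base, s_expert, s_anti, w)))
        # d log p_target / d w_k = dz_k[target] - E_p[dz_k]
        grad = dz[:, target] - dz @ p
        w += lr * grad
    return w


if __name__ == "__main__":
    # Toy next-token logits over a 4-token vocabulary (made-up numbers).
    s_base = np.array([2.0, 1.0, 0.5, 0.0])
    s_expert = np.array([1.0, 2.5, 0.0, 0.0])
    s_anti = np.array([1.5, 0.5, 0.5, 0.0])

    static = combine(s_base, s_expert, s_anti, (1.0, 1.0, 1.0))
    # Stand-in for the Weight Predictor's output (hypothetical values).
    adaptive = combine(s_base, s_expert, s_anti, (1.0, 0.8, 0.6))
    print("static next token:", int(np.argmax(static)))
    print("adaptive next token:", int(np.argmax(adaptive)))

    w_star = per_sample_optimize(s_base, s_expert, s_anti, target=2)
    print("optimized weights:", np.round(w_star, 3))
```

Because the combined logits are linear in the weights, log p(target) is concave in (α, β, γ), so a small fixed step size increases it monotonically. A learned predictor, as described above, amortizes this per-instance search into a single forward pass.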
Anthology ID: 2026.acl-industry.85
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Yunyao Li, Georg Rehm, Mei Tu
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 1197–1217
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-industry.85/
Cite (ACL): Nafew Azim, Fuad Rahman, and Nabeel Mohammed. 2026. Adaptive Weighted Proxy Tuning: Efficient Gray-Box Steering for Image Captioning. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 1197–1217, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Adaptive Weighted Proxy Tuning: Efficient Gray-Box Steering for Image Captioning (Azim et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-industry.85.pdf