Adaptive Weighted Proxy Tuning: Efficient Gray-Box Steering for Image Captioning

Nafew Azim, Fuad Rahman, Nabeel Mohammed


Abstract
Adapting Large Vision-Language Models (LVLMs) to specialized domains typically demands resource-intensive fine-tuning or access to proprietary parameters (“white-box” access). While decoding-time strategies like Proxy Tuning offer a parameter-efficient alternative, they rely on rigid, static logit arithmetic that fails to account for instance-specific variations in model certainty and domain shift. In this work, we introduce Adaptive Weighted Proxy Tuning (AWPT), a gray-box steering framework that dynamically modulates the logit contributions of a large base model, a fine-tuned expert, and an untuned anti-expert. Unlike static approaches, AWPT introduces two instance-aware mechanisms: (1) a lightweight ViT-based Weight Predictor that performs amortized inference to estimate optimal mixing coefficients in real time with negligible added latency (0.03 s overhead), and (2) a Per-Sample Optimization objective that establishes theoretical performance bounds via gradient-based logit steering. Extensive evaluation across medical (ROCOv2, IU-Xray) and general (Flickr30k, MS COCO, TextCaps) domains demonstrates that AWPT achieves performance parity with fully fine-tuned models while remaining parameter-free with respect to the generator. Crucially, our dynamic weighting acts as an effective regularizer, significantly reducing object hallucinations and establishing AWPT as a robust solution for deploying general-purpose LVLMs in safety-critical contexts.
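The abstract does not state the exact mixing rule, so the following is only a minimal sketch of the general idea of weighted logit arithmetic over a base model, an expert, and an anti-expert. The function `steer_logits`, the coefficient names `w_base`, `w_expert`, and `w_anti`, and the toy logit values are all hypothetical and chosen for illustration; with all coefficients set to 1 the rule reduces to static Proxy Tuning-style arithmetic (base + expert - anti-expert).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_logits(base, expert, anti_expert, w_base=1.0, w_expert=1.0, w_anti=1.0):
    """Combine per-token logits from three models with scalar weights.

    Assumed form (not taken from the paper):
        steered = w_base * base + w_expert * expert - w_anti * anti_expert
    """
    if not (len(base) == len(expert) == len(anti_expert)):
        raise ValueError("all logit vectors must share the same vocabulary size")
    return [
        w_base * b + w_expert * e - w_anti * a
        for b, e, a in zip(base, expert, anti_expert)
    ]


# Toy next-token logits over a 3-token vocabulary (illustrative values only).
base = [2.0, 1.0, 0.0]
expert = [0.0, 3.0, 0.0]
anti_expert = [0.0, 1.0, 0.0]

# Static arithmetic: every coefficient fixed at 1.
static = steer_logits(base, expert, anti_expert)

# Instance-specific coefficients, e.g. as a weight predictor might output
# for one image; here they are simply hard-coded.
adaptive = steer_logits(base, expert, anti_expert, w_expert=0.5, w_anti=0.5)

static_probs = softmax(static)
adaptive_probs = softmax(adaptive)
```

In this toy case the static combination favors the second token, while the down-weighted expert and anti-expert leave the first two tokens tied, which illustrates how per-instance coefficients can change the decoded token without touching any model parameters.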
Anthology ID:
2026.acl-industry.85
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Yunyao Li, Georg Rehm, Mei Tu
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1197–1217
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.acl-industry.85/
Cite (ACL):
Nafew Azim, Fuad Rahman, and Nabeel Mohammed. 2026. Adaptive Weighted Proxy Tuning: Efficient Gray-Box Steering for Image Captioning. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), pages 1197–1217, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Adaptive Weighted Proxy Tuning: Efficient Gray-Box Steering for Image Captioning (Azim et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.acl-industry.85.pdf