Shahriyar Zaman Ridoy


2025

CAPSTONE: Composable Attribute‐Prompted Scene Translation for Zero‐Shot Vision–Language Reasoning
Md. Ismail Hossain | Shahriyar Zaman Ridoy | Moshiur Farazi | Nabeel Mohammed | Shafin Rahman
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Interpreting visual scenes with high-level reasoning is essential for many real-world applications, such as autonomous systems and content moderation, but training and maintaining Vision–Language Models (VLMs) remains resource-intensive and opaque. In this work, we present CAPSTONE, a lightweight, modular framework designed for industrial settings. Instead of relying on multimodal training or fine-tuning large models, CAPSTONE transforms outputs from off-the-shelf vision models into structured text prompts that can be interpreted by a frozen Large Language Model (LLM). This plug-and-play architecture enables reasoning over visual input without access to raw pixels, dramatically reducing computational cost and complexity. On the POPE dataset, our system, using a 7B LLM, outperforms several fully trained VLMs in zero-shot evaluations, while on the VSR benchmark the 4B model achieves competitive results; together, these demonstrate strong generalization without retraining. CAPSTONE offers a scalable and interpretable alternative for companies looking to integrate visual reasoning capabilities without the burden of full-scale VLM pipelines.
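
The abstract describes a plug-and-play pipeline: vision-model outputs are rendered as structured text and handed to a frozen LLM. The sketch below is only an illustration of that idea, not the paper's actual code; the detection format, the prompt template, and the choice of LLM (an instruction-tuned 7B model from the Hugging Face Hub) are all assumptions made for the example.

```python
from transformers import pipeline

def detections_to_prompt(detections):
    """Render hypothetical vision-model outputs as a structured scene description."""
    lines = ["Objects detected in the image:"]
    for d in detections:
        lines.append(f"- {d['label']} (confidence {d['score']:.2f}) at box {d['box']}")
    return "\n".join(lines)

# Example detections in a made-up format an off-the-shelf detector might emit.
detections = [
    {"label": "dog", "score": 0.97, "box": [34, 50, 210, 300]},
    {"label": "frisbee", "score": 0.88, "box": [180, 40, 260, 110]},
]

question = "Is there a cat in the image? Answer yes or no."
prompt = detections_to_prompt(detections) + "\n\nQuestion: " + question + "\nAnswer:"

# A frozen, off-the-shelf LLM reasons over the scene from text alone (no raw pixels).
llm = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")
print(llm(prompt, max_new_tokens=10)[0]["generated_text"])
```

Because the LLM only ever sees the structured text, the vision and language components can be swapped independently, which is the modularity the abstract emphasizes.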