ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting
Abhijit Mishra, Mingda Li, Hsiang Fu, Richard Noh, Minji Kim
Abstract
Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) enabling multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy by transmitting sensitive vision data to servers, and (2) their limited real-time, on-device usability. This paper explores **Visual Instruction Rewriting**, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs **(250M parameters)** with existing conversational AI systems, enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.- Anthology ID:
- 2025.ijcnlp-long.154
- Volume:
- Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
- Month:
- December
- Year:
- 2025
- Address:
- Mumbai, India
- Editors:
- Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
- Venues:
- IJCNLP | AACL
- SIG:
- Publisher:
- The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
- Note:
- Pages:
- 2877–2889
- Language:
- URL:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.154/
- DOI:
- Cite (ACL):
- Abhijit Mishra, Mingda Li, Hsiang Fu, Richard Noh, and Minji Kim. 2025. ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 2877–2889, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
- Cite (Informal):
- ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting (Mishra et al., IJCNLP-AACL 2025)
- PDF:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.154.pdf