Yun Zhou
2025
Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models
Sangmin Woo | Kang Zhou | Yun Zhou | Shuai Wang | Sheng Guan | Haibo Ding | Lin Lee Cheong
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Large Vision Language Models (LVLMs) often suffer from object hallucination, which undermines their reliability. Surprisingly, we find that simple object-based visual prompting—overlaying visual cues (e.g., bounding box, circle) on images—can significantly mitigate such hallucination; however, different visual prompts (VPs) vary in effectiveness. To address this, we propose Black-Box Visual Prompt Engineering (BBVPE), a framework to identify optimal VPs that enhance LVLM responses without needing access to model internals. Our approach employs a pool of candidate VPs and trains a router model to dynamically select the most effective VP for a given input image. This black-box approach is model-agnostic, making it applicable to both open-source and proprietary LVLMs. Evaluations on benchmarks such as POPE and CHAIR demonstrate that BBVPE effectively reduces object hallucination.
2020
DiPair: Fast and Accurate Distillation for Trillion-Scale Text Matching and Pair Modeling
Jiecao Chen | Liu Yang | Karthik Raman | Michael Bendersky | Jung-Jung Yeh | Yun Zhou | Marc Najork | Danyang Cai | Ehsan Emadzadeh
Findings of the Association for Computational Linguistics: EMNLP 2020
Pre-trained models like BERT (Devlin et al., 2018) have dominated NLP / IR applications such as single sentence classification, text pair classification, and question answering. However, deploying these models in real systems is highly non-trivial due to their exorbitant computational costs. A common remedy to this is knowledge distillation (Hinton et al., 2015), leading to faster inference. However – as we show here – existing works are not optimized for dealing with pairs (or tuples) of texts. Consequently, they are either not scalable or demonstrate subpar performance. In this work, we propose DiPair, a novel framework for distilling fast and accurate models on text pair tasks. Coupled with an end-to-end training strategy, DiPair is both highly scalable and offers improved quality-speed tradeoffs. Empirical studies conducted on both academic and real-world e-commerce benchmarks demonstrate the efficacy of the proposed approach with speedups of over 350x and minimal quality drop relative to the cross-attention teacher BERT model.