Shichao Pei


2024

RAt: Injecting Implicit Bias for Text-To-Image Prompt Refinement Models
Ziyi Kou | Shichao Pei | Meng Jiang | Xiangliang Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Text-to-image prompt refinement (T2I-Refine) aims to rephrase or extend an input prompt with more descriptive details that can be leveraged to generate images with higher quality. In this paper, we study an adversarial prompt attacking problem for T2I-Refine, where the goal is to implicitly inject specific concept bias into the input prompts during the refinement process so that the generated images, still with higher quality, are explicitly biased toward the target group. Our study is motivated by a limitation of current T2I-Refine research, which lacks exploration of the potential capacity of T2I-Refine models to provide prompt refinement service in a biased or advertising manner. To address this limitation, we develop RAt, a prompt refinement and attacking framework that attacks input prompts with intentionally selected adversarial replacements by optimizing a token distribution matrix based on the text-to-image finetuning strategy with a token-level bias obfuscation loss as regularization. We evaluate RAt on a large-scale text-to-image dataset with various concepts as targets in both in-domain and transfer-domain scenarios. The evaluation results demonstrate that, compared to other T2I-Refine schemes, RAt is well capable of implicitly attacking input prompts to generate images with higher quality and explicit visual bias toward a specific concept group.

Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models
Shengzhi Li | Rongyu Lin | Shichao Pei
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multi-modal large language models (MLLMs) are expected to support multi-turn queries of interchanging image and text modalities in production. However, current MLLMs trained with visual-question-answering (VQA) datasets can suffer from degradation, as VQA datasets lack the diversity and complexity of the original text instruction datasets with which the underlying language model was trained. To address this degradation, we first collect a lightweight, 5k-sample VQA preference dataset in which answers were annotated by Gemini for five quality metrics in a granular fashion, and we investigate standard Supervised Fine-tuning, rejection sampling, Direct Preference Optimization (DPO), and SteerLM algorithms. Our findings indicate that with DPO, we can surpass the instruction-following capabilities of the language model, achieving a 6.73 score on MT-Bench, compared to Vicuna’s 6.57 and LLaVA’s 5.99. This enhancement in textual instruction-following capability correlates with boosted visual instruction performance (+4.9% on MM-Vet, +6% on LLaVA-Bench), with minimal alignment tax on visual knowledge benchmarks compared to the previous RLHF approach. In conclusion, we propose a distillation-based multi-modal alignment model with fine-grained annotations on a small dataset that restores and boosts the MLLM’s language capability after visual instruction tuning.

2023

Few-shot Low-resource Knowledge Graph Completion with Reinforced Task Generation
Shichao Pei | Qiannan Zhang | Xiangliang Zhang
Findings of the Association for Computational Linguistics: ACL 2023

Despite becoming a prevailing paradigm for organizing knowledge, most knowledge graphs (KGs) suffer from the low-resource issue due to the deficiency of data sources. The enrichment of KGs by automatic knowledge graph completion is impeded by the intrinsic long-tail property of KGs. Despite their popularity, existing few-shot learning-based models have difficulty alleviating the impact of the long-tail issue on low-resource KGs because of the lack of training tasks. To tackle the challenging long-tail issue in low-resource KG completion, we propose a novel few-shot low-resource knowledge graph completion framework composed of three components: a few-shot learner, a task generator, and a task selector. The key idea is to generate and then select beneficial few-shot tasks that complement the current tasks, and to optimize the few-shot learner using the selected tasks. Extensive experiments conducted on several real-world knowledge graphs validate the effectiveness of our proposed method.