2024
pdf
abs
Can Large Multimodal Models Uncover Deep Semantics Behind Images?
Yixin Yang
|
Zheng Li
|
Qingxiu Dong
|
Heming Xia
|
Zhifang Sui
Findings of the Association for Computational Linguistics ACL 2024
Understanding the deep semantics of images is essential in the era dominated by social media. However, current research works primarily on the superficial description of images, revealing a notable deficiency in the systematic investigation of the inherent deep semantics. In this work, we introduce DEEPEVAL, a comprehensive benchmark to assess Large Multimodal Models’ (LMMs) capacities of visual deep semantics. DEEPEVAL includes human-annotated dataset and three progressive subtasks: fine-grained description selection, in-depth title matching, and deep semantics understanding. Utilizing DEEPEVAL, we evaluate 9 open-source LMMs and GPT-4V(ision). Our evaluation demonstrates a substantial gap between the deep semantic comprehension capabilities of existing LMMs and humans. For example, GPT-4V is 30% behind humans in understanding deep semantics, even though it achieves human-comparable performance in image description. Further analysis reveals that LMM performance on DEEPEVAL varies according to the specific facets of deep semantics explored, indicating the fundamental challenges remaining in developing LMMs.
pdf
abs
Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
Heming Xia
|
Zhe Yang
|
Qingxiu Dong
|
Peiyi Wang
|
Yongqi Li
|
Tao Ge
|
Tianyu Liu
|
Wenjie Li
|
Zhifang Sui
Findings of the Association for Computational Linguistics ACL 2024
To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts several future tokens efficiently and then verifies them in parallel. Unlike autoregressive decoding, Speculative Decoding facilitates the simultaneous decoding of multiple tokens per step, thereby accelerating inference. This paper presents a comprehensive overview and analysis of this promising decoding paradigm. We begin by providing a formal definition and formulation of Speculative Decoding. Then, we organize in-depth discussions on its key facets, such as drafter selection and verification strategies. Furthermore, we present a comparative analysis of leading methods under third-party testing environments. We aim for this work to serve as a catalyst for further research on Speculative Decoding, ultimately contributing to more efficient LLM inference.
2023
pdf
abs
Enhancing Continual Relation Extraction via Classifier Decomposition
Heming Xia
|
Peiyi Wang
|
Tianyu Liu
|
Binghuai Lin
|
Yunbo Cao
|
Zhifang Sui
Findings of the Association for Computational Linguistics: ACL 2023
Continual relation extraction (CRE) models aim at handling emerging new relations while avoiding catastrophically forgetting old ones in the streaming data. Though improvements have been shown by previous CRE studies, most of them only adopt a vanilla strategy when models first learn representations of new relations. In this work, we point out that there exist two typical biases after training of this vanilla strategy: classifier bias and representation bias, which causes the previous knowledge that the model learned to be shaded. To alleviate those biases, we propose a simple yet effective classifier decomposition framework that splits the last FFN layer into separated previous and current classifiers, so as to maintain previous knowledge and encourage the model to learn more robust representations at this training stage. Experimental results on two standard benchmarks show that our proposed framework consistently outperforms the state-of-the-art CRE models, which indicates that the importance of the first training stage to CRE models may be underestimated. Our code will be released upon acceptance.
pdf
abs
ImageNetVC: Zero- and Few-Shot Visual Commonsense Evaluation on 1000 ImageNet Categories
Heming Xia
|
Qingxiu Dong
|
Lei Li
|
Jingjing Xu
|
Tianyu Liu
|
Ziwei Qin
|
Zhifang Sui
Findings of the Association for Computational Linguistics: EMNLP 2023
Recently, Large Language Models (LLMs) have been serving as general-purpose interfaces, posing a significant demand for comprehensive visual knowledge. However, it remains unclear how well current LLMs and their visually augmented counterparts (VaLMs) can master visual commonsense knowledge. To investigate this, we propose ImageNetVC, a human-annotated dataset specifically designed for zero- and few-shot visual commonsense evaluation across 1,000 ImageNet categories. Utilizing ImageNetVC, we benchmark the fundamental visual commonsense knowledge of both unimodal LLMs and VaLMs. Furthermore, we analyze the factors affecting the visual commonsense knowledge of large-scale models, providing insights into the development of language models enriched with visual commonsense knowledge. Our code and dataset are available at https://github.com/hemingkx/ImageNetVC.
pdf
abs
Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation
Heming Xia
|
Tao Ge
|
Peiyi Wang
|
Si-Qing Chen
|
Furu Wei
|
Zhifang Sui
Findings of the Association for Computational Linguistics: EMNLP 2023
We propose Speculative Decoding (SpecDec), for the first time ever, to formally study exploiting the idea of speculative execution to accelerate autoregressive (AR) decoding. Speculative Decoding has two innovations: Spec-Drafter – an independent model specially optimized for efficient and accurate drafting – and Spec-Verification – a reliable method for verifying the drafted tokens efficiently in the decoding paradigm. Experimental results on various seq2seq tasks including machine translation and abstractive summarization show our approach can achieve around 5x speedup for the popular Transformer architectures with comparable generation quality to beam search decoding, refreshing the impression that the draft-then-verify paradigm introduces only 1.4x~2x speedup. In addition to the remarkable speedup, we also demonstrate 3 additional advantages of SpecDec, revealing its practical value for accelerating generative models in real-world applications. Our models and codes are available at https://github.com/hemingkx/SpecDec.
pdf
abs
Bi-Drop: Enhancing Fine-tuning Generalization via Synchronous sub-net Estimation and Optimization
Shoujie Tong
|
Heming Xia
|
Damai Dai
|
Runxin Xu
|
Tianyu Liu
|
Binghuai Lin
|
Yunbo Cao
|
Zhifang Sui
Findings of the Association for Computational Linguistics: EMNLP 2023
Pretrained language models have achieved remarkable success in natural language understanding. However, fine-tuning pretrained models on limited training data tends to overfit and thus diminish performance. This paper presents Bi-Drop, a fine-tuning strategy that selectively updates model parameters using gradients from various sub-nets dynamically generated by dropout. The sub-net estimation of Bi-Drop is performed in an in-batch manner, so it overcomes the problem of hysteresis in sub-net updating, which is possessed by previous methods that perform asynchronous sub-net estimation. Also, Bi-Drop needs only one mini-batch to estimate the sub-net so it achieves higher utility of training data. Experiments on the GLUE benchmark demonstrate that Bi-Drop consistently outperforms previous fine-tuning methods. Furthermore, empirical results also show that Bi-Drop exhibits excellent generalization ability and robustness for domain transfer, data imbalance, and low-resource scenarios.
2022
pdf
abs
Premise-based Multimodal Reasoning: Conditional Inference on Joint Textual and Visual Clues
Qingxiu Dong
|
Ziwei Qin
|
Heming Xia
|
Tian Feng
|
Shoujie Tong
|
Haoran Meng
|
Lin Xu
|
Zhongyu Wei
|
Weidong Zhan
|
Baobao Chang
|
Sujian Li
|
Tianyu Liu
|
Zhifang Sui
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
It is a common practice for recent works in vision language cross-modal reasoning to adopt a binary or multi-choice classification formulation taking as input a set of source image(s) and textual query. In this work, we take a sober look at such an “unconditional” formulation in the sense that no prior knowledge is specified with respect to the source image(s). Inspired by the designs of both visual commonsense reasoning and natural language inference tasks, we propose a new task termed “Premise-based Multi-modal Reasoning” (PMR) where a textual premise is the background presumption on each source image. The PMR dataset contains 15,360 manually annotated samples which are created by a multi-phase crowd-sourcing process. With selected high-quality movie screenshots and human-curated premise templates from 6 pre-defined categories, we ask crowd-source workers to write one true hypothesis and three distractors (4 choices) given the premise and image through a cross-check procedure.