2025
FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
Weilin Zhao | Tengyu Pan | Xu Han | Yudi Zhang | Sun Ao | Yuxiang Huang | Kaihuo Zhang | Weilun Zhao | Yuxuan Li | Jie Zhou | Hao Zhou | Jianyong Wang | Maosong Sun | Zhiyuan Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12× speedup over the state-of-the-art speculative sampling method EAGLE-2. Code is available at https://github.com/thunlp/FR-Spec.
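The draft-stage saving described above comes from running the LM head over only a frequency-prioritized slice of the vocabulary, while verification still scores drafted tokens over the full vocabulary, which is why the final output distribution is unchanged. Below is a minimal PyTorch sketch of that idea; the function name, tensor shapes, the freq_rank input, and the subset size are all illustrative assumptions, not the authors' actual implementation.

    # Sketch of frequency-ranked drafting: compute draft logits only over
    # the most frequent tokens instead of the full 128k-token vocabulary.
    # All names and shapes here are assumptions for illustration.
    import torch

    def draft_logits_fr(hidden, lm_head_weight, freq_rank, subset_size=32_000):
        """hidden:         (batch, hidden_dim) last hidden state of the draft model
           lm_head_weight: (vocab_size, hidden_dim) full LM head weight
           freq_rank:      (vocab_size,) token ids sorted by corpus frequency, most frequent first
           subset_size:    number of high-frequency tokens the draft search may propose"""
        subset_ids = freq_rank[:subset_size]            # high-frequency token ids
        subset_weight = lm_head_weight[subset_ids]      # (subset_size, hidden_dim)
        subset_logits = hidden @ subset_weight.T        # cheap partial LM-head matmul
        return subset_ids, subset_logits

    # The verifier (target model) still evaluates drafted tokens with its full
    # LM head, so accepted tokens follow the unmodified target distribution.

In this sketch the draft LM-head matmul shrinks from vocab_size to subset_size rows, which is where the reported reduction in LM Head computation would come from.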
Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models
Jie Liu | Wenxuan Wang | Su Yihang | Jingyuan Huang | Yudi Zhang | Cheng-Yi Li | Wenting Chen | Xiaohan Xing | Kao-Jung Chang | Linlin Shen | Michael R. Lyu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Significant breakthroughs in Medical Multi-Modal Large Language Models (Med-MLLMs) are renovating modern healthcare with robust information synthesis and medical decision support. However, these models are often evaluated on benchmarks that are unsuitable for Med-MLLMs, given the intricate nature of real-world diagnostic frameworks, which encompass diverse medical specialties and involve complex clinical decisions. Thus, a clinically representative benchmark is highly desirable for credible Med-MLLM evaluation. To this end, we introduce Asclepius, a novel Med-MLLM benchmark that comprehensively assesses Med-MLLMs in terms of distinct medical specialties (cardiovascular, gastroenterology, etc.) and different diagnostic capacities (perception, disease analysis, etc.). Grounded in 3 proposed core principles, Asclepius ensures a comprehensive evaluation by encompassing 15 medical specialties, stratified into 3 main categories and 8 sub-categories of clinical tasks, and avoiding overlap with existing VQA datasets. We further provide an in-depth analysis of 6 Med-MLLMs and compare them with 3 human specialists, providing insights into their competencies and limitations in various medical contexts. Our work not only advances the understanding of Med-MLLMs’ capabilities but also sets a precedent for future evaluations and the safe deployment of these models in clinical environments.
2023
MMSD2.0: Towards a Reliable Multi-modal Sarcasm Detection System
Libo Qin | Shijue Huang | Qiguang Chen | Chenran Cai | Yudi Zhang | Bin Liang | Wanxiang Che | Ruifeng Xu
Findings of the Association for Computational Linguistics: ACL 2023
Multi-modal sarcasm detection has attracted much recent attention. Nevertheless, the existing benchmark (MMSD) has some shortcomings that hinder the development of reliable multi-modal sarcasm detection systems: (1) there are spurious cues in MMSD, leading to biased model learning; (2) the negative samples in MMSD are not always reasonable. To solve these issues, we introduce MMSD2.0, a corrected dataset that fixes the shortcomings of MMSD by removing spurious cues and re-annotating unreasonable samples. Meanwhile, we present a novel framework called multi-view CLIP that is capable of leveraging multi-grained cues from multiple perspectives (i.e., text, image, and text-image interaction views) for multi-modal sarcasm detection. Extensive experiments show that MMSD2.0 is a valuable benchmark for building reliable multi-modal sarcasm detection systems and that multi-view CLIP significantly outperforms the previous best baselines.
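The multi-view CLIP framework described above combines a text view, an image view, and a text-image interaction view into one sarcasm decision. The following minimal PyTorch sketch shows one way such a multi-view head could be wired; the module names, feature dimensions, and the simple average fusion are assumptions for illustration, not the paper's exact architecture.

    # Sketch of a multi-view classification head over pooled CLIP features.
    # Names, dimensions, and the fusion rule are illustrative assumptions.
    import torch
    import torch.nn as nn

    class MultiViewSarcasmHead(nn.Module):
        def __init__(self, text_dim=512, image_dim=512, num_classes=2):
            super().__init__()
            self.text_head = nn.Linear(text_dim, num_classes)                     # text-only view
            self.image_head = nn.Linear(image_dim, num_classes)                   # image-only view
            self.interaction_head = nn.Linear(text_dim + image_dim, num_classes)  # text-image interaction view

        def forward(self, text_feat, image_feat):
            # text_feat / image_feat: pooled CLIP features, e.g. (batch, 512) each
            logits_text = self.text_head(text_feat)
            logits_image = self.image_head(image_feat)
            logits_joint = self.interaction_head(torch.cat([text_feat, image_feat], dim=-1))
            # average fusion of the three views (one possible choice)
            return (logits_text + logits_image + logits_joint) / 3

Keeping a separate prediction per view lets each modality contribute its own cue before fusion, which is the intuition behind leveraging multi-grained cues from multiple perspectives.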