Yudi Zhang


2025

FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
Weilin Zhao | Tengyu Pan | Xu Han | Yudi Zhang | Ao Sun | Yuxiang Huang | Kaihuo Zhang | Weilun Zhao | Yuxuan Li | Jie Zhou | Hao Zhou | Jianyong Wang | Zhiyuan Liu | Maosong Sun
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12× speedup over the state-of-the-art speculative sampling method EAGLE-2. Code is available at https://github.com/thunlp/FR-Spec.
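
The idea of restricting the draft model's LM head to a frequency-ranked token subset can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that); the function names, the toy 8-token vocabulary, and the toy corpus are all hypothetical. It assumes that LM head cost scales with the number of vocabulary rows scored, so keeping a quarter of the rows skips 75% of them. The verification step by the full model, which is what preserves the final output distribution, is not shown.

```python
from collections import Counter

def top_frequency_ids(corpus_token_ids, keep):
    """Return the `keep` most frequent token ids, most frequent first."""
    counts = Counter(corpus_token_ids)
    # Sort by descending count, then ascending id, so ties are deterministic.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:keep]

def draft_logits(hidden, lm_head, subset_ids):
    """Score only the LM head rows listed in subset_ids (one dot product each)."""
    return [sum(h * w for h, w in zip(hidden, lm_head[t])) for t in subset_ids]

def draft_argmax(hidden, lm_head, subset_ids):
    """Pick the best draft token and return it as a full-vocabulary id."""
    logits = draft_logits(hidden, lm_head, subset_ids)
    best = max(range(len(logits)), key=logits.__getitem__)
    return subset_ids[best]

# Hypothetical toy setup: vocabulary of 8 tokens, hidden size 2.
lm_head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0],
           [0.0, -1.0], [2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
corpus = [2, 2, 2, 5, 5, 1, 7, 2, 5, 1]

subset = top_frequency_ids(corpus, keep=2)        # a quarter of the vocabulary
token = draft_argmax([1.0, 0.5], lm_head, subset)  # draft token, full-vocab id
rows_skipped = 1 - len(subset) / len(lm_head)      # fraction of LM head rows not scored
```

In this sketch the draft step can only propose tokens from the frequent subset, while the returned id still refers to the full vocabulary, so a full-vocabulary verifier could accept or reject it unchanged.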

Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models
Jie Liu | Wenxuan Wang | Su Yihang | Jingyuan Huang | Yudi Zhang | Cheng-Yi Li | Wenting Chen | Xiaohan Xing | Kao-Jung Chang | Linlin Shen | Michael R. Lyu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The significant breakthroughs of Medical Multi-Modal Large Language Models (Med-MLLMs) are renovating modern healthcare with robust information synthesis and medical decision support. However, these models are often evaluated on benchmarks that are unsuitable for Med-MLLMs because of the intricate nature of real-world diagnostic frameworks, which encompass diverse medical specialties and involve complex clinical decisions. Thus, a clinically representative benchmark is highly desirable for credible Med-MLLM evaluation. To this end, we introduce Asclepius, a novel Med-MLLM benchmark that comprehensively assesses Med-MLLMs in terms of distinct medical specialties (cardiovascular, gastroenterology, etc.) and different diagnostic capacities (perception, disease analysis, etc.). Grounded in 3 proposed core principles, Asclepius ensures a comprehensive evaluation by encompassing 15 medical specialties, stratifying clinical tasks into 3 main categories and 8 sub-categories, and avoiding overlap with the existing VQA dataset. We further provide an in-depth analysis of 6 Med-MLLMs and compare them with 3 human specialists, providing insights into their competencies and limitations in various medical contexts. Our work not only advances the understanding of Med-MLLMs’ capabilities but also sets a precedent for future evaluations and the safe deployment of these models in clinical environments.

Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving
Shunfeng Zheng | Yudi Zhang | Meng Fang | Zihan Zhang | Zhitan Wu | Mykola Pechenizkiy | Ling Chen
Findings of the Association for Computational Linguistics: EMNLP 2025

Retrieval-augmented generation (RAG) with foundation models has achieved strong performance across diverse tasks, but their capacity for expert-level reasoning—such as solving Olympiad-level physics problems—remains largely unexplored. Inspired by the way students prepare for competitions by reviewing past problems, we investigate the potential of RAG to enhance physics reasoning in foundation models. We introduce PhoPile, a high-quality multimodal dataset specifically designed for Olympiad-level physics, enabling systematic study of retrieval-based reasoning. PhoPile includes diagrams, graphs, and equations, capturing the inherently multimodal nature of physics problem solving. Using PhoPile, we benchmark RAG-augmented foundation models, covering both large language models (LLMs) and large multimodal models (LMMs) with multiple retrievers. Our results demonstrate that integrating retrieval with physics corpora can improve model performance, while also highlighting challenges that motivate further research in retrieval-augmented physics reasoning.

2023

MMSD2.0: Towards a Reliable Multi-modal Sarcasm Detection System
Libo Qin | Shijue Huang | Qiguang Chen | Chenran Cai | Yudi Zhang | Bin Liang | Wanxiang Che | Ruifeng Xu
Findings of the Association for Computational Linguistics: ACL 2023

Multi-modal sarcasm detection has attracted much recent attention. Nevertheless, the existing benchmark (MMSD) has some shortcomings that hinder the development of reliable multi-modal sarcasm detection systems: (1) there are some spurious cues in MMSD, which lead to biased model learning; (2) the negative samples in MMSD are not always reasonable. To solve the aforementioned issues, we introduce MMSD2.0, a corrected dataset that fixes the shortcomings of MMSD by removing the spurious cues and re-annotating the unreasonable samples. Meanwhile, we present a novel framework called multi-view CLIP that is capable of leveraging multi-grained cues from multiple perspectives (i.e., text, image, and text-image interaction views) for multi-modal sarcasm detection. Extensive experiments show that MMSD2.0 is a valuable benchmark for building reliable multi-modal sarcasm detection systems and that multi-view CLIP can significantly outperform the previous best baselines.