Mi Zhang

2025

Electrocardiogram (ECG) is the primary non-invasive diagnostic tool for monitoring cardiac conditions and is crucial in assisting clinicians. Recent studies have concentrated on classifying cardiac conditions using ECG data but have overlooked ECG report generation, which is time-consuming and requires clinical expertise. To automate ECG report generation and ensure its versatility, we propose the Multimodal ECG Instruction Tuning (MEIT) framework, the first attempt to tackle ECG report generation with LLMs and multimodal instructions. To facilitate future research, we establish a benchmark to evaluate MEIT with various LLMs backbones across two large-scale ECG datasets. Our approach uniquely aligns the representations of the ECG signal and the report, and we conduct extensive experiments to benchmark MEIT with nine open-source LLMs using more than 800,000 ECG reports. MEIT’s results underscore the superior performance of instruction-tuned LLMs, showcasing their proficiency in quality report generation, zero-shot capabilities, resilience to signal perturbation, and alignment with human expert evaluation. These findings emphasize the efficacy of our MEIT framework and its potential for real-world clinical application.

Automatic radiology report generation holds significant potential to streamline the labor-intensive process of report writing by radiologists, particularly for 3D radiographs such as CT scans. While CT scans are critical for clinical diagnostics, they remain less explored compared to 2D radiographs. To date, there has been no comprehensive benchmark for 3D radiograph report generation (3DRRG), nor sufficient investigation into the optimal training strategies for Vision Language Models (VLMs) in this context, particularly with respect to vision encoder choices, visual token compression, and model scaling.In this work, we make two three contributions. We curate CT-3DRRG, the largest publicly available 3D CT-report dataset, establishing a robust and diverse benchmark for evaluating VLM performance on 3DRRG. Furthermore, we propose a comprehensive training recipe for building high-performing VLMs for 3DRRG, exploring key factors such as vision encoder pretraining strategies, visual token compression, and the impact of data & model scale. Guided by these findings, we introduce Argus, a state-of-the-art family of VLMs that achieve superior performance across different model sizes and input 3D medical image resolutions, efficiently processing high-resolution 3D images up to 512 × 512 × 256.

Large vision-language models (VLMs) have demonstrated remarkable abilities in understanding everyday content. However, their performance in the domain of art, particularly culturally rich art forms, remains less explored. As a pearl of human wisdom and creativity, art encapsulates complex cultural narratives and symbolism. In this paper, we offer the Pun Rebus Art Dataset, a multimodal dataset for art understanding deeply rooted in traditional Chinese culture. We focus on three primary tasks: identifying salient visual elements, matching elements with their symbolic meanings, and explanations for the conveyed messages. Our evaluation reveals that state-of-the-art VLMs struggle with these tasks, often providing biased and hallucinated explanations and showing limited improvement through in-context learning. By releasing the Pun Rebus Art Dataset, we aim to facilitate the development of VLMs that can better understand and interpret culturally specific content, promoting greater inclusiveness beyond English-based corpora. The dataset and evaluation code are available at [this link](https://github.com/zhang-tuo-pdf/Pun-Rebus-Art-Benchmark).

pdf bib abs
MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference
Zhongwei Wan | Hui Shen | Xin Wang | Che Liu | Zheda Mai | Mi Zhang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Long-context Multimodal Large Language Models (MLLMs) that incorporate long text-image and text-video modalities, demand substantial computational resources as their multimodal Key-Value (KV) cache grows with increasing input lengths, challenging memory and time efficiency. For multimodal scenarios, the cross-modal interactions inevitablely increase complexity, and prior methods for KV cache compression, in both text-only and multimodal LLMs, have neglected attention density variations across layers, often adopting uniform or progressive reduction strategis for layer-wise cache allocation. This results in precision loss and suboptimal performance. We propose MEDA, a novel approach specifically designed for the complexities of multimodal settings, dynamically allocating KV cache sizes based on attention entropy to better adapt to multimodal interactions.Through a dynamic multimodal KV cache allocation strategy, MEDA compresses the KV cache, adaptively retains sufficient multimodal information at each layer. Meanwhile, to mitigate the degradation of contextual information due to cache compression, we also integrate KV pairs merging techniques to maintain coherence. MEDA achieves up to 72% KV cache memory reduction and 2.82 faster decoding speeds in some cases, while maintaining or enhancing performance on various multimodal tasks in a long context, including multi-image and long video scenarios.

pdf bib abs
SVD-LLM V2: Optimizing Singular Value Truncation for Large Language Model Compression
Xin Wang | Samiul Alam | Zhongwei Wan | Hui Shen | Mi Zhang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Despite significant advancements, the practical deployment of Large Language Models (LLMs) is often hampered by their immense sizes, highlighting the need for effective compression techniques. Singular Value Decomposition (SVD) emerges as a promising method for compressing LLMs. However, existing SVD-based compression approaches suffer from substantial truncation losses, leading to severe performance degradation in compressed models. In this work, we introduce , a novel SVD-based LLM compression method that optimizes singular value truncation in SVD compression with two key strategies. First, employs dynamic compression ratio allocation to effectively balance the extremely large truncation loss across different layers. Second, it implements loss-optimized weight truncation to ensure that the truncated singular values result in a lower and more stable truncation loss in practice. We evaluate on ten datasets and five models on various scales and demonstrated that outperforms current state-of-the-art methods. The source code is available at https://github.com/AIoT-MLSys-Lab/SVD-LLM.

2023

pdf bib abs
SlowBERT: Slow-down Attacks on Input-adaptive Multi-exit BERT
Shengyao Zhang | Xudong Pan | Mi Zhang | Min Yang
Findings of the Association for Computational Linguistics: ACL 2023

For pretrained language models such as Google’s BERT, recent research designs several input-adaptive inference mechanisms to improve the efficiency on cloud and edge devices. In this paper, we reveal a new attack surface on input-adaptive multi-exit BERT, where the adversary imperceptibly modifies the input texts to drastically increase the average inference cost. Our proposed slow-down attack called SlowBERT integrates a new rank-and-substitute adversarial text generation algorithm to efficiently search for the perturbation which maximally delays the exiting time. With no direct access to the model internals, we further devise a time-based approximation algorithm to infer the exit position as the loss oracle. Our extensive evaluation on two popular instances of multi-exit BERT for GLUE classification tasks validates the effectiveness of SlowBERT. In the worst case, SlowBERT increases the inference cost by 4.57×, which would strongly hurt the service quality of multi-exit BERT in practice, e.g., increasing the real-time cloud services’ response times for online users.

2020

pdf bib abs
Convolution over Hierarchical Syntactic and Lexical Graphs for Aspect Level Sentiment Analysis
Mi Zhang | Tieyun Qian
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The state-of-the-art methods in aspect-level sentiment classification have leveraged the graph based models to incorporate the syntactic structure of a sentence. While being effective, these methods ignore the corpus level word co-occurrence information, which reflect the collocations in linguistics like “nothing special”. Moreover, they do not distinguish the different types of syntactic dependency, e.g., a nominal subject relation “food-was” is treated equally as an adjectival complement relation “was-okay” in “food was okay”. To tackle the above two limitations, we propose a novel architecture which convolutes over hierarchical syntactic and lexical graphs. Specifically, we employ a global lexical graph to encode the corpus level word co-occurrence information. Moreover, we build a concept hierarchy on both the syntactic and lexical graphs for differentiating various types of dependency relations or lexical word pairs. Finally, we design a bi-level interactive graph convolution network to fully exploit these two graphs. Extensive experiments on five bench- mark datasets show that our method outperforms the state-of-the-art baselines.