Wenjia Bai

2025

Recent advancements in multimodal representation learning for electrocardiogram (ECG) have moved onto learning representations by aligning ECG signals with their paired free-text reports. However, current methods often result in suboptimal alignment of ECG signals with their corresponding text reports, thereby limiting diagnostic accuracy. This is primarily due to the complexity and unstructured nature of medical language, which makes it challenging to effectively align ECG signals with the corresponding text reports. Additionally, these methods are unable to handle arbitrary combinations of ECG leads as inputs, which poses a challenge since 12-lead ECGs may not always be available in under-resourced clinical environments.In this work, we propose the **Knowledge-enhanced Multimodal ECG Representation Learning (K-MERL)** framework to address these challenges. K-MERL leverages large language models (LLMs) to extract structured knowledge from free-text reports, enhancing the effectiveness of ECG multimodal learning. Furthermore, we design a lead-aware ECG encoder to capture lead-specific spatial-temporal characteristics of 12-lead ECGs, with dynamic lead masking. This novel encoder allows our framework to handle arbitrary lead inputs, rather than being limited to a fixed set of full 12 leads, which existing methods necessitate.We evaluate K-MERL on six external ECG datasets and demonstrate its superior capability. K-MERL not only outperforms all existing methods in zero-shot classification and linear probing tasks using 12 leads, but also achieves state-of-the-art (SOTA) results in partial-lead settings, with an average improvement of **16%** in AUC score on zero-shot classification compared to previous SOTA multimodal methods. All data and code will be released upon acceptance.

2024

pdf bib abs
BiCAL: Bi-directional Contrastive Active Learning for Clinical Report Generation
Tianyi Wu | Jingqing Zhang | Wenjia Bai | Kai Sun
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing

State-of-the-art performance by large pre-trained models in computer vision (CV) and natural language processing (NLP) suggests their potential for domain-specific tasks. However, training these models requires vast amounts of labelled data, a challenge in many domains due to the cost and expertise required for data labelling. Active Learning (AL) can mitigate this by selecting minimal yet informative data for model training. While AL has been mainly applied to single-modal tasks in the fields of NLP and CV, its application in multi-modal tasks remains underexplored. In this work, we proposed a novel AL strategy, Bidirectional Contrastive Active Learning strategy (BiCAL), that used both image and text latent spaces to identify contrastive samples to select batches to query for labels. BiCAL was robust to class imbalance data problems by its design, which is a problem that is commonly seen in training domain-specific models. We assessed BiCAL’s performance in domain-specific learning on the clinical report generation tasks from chest X-ray images. Our experiments showed that BiCAL outperforms State-of-the-art methods in clinical efficacy metrics, improving recall by 2.4% and F1 score by 9.5%, showcasing its effectiveness in actively training domain-specific multi-modal models.

Co-authors

Haozhe Wang 1

Tianyi Wu 1

Jingqing Zhang 1

Venues

Fix author