Early Fusion with Contrastive Learning: A Lightweight Alternative for Multi-modal Classification

Felix Wernlein, Abhik Jana, Sandipan Sikdar


Abstract
As modalities such as text, image, and audio have proliferated, the use of multimodal systems has increased significantly. However, a major challenge for such systems is effectively aligning and integrating diverse modalities. Several models have been proposed to address this challenge; however, state-of-the-art performance is achieved only by complex, heavyweight models (complexity measured in terms of trainable parameters). Hence, we propose a simple yet effective lightweight framework explicitly designed for multimodal classification tasks, which combines early fusion with a contrastive learning approach. Early fusion merges the different modalities at the input level, whereas contrastive learning allows a single modality to capture intra-modality relationships. Experiments on three multimodal classification datasets of different genres demonstrate that the proposed lightweight framework achieves performance comparable to the most competitive heavyweight state-of-the-art models and, in some cases, even outperforms them.
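To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that early fusion amounts to concatenating per-modality input feature vectors, and it uses an InfoNCE-style loss as a stand-in for the contrastive objective; the abstract specifies neither choice. The function names (early_fuse, cosine, info_nce), the temperature value, and the toy vectors are all hypothetical.

```python
import math

def early_fuse(text_feats, image_feats):
    """Early fusion (assumed here to be concatenation at the input level)."""
    return list(text_feats) + list(image_feats)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in for the paper's objective).

    The loss is small when the anchor is more similar to the positive
    than to any of the negatives, and large otherwise.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy feature vectors, made up for illustration only.
fused = early_fuse([0.2, 0.7], [0.1, 0.9, 0.4])

# Same-modality samples: the positive lies close to the anchor ...
loss_good = info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]])
# ... versus a case where the positive is far away and the negative is close.
loss_bad = info_nce([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1]])
```

In this sketch the fused vector would feed a single downstream classifier, while the contrastive term is computed within one modality, so samples that belong together are pulled closer than unrelated ones. How the paper weights or combines the two is not stated in the abstract.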
Anthology ID:
2026.lrec-main.717
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
9129–9138
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.717/
Cite (ACL):
Felix Wernlein, Abhik Jana, and Sandipan Sikdar. 2026. Early Fusion with Contrastive Learning: A Lightweight Alternative for Multi-modal Classification. International Conference on Language Resources and Evaluation, main:9129–9138.
Cite (Informal):
Early Fusion with Contrastive Learning: A Lightweight Alternative for Multi-modal Classification (Wernlein et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.717.pdf