BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices

Euhid Aman, Esteban Carlin, Hsing-Kuo Kenneth Pao, Giovanni Beltrame, Ghaluh Indah Permata Sari, Yie-Tarng Chen


Abstract
Cross-attention transformers and other multimodal vision-language models excel at grounding and generation, but their large, full-precision backbones make them difficult to deploy on edge devices. Memory-augmented architectures improve the use of past context, yet prior work rarely pairs them with aggressive, edge-oriented quantization. We introduce BitMar, a quantized multimodal transformer that augments image-text generation with an external, human-like episodic memory for effective operation on resource-constrained hardware. BitMar uses 1.58-bit encoders, one for text (BitNet-style) and one for vision (DINOv2-based), to produce compact embeddings that are fused and used to query a fixed-size key-value episodic memory. The retrieved memory vector conditions the BitNet decoder at every layer, increasing the contextual relevance of the generated content. The decoder also combines attention sinks with a sliding-window mechanism to handle long or streaming inputs under tight memory budgets. Together, per-layer conditioning and sliding-window attention yield a strong quality–speed trade-off, delivering competitive captioning and multimodal understanding at low latency with a small model footprint. These characteristics make BitMar well suited for edge deployment.
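To make the fused-query episodic memory and per-layer conditioning described above concrete, the following is a minimal PyTorch sketch. It is an illustrative assumption of one way such components could be wired together, not the paper's actual implementation; all names (EpisodicMemory, FusedConditioner, num_slots, and the dimension choices) are hypothetical.

```python
# Hypothetical sketch of a fixed-size key-value episodic memory queried with a
# fused text-vision embedding, whose retrieval conditions each decoder layer.
# Module names and dimensions are illustrative, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EpisodicMemory(nn.Module):
    """Fixed-size key-value memory addressed by attention over learned slots."""

    def __init__(self, num_slots: int = 512, dim: int = 256):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(num_slots, dim) * 0.02)

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, dim) -> soft attention over slots -> (batch, dim)
        attn = F.softmax(query @ self.keys.t() / self.keys.shape[-1] ** 0.5, dim=-1)
        return attn @ self.values


class FusedConditioner(nn.Module):
    """Fuses compact text/vision embeddings, queries the memory, and emits one
    conditioning vector per decoder layer (per-layer conditioning)."""

    def __init__(self, text_dim: int = 256, vis_dim: int = 384,
                 dim: int = 256, num_layers: int = 12):
        super().__init__()
        self.fuse = nn.Linear(text_dim + vis_dim, dim)
        self.memory = EpisodicMemory(dim=dim)
        self.per_layer = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, text_emb: torch.Tensor, vis_emb: torch.Tensor):
        query = self.fuse(torch.cat([text_emb, vis_emb], dim=-1))
        retrieved = self.memory(query)
        # One conditioning vector per decoder layer, derived from the retrieval.
        return [proj(retrieved) for proj in self.per_layer]


if __name__ == "__main__":
    cond = FusedConditioner()
    text_emb, vis_emb = torch.randn(2, 256), torch.randn(2, 384)
    layer_conditions = cond(text_emb, vis_emb)
    print(len(layer_conditions), layer_conditions[0].shape)  # 12, torch.Size([2, 256])
```

In a full model, each conditioning vector would be injected into the corresponding decoder layer (for example, added to its hidden states), and the encoders and decoder would additionally use 1.58-bit weights and sliding-window attention with attention sinks, which this sketch omits.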
Anthology ID:
2025.babylm-main.11
Volume:
Proceedings of the First BabyLM Workshop
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Y. Hu, Jing Liu, Jaap Jumelet, Tal Linzen, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox, Adina Williams
Venue:
BabyLM
Publisher:
Association for Computational Linguistics
Pages:
147–154
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.11/
Cite (ACL):
Euhid Aman, Esteban Carlin, Hsing-Kuo Kenneth Pao, Giovanni Beltrame, Ghaluh Indah Permata Sari, and Yie-Tarng Chen. 2025. BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices. In Proceedings of the First BabyLM Workshop, pages 147–154, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices (Aman et al., BabyLM 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.11.pdf