CEMTM: Contextual Embedding-based Multimodal Topic Modeling
Amirhossein Abaskohi, Raymond Li, Chuyuan Li, Shafiq Joty, Giuseppe Carenini
Abstract
We introduce CEMTM, a context-enhanced multimodal topic model designed to infer coherent and interpretable topic structures from both short and long documents containing text and images. CEMTM builds on fine-tuned large vision language models (LVLMs) to obtain contextualized embeddings, and employs a distributional attention mechanism to weight token-level contributions to topic inference. A reconstruction objective aligns topic-based representations with the document embedding, encouraging semantic consistency across modalities. Unlike existing approaches, CEMTM can process multiple images per document without repeated encoding and maintains interpretability through explicit word-topic and document-topic distributions. Extensive experiments on six multimodal benchmarks show that CEMTM consistently outperforms unimodal and multimodal baselines, achieving a remarkable average LLM score of 2.61. Further analysis shows its effectiveness in downstream few-shot retrieval and its ability to capture visually grounded semantics in complex domains such as scientific articles.
- Anthology ID:
- 2025.emnlp-main.590
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 11686–11703
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.590/
- Cite (ACL):
- Amirhossein Abaskohi, Raymond Li, Chuyuan Li, Shafiq Joty, and Giuseppe Carenini. 2025. CEMTM: Contextual Embedding-based Multimodal Topic Modeling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 11686–11703, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- CEMTM: Contextual Embedding-based Multimodal Topic Modeling (Abaskohi et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.590.pdf
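
The abstract describes three components: attention-weighted pooling of contextualized token embeddings, an explicit document-topic distribution, and a reconstruction objective that aligns the topic-based representation with the document embedding. The sketch below is a minimal, hedged illustration of that general idea, not the authors' implementation; all module names, dimensions, and the choice of cosine-based alignment loss are assumptions made for illustration only.

```python
# Minimal sketch (not the authors' code) of attention-weighted topic pooling with a
# reconstruction-style alignment loss, as loosely described in the abstract.
# All names, shapes, and the exact loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicSketch(nn.Module):
    def __init__(self, embed_dim: int, num_topics: int):
        super().__init__()
        # Scores each (text or image) token's contribution to topic inference.
        self.attn = nn.Linear(embed_dim, 1)
        # Maps the pooled document representation to topic logits.
        self.to_topics = nn.Linear(embed_dim, num_topics)
        # Learnable topic embeddings used to reconstruct the document embedding.
        self.topic_emb = nn.Parameter(torch.randn(num_topics, embed_dim))

    def forward(self, token_emb: torch.Tensor, doc_emb: torch.Tensor):
        # token_emb: (batch, seq_len, embed_dim) contextualized token embeddings
        #            assumed to come from a vision-language encoder.
        # doc_emb:   (batch, embed_dim) document-level embedding from the same encoder.
        weights = torch.softmax(self.attn(token_emb).squeeze(-1), dim=-1)   # (batch, seq_len)
        pooled = torch.einsum("bs,bsd->bd", weights, token_emb)             # weighted pooling
        theta = torch.softmax(self.to_topics(pooled), dim=-1)               # document-topic distribution
        recon = theta @ self.topic_emb                                      # topic-based reconstruction
        loss = 1.0 - F.cosine_similarity(recon, doc_emb, dim=-1).mean()     # alignment objective
        return theta, loss

# Usage with random tensors standing in for encoder outputs:
tokens = torch.randn(2, 128, 768)
doc = torch.randn(2, 768)
model = TopicSketch(embed_dim=768, num_topics=50)
theta, loss = model(tokens, doc)
```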