CEMTM: Contextual Embedding-based Multimodal Topic Modeling

Amirhossein Abaskohi, Raymond Li, Chuyuan Li, Shafiq Joty, Giuseppe Carenini

Abstract
We introduce CEMTM, a context-enhanced multimodal topic model designed to infer coherent and interpretable topic structures from both short and long documents containing text and images. CEMTM builds on fine-tuned large vision-language models (LVLMs) to obtain contextualized embeddings, and employs a distributional attention mechanism to weight token-level contributions to topic inference. A reconstruction objective aligns topic-based representations with the document embedding, encouraging semantic consistency across modalities. Unlike existing approaches, CEMTM can process multiple images per document without repeated encoding and maintains interpretability through explicit word-topic and document-topic distributions. Extensive experiments on six multimodal benchmarks show that CEMTM consistently outperforms unimodal and multimodal baselines, achieving a remarkable average LLM score of 2.61. Further analysis shows its effectiveness in downstream few-shot retrieval and its ability to capture visually grounded semantics in complex domains such as scientific articles.
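To make the pipeline in the abstract concrete, here is a minimal, hypothetical PyTorch sketch of how the described components could fit together: attention-weighted pooling over the LVLM's contextualized token embeddings, a softmax document-topic distribution, and a reconstruction loss that aligns the topic-based representation with the document embedding. This is an illustration under assumptions, not the authors' implementation; all names (CEMTMSketch, attn_scorer, topic_embeds) are invented, and plain softmax attention stands in for the paper's distributional attention, whose exact form the abstract does not specify.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CEMTMSketch(nn.Module):
    """Hypothetical sketch: token embeddings -> attention-weighted pooling
    -> document-topic distribution -> reconstruction of the doc embedding."""

    def __init__(self, embed_dim: int, num_topics: int):
        super().__init__()
        # Scores each contextualized token (text or image) from the LVLM;
        # a plain stand-in for the paper's distributional attention.
        self.attn_scorer = nn.Linear(embed_dim, 1)
        # Maps the pooled representation to topic logits.
        self.topic_encoder = nn.Linear(embed_dim, num_topics)
        # Learnable topic embeddings used to reconstruct the doc embedding.
        self.topic_embeds = nn.Parameter(torch.randn(num_topics, embed_dim))

    def forward(self, token_embeds, doc_embed):
        # token_embeds: (batch, seq_len, dim); doc_embed: (batch, dim)
        weights = F.softmax(self.attn_scorer(token_embeds), dim=1)  # per-token weights
        pooled = (weights * token_embeds).sum(dim=1)                # (batch, dim)
        theta = F.softmax(self.topic_encoder(pooled), dim=-1)       # doc-topic dist.
        recon = theta @ self.topic_embeds                           # topic-based rep.
        # Reconstruction objective: pull the topic-based representation
        # toward the document embedding (cosine alignment).
        loss = 1.0 - F.cosine_similarity(recon, doc_embed, dim=-1).mean()
        return theta, loss

# Usage with random stand-ins for LVLM outputs:
model = CEMTMSketch(embed_dim=768, num_topics=50)
tokens = torch.randn(2, 128, 768)  # contextualized token embeddings
doc = torch.randn(2, 768)          # document-level embeddings
theta, loss = model(tokens, doc)
```

A word-topic distribution, which the paper uses for interpretability, could be obtained analogously by projecting the topic embeddings onto a vocabulary embedding space; it is omitted here for brevity.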
Anthology ID: 2025.emnlp-main.590
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 11686–11703
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.590/
Cite (ACL): Amirhossein Abaskohi, Raymond Li, Chuyuan Li, Shafiq Joty, and Giuseppe Carenini. 2025. CEMTM: Contextual Embedding-based Multimodal Topic Modeling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 11686–11703, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): CEMTM: Contextual Embedding-based Multimodal Topic Modeling (Abaskohi et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.590.pdf
Checklist: 2025.emnlp-main.590.checklist.pdf