MDC3: A Novel Multimodal Dataset for Commercial Content Classification in Bengali
Anik Mahmud Shanto, Mst. Sanjida Jamal Priya, Fahim Shakil Tamim, Mohammed Moshiul Hoque
Abstract
Identifying commercial posts in resource-constrained languages among diverse and unstructured content remains a significant challenge for automatic text classification tasks. To address this, this work introduces a novel dataset named MDC3 (Multimodal Dataset for Commercial Content Classification), comprising 5,007 annotated Bengali social media posts classified as commercial and noncommercial. A comprehensive annotation guideline accompanying the dataset is included to aid future dataset creation in resource-constrained languages. Furthermore, we performed extensive experiments on MDC3 considering both unimodal and multimodal domains. Specifically, the late fusion of textual (mBERT) and visual (ViT) models (i.e., ViT+mBERT) achieves the highest F1 score of 90.91, significantly surpassing other baselines.
- Anthology ID:
- 2025.naacl-srw.31
- Volume:
- Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
- Month:
- April
- Year:
- 2025
- Address:
- Albuquerque, USA
- Editors:
- Abteen Ebrahimi, Samar Haider, Emmy Liu, Sammar Haider, Maria Leonor Pacheco, Shira Wein
- Venues:
- NAACL | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 311–320
- URL:
- https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.naacl-srw.31/
- Cite (ACL):
- Anik Mahmud Shanto, Mst. Sanjida Jamal Priya, Fahim Shakil Tamim, and Mohammed Moshiul Hoque. 2025. MDC3: A Novel Multimodal Dataset for Commercial Content Classification in Bengali. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop), pages 311–320, Albuquerque, USA. Association for Computational Linguistics.
- Cite (Informal):
- MDC3: A Novel Multimodal Dataset for Commercial Content Classification in Bengali (Shanto et al., NAACL 2025)
- PDF:
- https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.naacl-srw.31.pdf
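
The abstract reports that a late fusion of a text model (mBERT) and an image model (ViT) performs best. The abstract does not say how the two outputs are combined, so the following is only a minimal sketch of one common form of late fusion: each unimodal classifier produces logits for the two classes, and their softmax probabilities are averaged with a weight. The function names, the `text_weight` parameter, the label order, and the example logits are all assumptions made for illustration and are not taken from the paper.

```python
import math

# Hypothetical label order; the paper's actual label encoding is not given here.
LABELS = ["noncommercial", "commercial"]


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(text_logits, image_logits, text_weight=0.5):
    """Weighted average of the class probabilities of two unimodal models.

    text_logits  : scores from a text classifier (e.g. an mBERT head)
    image_logits : scores from an image classifier (e.g. a ViT head)
    text_weight  : assumed mixing weight in [0, 1]; 0.5 is a plain average
    """
    p_text = softmax(text_logits)
    p_image = softmax(image_logits)
    return [
        text_weight * t + (1.0 - text_weight) * i
        for t, i in zip(p_text, p_image)
    ]


def predict(text_logits, image_logits, text_weight=0.5):
    """Return the label with the highest fused probability."""
    fused = late_fusion(text_logits, image_logits, text_weight)
    best = max(range(len(fused)), key=fused.__getitem__)
    return LABELS[best]


if __name__ == "__main__":
    # Made-up logits: the text model is undecided, the image model leans commercial.
    print(predict([0.0, 0.0], [0.0, 2.0]))
```

Because fusion happens on the output probabilities, each unimodal model can be trained and evaluated separately before being combined, which is the usual reason for choosing late over early fusion.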