Multimodal Item Categorization Fully Based on Transformer

Lei Chen, Houwei Chou, Yandi Xia, Hirokazu Miyake


Abstract
The Transformer has proven to be a powerful feature extraction method and has gained widespread adoption in natural language processing (NLP). In this paper we propose a multimodal item categorization (MIC) system solely based on the Transformer for both text and image processing. On a multimodal product data set collected from a Japanese e-commerce giant, we tested a new image classification model based on the Transformer and investigated different ways of fusing bi-modal information. Our experimental results on real industry data showed that the Transformer-based image classifier has performance on par with ResNet-based classifiers and is four times faster to train. Furthermore, a cross-modal attention layer was found to be critical for the MIC system to achieve performance gains over text-only and image-only models.
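The abstract highlights a cross-modal attention layer as the key component for fusing text and image features. The paper itself does not publish code here, so the following is only a minimal illustrative sketch of one common way such a fusion layer is built: text-token features act as queries attending over image-patch features, with a residual connection and a pooled classification head. All names, dimensions, and design choices below are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative cross-modal attention fusion (not the paper's code):
    text tokens attend over image patch features, and the fused
    representation is mean-pooled for category prediction."""

    def __init__(self, dim=768, num_heads=8, num_classes=1000):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_feats, image_feats):
        # text_feats:  (B, T, dim) token features from a text Transformer
        # image_feats: (B, P, dim) patch features from an image Transformer
        attended, _ = self.cross_attn(
            query=text_feats, key=image_feats, value=image_feats
        )
        fused = self.norm(text_feats + attended)  # residual + layer norm
        pooled = fused.mean(dim=1)                # mean-pool over text tokens
        return self.classifier(pooled)            # (B, num_classes) logits
```

In this sketch the text modality drives the queries, which matches the intuition that product titles carry most of the category signal while images disambiguate; the actual query/key assignment in the paper may differ.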
Anthology ID:
2021.ecnlp-1.13
Volume:
Proceedings of the 4th Workshop on e-Commerce and NLP
Month:
August
Year:
2021
Address:
Online
Editors:
Shervin Malmasi, Surya Kallumadi, Nicola Ueffing, Oleg Rokhlenko, Eugene Agichtein, Ido Guy
Venue:
ECNLP
Publisher:
Association for Computational Linguistics
Pages:
111–115
URL:
https://aclanthology.org/2021.ecnlp-1.13
DOI:
10.18653/v1/2021.ecnlp-1.13
Cite (ACL):
Lei Chen, Houwei Chou, Yandi Xia, and Hirokazu Miyake. 2021. Multimodal Item Categorization Fully Based on Transformer. In Proceedings of the 4th Workshop on e-Commerce and NLP, pages 111–115, Online. Association for Computational Linguistics.
Cite (Informal):
Multimodal Item Categorization Fully Based on Transformer (Chen et al., ECNLP 2021)
PDF:
https://preview.aclanthology.org/ingest-2024-clasp/2021.ecnlp-1.13.pdf