Abstract
The Transformer has proven to be a powerful feature extraction method and has gained widespread adoption in natural language processing (NLP). In this paper we propose a multimodal item categorization (MIC) system solely based on the Transformer for both text and image processing. On a multimodal product data set collected from a Japanese e-commerce giant, we tested a new image classification model based on the Transformer and investigated different ways of fusing bi-modal information. Our experimental results on real industry data showed that the Transformer-based image classifier has performance on par with ResNet-based classifiers and is four times faster to train. Furthermore, a cross-modal attention layer was found to be critical for the MIC system to achieve performance gains over text-only and image-only models.
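The abstract does not include implementation details, but the cross-modal attention fusion it highlights can be illustrated with a minimal PyTorch-style sketch, assuming text token features (e.g., from BERT) attend over image patch features (e.g., from a ViT-style encoder). The module names, dimensions, and the choice of classifying from the first token are illustrative assumptions, not the authors' published design.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical fusion layer: text queries attend over image keys/values,
    then the fused first-token state is classified. Not the authors' code."""
    def __init__(self, dim=768, num_heads=8, num_classes=100):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_feats, image_feats):
        # text_feats:  (B, T, dim) token embeddings from a text Transformer
        # image_feats: (B, P, dim) patch embeddings from an image Transformer
        attended, _ = self.cross_attn(query=text_feats,
                                      key=image_feats,
                                      value=image_feats)
        fused = self.norm(text_feats + attended)  # residual connection + layer norm
        return self.classifier(fused[:, 0])       # classify from the first (CLS-like) token

if __name__ == "__main__":
    model = CrossModalFusion(dim=768, num_heads=8, num_classes=50)
    text = torch.randn(4, 32, 768)    # batch of 4 items, 32 text tokens each
    image = torch.randn(4, 197, 768)  # batch of 4 items, 197 image patch tokens each
    logits = model(text, image)
    print(logits.shape)  # torch.Size([4, 50])
```

In this sketch the fusion is asymmetric (text attends to image); other variants, such as symmetric co-attention or simple feature concatenation, are among the fusion strategies such a system could compare.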
- Anthology ID: 2021.ecnlp-1.13
- Volume: Proceedings of the 4th Workshop on e-Commerce and NLP
- Month: August
- Year: 2021
- Address: Online
- Editors: Shervin Malmasi, Surya Kallumadi, Nicola Ueffing, Oleg Rokhlenko, Eugene Agichtein, Ido Guy
- Venue: ECNLP
- Publisher: Association for Computational Linguistics
- Pages: 111–115
- URL: https://aclanthology.org/2021.ecnlp-1.13
- DOI: 10.18653/v1/2021.ecnlp-1.13
- Cite (ACL): Lei Chen, Houwei Chou, Yandi Xia, and Hirokazu Miyake. 2021. Multimodal Item Categorization Fully Based on Transformer. In Proceedings of the 4th Workshop on e-Commerce and NLP, pages 111–115, Online. Association for Computational Linguistics.
- Cite (Informal): Multimodal Item Categorization Fully Based on Transformer (Chen et al., ECNLP 2021)
- PDF: https://preview.aclanthology.org/ingest-2024-clasp/2021.ecnlp-1.13.pdf