Text2Topic: Multi-Label Text Classification System for Efficient Topic Detection in User Generated Content with Zero-Shot Capabilities
Fengjun Wang, Moran Beladev, Ofri Kleinfeld, Elina Frayerman, Tal Shachar, Eran Fainman, Karen Lastmann Assaraf, Sarai Mizrachi, Benjamin Wang
Abstract
Multi-label text classification is a critical task in industry. It helps to extract structured information from large amounts of textual data. We propose Text to Topic (Text2Topic), which achieves high multi-label classification performance by employing a Bi-Encoder Transformer architecture that utilizes concatenation, subtraction, and multiplication of embeddings on both text and topic. Text2Topic also supports zero-shot predictions, produces domain-specific text embeddings, and enables production-scale batch inference with high throughput. The final model achieves accurate and comprehensive results compared to state-of-the-art baselines, including large language models (LLMs). In this study, a total of 239 topics are defined, and around 1.6 million text-topic pair annotations (of which 200K are positive) are collected on approximately 120K texts from 3 main data sources on Booking.com. The data is collected with optimized smart sampling and partial labeling. The final Text2Topic model is deployed on a real-world stream processing platform, and it outperforms other models with 92.9% micro mAP, as well as a 75.8% macro mAP score. We summarize the modeling choices, which are extensively tested through ablation studies, and share detailed in-production decision-making steps.
- Anthology ID:
- 2023.emnlp-industry.10
- Volume:
- Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
- Month:
- December
- Year:
- 2023
- Address:
- Singapore
- Editors:
- Mingxuan Wang, Imed Zitouni
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 93–103
- URL:
- https://aclanthology.org/2023.emnlp-industry.10
- DOI:
- 10.18653/v1/2023.emnlp-industry.10
- Cite (ACL):
- Fengjun Wang, Moran Beladev, Ofri Kleinfeld, Elina Frayerman, Tal Shachar, Eran Fainman, Karen Lastmann Assaraf, Sarai Mizrachi, and Benjamin Wang. 2023. Text2Topic: Multi-Label Text Classification System for Efficient Topic Detection in User Generated Content with Zero-Shot Capabilities. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 93–103, Singapore. Association for Computational Linguistics.
- Cite (Informal):
- Text2Topic: Multi-Label Text Classification System for Efficient Topic Detection in User Generated Content with Zero-Shot Capabilities (Wang et al., EMNLP 2023)
- PDF:
- https://preview.aclanthology.org/ingest-acl-2023-videos/2023.emnlp-industry.10.pdf
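The abstract describes a Bi-Encoder that combines text and topic embeddings through concatenation, subtraction, and multiplication. The sketch below illustrates one way such a pair feature vector could be built and scored. It is a minimal illustration under stated assumptions, not the authors' implementation: the feature layout `[u, v, u - v, u * v]`, the function names `combine_embeddings` and `score_pair`, the linear head with a sigmoid, and the toy two-dimensional embeddings are all hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence


def combine_embeddings(text_emb: Sequence[float], topic_emb: Sequence[float]) -> List[float]:
    """Build a pair feature vector [u, v, u - v, u * v].

    The ordering and the use of a signed difference are assumptions;
    the abstract only names the three operations.
    """
    if len(text_emb) != len(topic_emb):
        raise ValueError("text and topic embeddings must have the same dimension")
    diff = [u - v for u, v in zip(text_emb, topic_emb)]  # element-wise subtraction
    prod = [u * v for u, v in zip(text_emb, topic_emb)]  # element-wise multiplication
    return list(text_emb) + list(topic_emb) + diff + prod  # concatenation


def score_pair(features: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Hypothetical linear head followed by a sigmoid, giving a match probability."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    logit = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy 2-dimensional embeddings; a real encoder would produce these.
text_emb = [0.5, -1.0]
topic_emb = [0.25, 2.0]

features = combine_embeddings(text_emb, topic_emb)
# Untrained (all-zero) weights, used here only to show the call shape.
prob = score_pair(features, weights=[0.0] * len(features))
```

In a multi-label setting, a text would be scored against each topic independently in this way, so one text can match several topics. Because the topic is itself encoded from text, a previously unseen topic description can in principle be scored the same way, which is consistent with the zero-shot capability the abstract mentions.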