Investigating Active Learning Sampling Strategies for Extreme Multi Label Text Classification
Lukas Wertz, Katsiaryna Mirylenka, Jonas Kuhn, Jasmina Bogojeska
Abstract
Large-scale, multi-label text datasets with high numbers of different classes are expensive to annotate, even more so if they deal with domain-specific language. In this work, we aim to build classifiers on these datasets using Active Learning in order to reduce the labeling effort. We outline the challenges when dealing with extreme multi-label settings and show the limitations of existing Active Learning strategies by focusing on their effectiveness as well as efficiency in terms of computational cost. In addition, we present five multi-label datasets which were compiled from hierarchical classification tasks to serve as benchmarks in the context of extreme multi-label classification for future experiments. Finally, we provide insight into multi-class, multi-label evaluation and present an improved classifier architecture on top of pre-trained transformer language models.
- Anthology ID: 2022.lrec-1.490
- Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
- Month: June
- Year: 2022
- Address: Marseille, France
- Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
- Venue: LREC
- Publisher: European Language Resources Association
- Pages: 4597–4605
- URL: https://aclanthology.org/2022.lrec-1.490
- Cite (ACL): Lukas Wertz, Katsiaryna Mirylenka, Jonas Kuhn, and Jasmina Bogojeska. 2022. Investigating Active Learning Sampling Strategies for Extreme Multi Label Text Classification. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4597–4605, Marseille, France. European Language Resources Association.
- Cite (Informal): Investigating Active Learning Sampling Strategies for Extreme Multi Label Text Classification (Wertz et al., LREC 2022)
- PDF: https://preview.aclanthology.org/nschneid-patch-2/2022.lrec-1.490.pdf
- Data: New York Times Annotated Corpus, RCV1