On the Limitations of Simulating Active Learning

Katerina Margatina, Nikolaos Aletras


Abstract
Active learning (AL) is a human-and-model-in-the-loop paradigm that iteratively selects informative unlabeled data for human annotation, aiming to improve data efficiency over random sampling. However, performing AL experiments with human annotations on-the-fly is a laborious and expensive process, and thus unrealistic for academic research. An easy fix to this impediment is to simulate AL, by treating an already labeled and publicly available dataset as the pool of unlabeled data. In this position paper, we first survey recent literature and highlight the challenges across all different steps within the AL loop. We further unveil neglected caveats in the experimental setup that can significantly affect the quality of AL research. We continue with an exploration of how the simulation setting can govern empirical findings, arguing that it might be one of the answers behind the ever-posed question “Why do Active Learning algorithms sometimes fail to outperform random sampling?”. We argue that evaluating AL algorithms on available labeled datasets might provide a lower bound as to their effectiveness in real data. We believe it is essential to collectively shape the best practices for AL research, especially now that the stellar engineering advances (e.g., ChatGPT) shift the research focus to data-driven approaches. To this end, we present guidelines for future work, hoping that by bringing these limitations to the community’s attention, we can explore ways to address them.
Anthology ID:
2023.findings-acl.269
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4402–4419
URL:
https://aclanthology.org/2023.findings-acl.269
DOI:
10.18653/v1/2023.findings-acl.269
Cite (ACL):
Katerina Margatina and Nikolaos Aletras. 2023. On the Limitations of Simulating Active Learning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 4402–4419, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
On the Limitations of Simulating Active Learning (Margatina & Aletras, Findings 2023)
PDF:
https://preview.aclanthology.org/naacl24-info/2023.findings-acl.269.pdf