Approximating Learning Curves for Active-Learning-Driven Annotation

Katrin Tomanek, Udo Hahn


Abstract
Active learning (AL) is getting more and more popular as a methodology to considerably reduce the annotation effort when building training material for statistical learning methods for various NLP tasks. A crucial issue rarely addressed, however, is when to actually stop the annotation process to profit from the savings in efforts. This question is tightly related to estimating the classifier performance after a certain amount of data has already been annotated. While learning curves are the default means to monitor the progress of the annotation process in terms of classifier performance, this requires a labeled gold standard which - in realistic annotation settings, at least - is often unavailable. We here propose a method for committee-based AL to approximate the progression of the learning curve based on the disagreement among the committee members. This method relies on a separate, unlabeled corpus and is thus well suited for situations where a labeled gold standard is not available or would be too expensive to obtain. Considering named entity recognition as a test case we provide empirical evidence that this approach works well under simulation as well as under real-world annotation conditions.
Anthology ID:
L08-1208
Volume:
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Month:
May
Year:
2008
Address:
Marrakech, Morocco
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Daniel Tapias
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2008/pdf/335_paper.pdf
DOI:
Bibkey:
Cite (ACL):
Katrin Tomanek and Udo Hahn. 2008. Approximating Learning Curves for Active-Learning-Driven Annotation. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).
Cite (Informal):
Approximating Learning Curves for Active-Learning-Driven Annotation (Tomanek & Hahn, LREC 2008)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2008/pdf/335_paper.pdf