We
consider
the
impact
Active
Learning
(
AL
)
has
on
effective
and
efficient
text
corpus
annotation
,
and
report
on
reduction
rates
for
annotation
efforts
of up to 72%
.
We
also
address
the
question of whether
a
corpus
annotated
by
means
of
AL
-
using
a
particular
classifier
and
a
particular
feature
set
-
can
be
re-used
to
train
classifiers
different
from
the
ones
employed
by
AL
,
supplying
alternative
feature
sets
as
well
.
We
,
finally
,
report
on
our
experience
with
the
AL
paradigm
under
real-world
conditions
,
i.e.
,
the
annotation
of
large-scale
document
corpora
for
the
life
sciences
.
1
Introduction
The
annotation
of
corpora
has
become
a
crucial
prerequisite
for
NLP
utilities
which
rely
on
(
semi
-
)
supervised
machine
learning
(
ML
)
techniques
.
While
stability
,
by
and
large
,
has
been
reached
for
tagsets
up to
the
syntax
layer
,
semantic
annotations
in
terms
of
(
named
)
entities
,
semantic
roles
,
propositions
,
events
,
etc.
reveal
a
high
degree
of
variability
due
to
the
inherent
domain-dependence
of the
underlying
tagsets
.
This
diversity
fuels
a
continuous
need
for
creating
semantic
annotation
data
anew
.
Accordingly
,
annotation
activities
will
persist
and
even
increase
in
number
as
HLT
is
expanding
on
various
technical
and
scientific
domains
(
e.g.
,
the
life
sciences
)
outside
the
classical
general-language
newspaper
genre
.
Since
the
provision
of
annotations
is
a
costly
,
labor-intensive
and
error-prone
process
the
amount
of
work
and
time
this
activity
requires
should
be
minimized
to
the
extent
that
corpus
data
could
still
be
used
to
effectively
train
ML-based
NLP
components
on
them
.
The
approach
we
advocate
does
exactly
this
and
yields
reduction
gains
(
compared
with
standard
procedures
)
ranging from 48% to 72%
,
without
seriously
sacrificing
annotation
quality
.
Various
techniques
to
minimize
the
necessary
amount
of
annotated
training
material
have
already
been
investigated
.
In
co-training
(
Blum
and
Mitchell
,
1998
)
,
e.g.
,
from
a
small
initial
set
of
labeled
data
multiple
learners
mutually
provide
new
training
material
for
each
other
by
labeling
unseen
examples
.
Pierce
and
Cardie
(
2001
)
have
shown
,
however
,
that
for
tasks
which
require
large
numbers
of
labeled
examples
-
such
as
most
NLP
tasks
-
co-training
might
be
inadequate
because
it
tends
to
generate
noisy
data
.
Furthermore
,
a
well-compiled
initial
training
set
is
a
crucial
prerequisite
for
successful
co-training
.
As
another
alternative
for
minimizing
annotation
work
,
active
learning
(
AL
)
is
based
on
the
idea
of letting
the
learner
have
control
over
the
examples
to
be
manually
labeled
so
as
to
optimize
the
prediction
accuracy
.
Accordingly
,
AL
aims
at
selecting
those
examples
with
high
utility
for
the
model
.
AL
(
as
well
as
semi-supervised
methods
)
is
typically
considered
as
a
learning
protocol
,
i.e.
,
to
train
a
particular
classifier
.
In
contrast
,
we
here
propose
to
employ
AL
as
a
corpus
annotation
method
.
A
corpus
built
on
these
premises
must
,
however
,
still
be
reusable
in
a
flexible
way
so
that
,
e.g.
,
training
with
modified
or
improved
classifiers
is
feasible
and
reasonable
on
AL-generated
corpora
.
Baldridge
and
Osborne
(
2004
)
have
already
argued
that
this
is
a
highly
critical
requirement
because
the
examples
selected
by
AL
are
tuned
to
one
particular
classifier
.
The
second
major
contribution
of
this
paper
addresses
this
issue
and
provides
empirical
evidence
that
corpora
built
with
one
type
of
classifier
(
based
on
Maximum
Entropy
)
can
reasonably
be
reused
by
another
,
methodologically
related
type
of
classifier
(
based
on
Conditional
Random
Fields
)
without
requiring
changes
of
the
corpus
data
.
We
also
show
that
feature
sets
being
used
for
training
classifiers
can
be
enhanced
without
invalidating
corpus
annotations
generated
on
the
basis
of
AL
and
,
hence
,
with
a
poorer
feature
set
.
2
Related
Work
There
are
mainly
two
methodological
strands
of
AL
research
,
viz
.
optimization
approaches
which
aim
at
selecting
those
examples
that
optimize
some
(
algorithm-dependent
)
objective
function
,
such
as
prediction
variance
(
Cohn
et
al.
,
1996
)
,
and
heuristic
methods
with
uncertainty
sampling
(
Lewis
and
Catlett
,
1994
)
and
query-by-committee
(
QBC
)
(
Seung
et
al.
,
1992
)
just
to
name
the
most
prominent
ones
.
AL
has
already
been
applied
to
several
NLP
tasks
,
such
as
document
classification
(
Schohn
and
Cohn
,
2000
)
,
POS
tagging
(
Engelson
and
Dagan
,
1996
)
,
chunking
(
Ngai
and
Yarowsky
,
2000
)
,
statistical
parsing
(
Thompson
et
al.
,
1999
;
Hwa
,
2000
)
,
and
information
extraction
(
Lewis
and
Catlett
,
1994
;
Thompson
et
al.
,
1999
)
.
In
a
more
recent
study
,
Shen
et
al.
(
2004
)
consider
AL
for
entity
recognition
based
on
Support
Vector
Machines
.
Here
,
the
informativeness
of
an
example
is
estimated
by
the
distance
to
the
hyperplane
of
the
currently
learned
SVM
.
It
is
assumed
that
an
example
which
lies
close
to
the
hyperplane
has
high
chances
to
have
an
effect
on
training
.
This
approach
is
essentially
limited
to
the
SVM
learning
scheme
as
it
solely
relies
on
SVM-internal
selection
criteria
.
Hachey
et
al.
(
2005
)
propose
a
committee-based
AL
approach
where
the
committee
's
classifiers
constitute
multiple
views
on
the
data
by
employing
different
feature
subsets
.
The
authors
focus
on
(
possible
)
negative
side
effects
of
AL
on
the
annotations
.
They
argue
that
AL
annotations
are
cognitively
more
difficult
to
deal
with
for
the
annotators
(
because
of
the
increased
complexity
of
the
selected
sentences
)
.
Hence
,
lower
annotation
quality
and
higher
per-sentence
annotation
times
might
be
a
concern
.
There
are
controversial
findings
on
the
reusability
of
data
annotated
by
means
of
AL
for
the
problem
of
parse
tree
selection
.
Whereas
Hwa
(
2001
)
reports
positive
results
,
Baldridge
and
Osborne
(
2004
)
argue
that
AL
based
on
uncertainty
sampling
may
face
serious
performance
degradation
when
labeled
data
is
reused
for
training
a
classifier
different
from
the
one
employed
during
AL
.
For
committee-based
AL
,
however
,
there
is
a
lack
of
work
on
reusability
.
Our
experiments
with
committee-based
AL
for
entity
recognition
,
however
,
reveal
that
for
this
task
at
least
,
reusability
can
be
guaranteed
to
a
very
large
extent
.
3
AL
for
Corpus
Annotation
-
Requirements
for
Practical
Use
AL
frameworks
for
real-world
corpus
annotation
should
meet
the
following
requirements
:
fast
selection
time
cycles
—
AL-based
corpus
annotation
is
an
interactive
process
in
which
a batch of b sentences is selected
by
the
AL
engine
for
human
annotation
.
Once
the
annotated
data
is
supplied
,
the
AL
engine
retrains
its
underlying
classifier
(
s
)
on
all
available
annotations
and
then
re-classifies
all
unseen
corpus
items
.
After
that
the
most
informative
(
i.e.
,
deviant
)
b
sentences
from
the
set
of
newly
classified
data
are
selected
for
the
next
iteration
round
.
In
this
approach
the
time
needed
to
select
the
next
examples
(
which
is
the
idle
time
of
the
human
annotators
)
has
to
be
kept
at
an
acceptable
limit
of
a
few
minutes
only
.
There
are
various
AL
strategies
which
-
although
they
yield
theoretically
near-optimal
sample
selection
-
turn
out
to
be
actually
impracticable
for
real-world
use
because
of
excessively
high
computation
times
(
cf.
Cohn
et
al.
(
1996
)
)
.
Thus
,
AL-based
annotation
should
be
based
on
a
computationally
tractable
and
task-wise
feasible
and
acceptable
selection
strategy
(
even
if
this
might
imply
a
suboptimal
reduction
of
annotation
costs
)
.
reusability
—
The
examples
AL
selects
for
manual
annotation
are
dependent
on
the
model
being
used
,
up
to
a
certain
extent
(
Baldridge
and
Osborne
,
2004
)
.
During
annotation
time
,
however
,
the
best
model
might
not
be
known
and
model
tuning
(
especially
the
choice
of features
)
is
typically
performed
once
a
training
corpus
is
available
.
Hence
,
from
a
practical
point
of
view
,
the
resulting
corpus
should
be
reusable
with
modified
classifiers
as
well
.
adaptive
stopping
criterion
—
An
explicit
and
adaptive
stopping
criterion
which
is
sensitive
towards
the
already
achieved
level
of
quality
of
the
annotated
corpus
is
clearly
preferred
over
stopping
after
an
a
priori
fixed
number
of
annotation
iterations
.
If
these
requirements
,
especially
the
first
and
the
second
one
,
cannot
be
guaranteed
for
a
specific
annotation
task
one
should
refrain
from
using
AL
.
The
efficiency
of
AL-driven
annotation
(
in
terms
of
the
time
needed
to
compile
high
quality
training
material
)
might
be
worse
compared
to
the
annotation
of
randomly
(
or
subjectively
)
selected
examples
.
4
Framework
for
AL-based
Named
Entity
Annotation
For
named
entity
recognition
(
NER
)
,
each
change
of
the
application
domain
requires
a
more
or
less
profound
change
of
the
types
of
semantic
categories
(
tags
)
being
used
for
corpus
annotation
.
Hence
,
one
may
encounter
a
lack
of
training
material
for
various
relevant
(
sub
)
domains
.
Once
this
data
is
available
,
however
,
one
might
want
to
modify
the
features
of
the
final
classifier
with
respect
to
the
specific
entity
types
.
Thus
,
a
corpus
annotated
by
means
of
AL
has
to
provide
the
flexibility
to
modify
the
features
of
the
final
classifier
.
To
meet
the
requirements
from
above
under
the
constraints
of
a
real-world
annotation
task
,
we
decided
for
QBC-based
AL
,
a
heuristic
AL
approach
,
which
is
computationally
less
complex
and less resource-greedy
than
objective
function
AL
methods
(
the
latter
explicitly
quantify
the
differences
between
the
current
and
an
ideal
classifier
in
terms
of
some
objective
function
)
.
Accordingly
,
we
ruled
out
uncertainty
sampling
,
another
heuristic
AL
approach
,
because
it
was
shown
before
that
QBC
is
more
efficient
and
robust
(
Freund
et
al.
,
1997
)
.
QBC
is
based
on
the
idea
to
select
those
examples
for
manual
annotation
on
which
a
committee
of classifiers
disagree
most
in
their
predictions
(
Engelson
and
Dagan
,
1996
)
.
A
committee
consists
of
a
number
of
k
classifiers
of
the
same
type
(
same
learning
algorithm
,
parameters
,
and
features
)
but
trained
on
different
subsets
of
the
training
data
.
QBC-based
AL
is
also
iterative
.
In
each
AL
round
the
committee
's
k
classifiers
are
trained
on
the
already
annotated
data
C
,
then
a
pool
of
unannotated
data
P
is
predicted
with
each
classifier
resulting
in
k
automatically
labeled
versions
of
P.
These
are
then
compared
according
to
their
labels
.
Those
with
the
highest
variance
are
selected
for
manual
annotation
.
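To make the committee-based selection loop concrete, the following sketch (in Python) outlines one possible realization of the procedure just described. The helper names (train_classifier, disagreement, annotate) and the training-subset ratio are our own assumptions, not the implementation actually used in this work.

import random

def qbc_annotation(pool, seed_annotations, train_classifier, disagreement, annotate,
                   k=3, batch_size=20, subset_ratio=0.7):
    """Sketch of QBC-based AL corpus annotation (helper names are hypothetical).

    pool             : list of unannotated sentences (P)
    seed_annotations : small initial set of manually annotated sentences
    train_classifier : trains one classifier on a list of annotated sentences
    disagreement     : maps the k predictions for one sentence to a disagreement score
    annotate         : asks the human annotator to label one sentence
    """
    annotated = list(seed_annotations)  # already annotated data C
    # In practice the loop is terminated by the stopping criterion of Section 4.3
    # (average disagreement approaching zero) rather than by exhausting the pool.
    while pool:
        # Train k classifiers of the same type on different subsets of C.
        committee = [
            train_classifier(random.sample(annotated,
                                           max(1, int(subset_ratio * len(annotated)))))
            for _ in range(k)
        ]
        # Predict the whole pool with each classifier and score every sentence
        # by the committee's disagreement on it.
        scored = [(disagreement([clf.predict(sent) for clf in committee]), sent)
                  for sent in pool]
        # Select the batch_size sentences with the highest disagreement.
        scored.sort(key=lambda item: item[0], reverse=True)
        batch = [sent for _, sent in scored[:batch_size]]
        # Manual annotation of the selected batch; move it from P to C.
        for sent in batch:
            pool.remove(sent)
            annotated.append(annotate(sent))
    return annotated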
4.1
Selection
Strategy
In
each
iteration
,
a
batch
of
b
examples
is
selected
for
manual
annotation
.
The
informativeness
of
an
example
is
estimated
in
terms
of
the
disagreement
,
i.e.
,
the
uncertainty
among
the
committee
's
classifiers
on
classifying
a
particular
example
.
This
is
measured
by
the
vote
entropy
(
Engelson
and
Dagan
,
1996
)
,
i.e.
,
the
entropy
of
the
distribution
of
classifications
assigned
to
an
example
by
the
classifiers
.
Vote
entropy
is
defined
on
the
token
level
t
as:

$$D_{tok}(t) = -\frac{1}{\log k} \sum_{i} \frac{V(l_i, t)}{k} \log \frac{V(l_i, t)}{k}$$

where $V(l_i, t)/k$ is the ratio of the $k$ classifiers which assign the label $l_i$ to the token $t$.
As
(
named
)
entities
often
span
more
than
a
single
text
token
we
consider
complete
sentences
as
a
reasonable
example
size
unit (see footnote 1)
for
AL
and
calculate
the
disagreement
of
a
sentence
Dsent
as
the
mean
vote
entropy
of
its
single
tokens
.
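As an illustration, the token-level vote entropy and the sentence disagreement Dsent could be computed as in the following sketch (Python; the representation of the committee's votes is an assumption on our part, not the authors' actual code):

import math
from collections import Counter

def token_vote_entropy(votes, k):
    # votes: the k labels assigned to one token by the committee's classifiers.
    # Normalized by log(k) so that maximal disagreement yields 1.
    counts = Counter(votes)
    entropy = -sum((c / k) * math.log(c / k) for c in counts.values())
    return entropy / math.log(k) if k > 1 else 0.0

def sentence_disagreement(token_votes, k):
    # token_votes: one entry per token of the sentence, each a list of k labels.
    # Dsent is the mean vote entropy over the sentence's tokens.
    if not token_votes:
        return 0.0
    return sum(token_vote_entropy(v, k) for v in token_votes) / len(token_votes)

# Example: a two-token sentence judged by k = 3 classifiers.
# sentence_disagreement([["B-gene", "B-gene", "O"], ["O", "O", "O"]], 3) is roughly 0.29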
Since
the
vote
entropy
is
minimal
when
all
classifiers
agree
in
their
vote
,
sentences
with
high
disagreement
are
preferred
for
manual
annotation
.
With
informed
decisions
of
human
annotators
made
available
,
the
potential
for
future
disagreement
of
the
classifier
committee
on
conflicting
instances
should
decrease
.
Thus
,
each
AL
iteration
selects
the
b
sentences
with
the
highest
disagreement
to
focus
on
the
most
controversial
decision
problems
.
Besides
informativeness
,
additional
criteria
can
be
envisaged
for
the
selection
of
examples
,
e.g.
,
diversity of a batch and representativeness of the respective example (to avoid outliers) (Shen et al., 2004).

Footnote 1: Sentence-level examples are but one conceivable grain size. Lower grains (such as clauses or phrases) as well as higher grains (e.g., paragraphs or abstracts) are equally possible, with different implications for the AL process.

feature class             | description
orthographical            |
lexical and morphological | prefix and suffix of length 3, stemmed version of each token
syntactic                 | the token's part-of-speech tag
contextual                | features of neighboring tokens

Table 1: Features used for AL
We
experimented
with
these
more
sophisticated
selection
strategies
but
preliminary
experiments
did
not
reveal
any
significant
improvement
of
the
AL
performance
.
Engelson
and
Dagan
(
1996
)
confirm
this
observation
that
,
in
general
,
different
(
and
even
more
refined
)
selection
methods
still
yield
similar
results
.
Moreover
,
strategies
incorporating
more
selection
criteria
often
require
more
parameters
to
be
set
.
However
,
proper
parametrization
is
hard
to
achieve
in
real-world
applications
.
Using
disagreement
exclusively
for
selection
requires
only
one
parameter
,
viz
.
the
batch
size
b
,
to
be
specified
.
4.2
Classifier
and
Features
For
our
AL
framework
we
decided
to
employ
a
Maximum
Entropy
(
ME
)
classifier
(
Berger
et
al.
,
1996
)
.
We
employ
a
rich
set
of
features
(
see
Table
1
)
which
are
general
enough
to
be
used
in
most
(
sub
)
domains
for
entity
recognition
.
We
intentionally
avoided
using
features
such
as
semantic
triggers
or
external
dictionary
look-ups
because
they
depend
a
lot
on
the
specific
subdomain
and
entity
types
being
used
.
However
,
one
might
add
them
to
fine-tune
the
final
classifier
,
if
needed
.
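To illustrate what such a general-purpose feature set may look like in practice, the sketch below derives token-level features of the four classes listed in Table 1; the concrete feature names, the context window size, and the dictionary-free orthographic tests are our own assumptions rather than the exact templates used in our system.

def token_features(tokens, pos_tags, stems, i, window=1):
    """Sketch of general-purpose NER features for the token at position i.

    tokens, pos_tags, stems : parallel lists for one sentence
    Returns a dict of feature-name -> value pairs (assumed representation).
    """
    tok = tokens[i]
    feats = {
        # orthographical features
        "is_capitalized": tok[:1].isupper(),
        "is_all_caps": tok.isupper(),
        "has_digit": any(ch.isdigit() for ch in tok),
        # lexical and morphological features
        "prefix3": tok[:3],
        "suffix3": tok[-3:],
        "stem": stems[i],
        # syntactic feature
        "pos": pos_tags[i],
    }
    # contextual features: the same kind of information for neighboring tokens
    for offset in range(-window, window + 1):
        if offset != 0 and 0 <= i + offset < len(tokens):
            feats[f"tok[{offset:+d}]"] = tokens[i + offset]
            feats[f"pos[{offset:+d}]"] = pos_tags[i + offset]
    return feats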
ME
classifiers
outperform
their
generative
counterparts
(
e.g.
,
Naive
Bayesian
classifiers
)
because
they
can
easily
handle
overlapping
,
probably
dependent
features
which
might
be
contained
in
rich
feature
sets
.
We
also
favored
an
ME
classifier
over
an
SVM
one
because
the
latter
is
computationally
much
more
complex
on
rich
feature
sets
and
multiple
classes
and
is
thus
not
so
well
suited
for
an
interactive
process
like
AL
.
It
has
been
shown
that
Conditional
Random
Fields
(
CRF
)
(
Lafferty
et
al.
,
2001
)
achieve
higher
performance
on
many
NLP
tasks
,
such
as
NER
,
but
on
the
other
hand
they
are
computationally
more
complex
than
an
ME
classifier
making
them
also
impractical
for
the
interactive
AL
process
.
Thus
,
in
our
committee
we
employ
ME
classifiers
to
meet
requirement
1
(
fast
selection
time
cycles
)
.
However
,
in
the
end
we
want
to
use
the
annotated
corpora
to
train
a
CRF
and
will
thus
examine
the
reusability
of
such
an
ME-annotated
AL
corpus
for
CRFs
(
cf.
Subsection
5.2
)
.
4.3
Stopping
Criterion
A
question
hardly
addressed
up
until
now
is
when
to
actually
terminate
the
AL
process
.
Usually, it is stopped when the performance that supervised learning with the specific classifier would achieve is reached.
The
problem
with
such
an
approach
is
,
however
,
that
in
practice
one
does
not
know
the
performance
level
which
could
possibly
be
achieved
on
an
unannotated
corpus
.
An
apparent
way
to
monitor
the
progress
of
the
annotation
process
is
to
periodically
(
e.g.
,
after
each
AL
iteration
)
train
a
classifier
on
the
data
annotated
so
far
and
evaluate
it
against
some
randomly
selected
gold
standard
.
When
the
relative
performance
growth
of
each
AL
iteration
falls
below
a
certain
threshold
this
might
be
a
good
reason
to
stop
the
annotation
.
Though
this
is
probably
the
most
reliable
way
,
it
is
impractical
for
many
scenarios
since
assembling
and
manually
annotating
a
representative
gold
standard
may
already
be
quite
a
laborious
task
.
Thus
,
a
measure
from
which
we
can
predict
the
development
of
the
learning
curve
would
be
beneficial
.
One
way
to
achieve
this
goal
is
to
monitor
the
rate
of
disagreement
among
the
different
classifiers
after
each
iteration
.
This
rate
will
descend
as
the
classifiers
get
more
and
more
robust
in
their
predictions
on
unseen
data
.
Thus
,
an
average
disagreement
approaching
zero
can
be
interpreted
as
an
indication
that
additional
annotations
will
not
render
any
further
improvement
.
In
our
experiments
,
we
will
show
that
this
is
a
valid
stopping
criterion
,
indeed
.
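A minimal sketch of such a disagreement-based stopping check follows; the threshold value and the smoothing window are assumed for illustration and are not parameters taken from our experiments.

def should_stop(disagreement_history, threshold=0.01, window=3):
    """Adaptive stopping criterion: stop once the average disagreement of the
    last `window` AL iterations has approached zero (fallen below `threshold`).

    disagreement_history : per-iteration mean disagreement of the selected batch
    """
    if len(disagreement_history) < window:
        return False
    recent = disagreement_history[-window:]
    return sum(recent) / window < threshold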
5
Experiments
and
Results
For
our
experiments
,
we
specified
the
following
three
parameters
:
the
batch
size
b
(
i.e.
,
the
number
of
sentences
to
be
selected
for
each
AL
iteration
)
,
the
size
and
composition
of
the
initial
training set
,
and
the
number
of
k
classifiers
in
a
committee
.
The
smaller
the
batch
size
,
the
higher
the
AL
performance
turns
out
to
be
.
In
the
special
case
of
batch
size
of
b
=
1
only
that
example
with
the
highest
disagreement
is
selected
.
This
is
certainly
impractical
since
after
each
AL
iteration
a
new
committee
of
classifiers
has
to
be
trained
causing
unwarranted
annotation
idle
time
.
We
found
b
=
20
to
be
a
good
compromise
between
the
annotators
'
idle
time
and
AL
performance
.
The
initial
training
set
also
contains
20
sentences, which are, however, selected randomly.
Our
committee
consists
of
k
=
3
classifiers
,
which
is
a
good
trade-off
between
computational
complexity
and
diversity
.
Although
the
AL
iterations
were
performed
on
the
sentence
level
,
we
report
on
the
number
of
annotated
tokens
.
Since
sentences
may
considerably
vary
in
their
length
the
number
of
tokens
constitutes
a
better
measure
for
annotation
costs
.
We
ran
our
experiments
on
two
common
entity-annotated
corpora
from
two
different
domains
(
see
Table
2
)
.
From
the
general-language
newspaper
domain
,
we
used
the
English
data
set
of
the
CoNLL-2003
shared
task
(
Tjong
Kim
Sang
and
De
Meulder
,
2003
)
.
It
consists
of
a
collection
of
newswire
articles
from
the
Reuters
Corpus,2
which
comes
annotated
with
three
entity
types
:
persons
,
locations
,
and
organizations
.
From
the
sublanguage
biology
domain
we
used
the
oncology
part
of
the
PENNBIOIE
corpus
which
consists
of
some
1150
PubMed
abstracts
.
Originally
,
this
corpus
contains
gene
,
variation
event
,
and
malignancy
entity
annotations
.
Manual
annotation
after
each
AL
round
was
simulated
by
moving
the
selected
sentences
from
the
pool
of
unannotated
sentences
P
to
the
training
corpus
T.
For
our
simulations
,
we
built
two
subcorpora
by
filtering
out
entity
annotations
:
the
PENNBIOIE
gene
corpus
(
PBgene
)
,
including
the
three
gene
entity
subtypes
generic
,
protein
,
and
rna
,
and
the
PENNBIOIE
variation
events
corpus
(PBvar), including
the
variation
entity
subtypes
type
,
event
,
location
,
state-altered
,
state-generic
,
and
state-original
.
We
split
all
three
corpora
into
two
subsets
,
viz
.
AL
simulation
data
and
gold
standard
data
on
which
we
evaluate a classifier in terms of f-score, trained on the annotated corpus after each AL iteration (learning curve; see footnote 3).

Footnote 3: We use a strict evaluation criterion which only counts exact matches as true positives because annotations having incorrect boundaries are insufficient for manual corpus annotation.

data set | entities   | sentences
CoNLL    | 3 entities |
PBgene   | 3 entities |
PBvar    | 6 entities |

Table 2: Corpora used in the Experiments
As
far
as
the
CoNLL
corpus
is
concerned
,
we
have
used
CoNLL
's
training
set
for
AL
and
CoNLL
's
test
set
as
gold
standard
.
As
for
PBgene
and
PBvar
,
we
randomly
split
the
corpora
into
90
%
for
AL
and
10
%
as
gold
standard
.
In
the
following
experiments
we
will
refer
to
the
classifiers
used
in
the
AL
committee
as
selectors
,
and
the
classifier
used
for
evaluation
as
the
tester
.
5.1
Efficiency
of
AL
and
the
Applicability
of
the
Stopping
Criterion
In
a
first
series
of
experiments
,
we
evaluated
whether
AL-based
annotations
can
significantly
reduce
the
human
effort
compared
to
the
standard
annotation
procedure
where
sentences
are
selected
randomly
(
or
subjectively
)
.
We
also
show
that
disagreement
is
an
accurate
stopping
criterion
.
As
described
in
Section
4.2
,
we
here
employed
a
committee
of
ME
classifiers
for
AL
;
a
CRF
was
used
as
tester
for
both
the
AL
and
the
random
selection
.
Figures
1
,
2
,
and
3
depict
the
learning
curves
for
AL
selection
and
random
selection
(
upper
two
curves
)
and
the
respective
disagreement
curves
(
lower
curve
)
.
The
random
selection
curves
contained
in
these
plots
are
averaged
over
three
random
selection
runs
.
Figure 1: CoNLL Corpus: Learning/Disagreement Curves

Figure 2: PBgene Corpus: Learning/Disagreement Curves

Figure 3: PBvar Corpus: Learning/Disagreement Curves
Table 3: Reduction of Annotation Costs Achieved with AL-based Annotation
83
%
,
the
annotation
effort
can
be
reduced
by
about
53
%
using
AL
.
On
PBvar
,
an
f-score
of
about
80
%
is
reached
after
≈
56,000
tokens
when
using
AL
selection
,
while
200,000
tokens
are
needed
with
random
selection
.
For
this
task
,
AL
reduces
the
annotation
effort
by
about
72
%
.
Here
,
the
disagreement
curve
approaches
values
of
zero
after
approximately
80,000
tokens
.
At
about
this
point
the
learning
curve
reaches
its
maximum
of
about
81
%
f-score
.
Table
3
summarizes
the
reduction
of
annotation
costs
achieved
on
all
three
corpora
.
Comparing
both
PENNBIOIE
simulations
,
obviously
,
the
reduction
of
annotation
costs
through
AL
is
much
higher
for
the
variation
type
entities
than
for
the
gene
entities
.
We
hypothesize
this
to
be
mainly
due
to
incomparable
entity
densities
.
Whereas
the
gene
entities
are
quite
frequent
(
about
1.3
per
sentence
on
average
)
,
the
variation
entities
are
rather
sparse
(
0.62
per
sentence
on
average
)
making
it
an
ideal
playground
for
AL-based
annotation
.
Our
experiments
also
reveal
that
disagreement
approaching
values
of
zero
is
a
valid
stopping
criterion
.
This
is
,
under
all
circumstances
,
definitely
the
point
when
AL-based
annotation
should
stop
because
then
all
classifiers
of
the
committee
vote
consistently
.
Any
further
selection
-
even
though
AL
selection
is
used
-
is
then
,
actually
,
a
random
selection
.
If
,
due
to
reasons
whatsoever
,
further
annotations
are
wanted
,
a
direct
switch
to
random
selection
is
advisable
because
this
is
computationally
less
expensive
than
AL-based
selection
.
5.2 Reusability of the Annotated Data

To
evaluate
whether
the
proposed
AL
framework
for
named
entity
annotation
allows
for
flexible
re-use
of
the
annotated
data
,
we
performed
experiments
where
we
varied
both
the
learning
algorithms
and
the
features
of
the
selectors
.
Figure 4: Algorithm Flexibility on PBvar

Figure 5: Algorithm Flexibility on CoNLL

Figure 6: Feature Flexibility on PBvar

Figure 7: Feature Flexibility on CoNLL

(Plotted curves: AL (CRF committee), AL (ME committee), AL (NB committee), and random selection; x-axis in K tokens.)
First
,
we
analyzed
the
effect
of
different
probabilistic
classifiers
as
selectors
on
the
resulting
learning
curve
of
the
CRF
tester
.
Figures
4
and
5
show
the
learning
curves
on
our
original
ME
committee
,
a
CRF
committee
,
and
also
a
committee
of
Naive
Bayes
(
NB
)
classifiers
.
It
is
not
surprising
that
self-reuse
(
CRF
selectors
and
CRF
tester
)
yields
the
best
results
.
Switching
from
CRF
selectors
to
ME
selectors
has
almost
no
negative
effect
.
Even
with
a
committee
of
NB
selectors
(
an
ML
approach
which
is
essentially
less
well
suited
for
the
NER
task
)
,
AL-based
selection
is
still
substantially
more
efficient
than
random
selection
on
both
corpora
.
This
shows
that
our
approach
to
use
the
less
complex
ME
classifiers
for
the
AL
selection
process
has
the
positive
effect
of
fast
selection
cycle
times
at
almost
no
costs
.
This
is
especially
interesting
as
the
performance
of
an
ME
classifier
trained
in
a supervised
manner
on
the
complete
corpus
is
significantly
worse
(
several
percentage
points
of
f-measure
)
than that of a CRF
.
That
means
,
even
though
an
ME
classifier
is
less
well
suited
as
the
final
classifier
,
it
works
well
as
a
selector
for
CRFs (see footnote 4).
Second
,
we
ran
experiments
on
selectors
with
only
some
features
and
our
CRF
tester
with
all
features
(
cf.
Table
1
)
.
Feature
subset
1
(
sub1
)
contains
all
but
the
syntactic
features
.
In
the
second
subset
(
sub2
)
,
also
morphological
and
lexical
features
are
missing
.
The
third
set
(
sub3
)
only
contains
orthographical
features
.
We ran an AL simulation for each feature subset with a committee of CRF selectors (see footnote 5).

Footnote 4: We have also conducted experiments where we varied the learning algorithms of the tester (we experimented with NB, ME, MEMM, and CRFs), with comparable results. In a realistic scenario, however, one would rather choose a CRF as the final tester over, e.g., an NB.

Footnote 5: Here, we employed CRF instead of ME selectors to isolate the effect of feature re-usability.
Figures
6
and
7
show
the
various
learning
curves
.
Here
we
see
that
a
corpus
that
was
produced
with
AL
on
sub1
can
easily
be
re-used
by
a
tester
with
a few
more
features
.
This
is
probably
the
most
realistic
scenario
:
the
core
features
are
kept
and
only
a
few
specific
features
(
e.g.
,
POS
,
a
dictionary
look-up
,
chunk
information
,
etc.
)
are
added
.
When
adding
substantially
more
features
to
the
tester
than
were
available
during
AL
time
,
the
respective
learning
curves
drop
down
towards
the
learning
curve
for
random
selection
.
But
even
with
a
selector
which
has
only
orthographical
features
and
a
tester
with
many
more
features
-
which
is
actually
quite
an
extreme
example
and
a
rather
unrealistic
scenario
for
a
real-world
application
-
AL
is
more
efficient
than
random
selection
.
However
,
the
limits
of
reusability
become apparent
:
on
PBvar
,
the
AL
selection
with
sub3
converges
with
the
random
selection
curve
after
about
100,000
tokens
.
5.3
Findings
with
Real
AL
Annotation
We
currently
perform
AL
entity
mention
annotations
for
an
information
extraction
project
in
the
biomedical
subdomain
of
immunogenetics
.
For
this
purpose
,
sentences
)
as
our
document
pool
of
unlabeled
examples
from
PUBMED
.
By
means
of
random
subsampling
,
only
about
40,000
sentences
are
considered
in
each
round
of
AL
selection
.
To
regularly
monitor
classifier
performance
,
we
also
perform
gold
standard
(
GS
)
annotations
on
250
randomly
chosen
abstracts
(
≈
2,200
sentences
)
.
In
all
our
annotations
of
different
entity
types
so
far
,
we
found
AL
learning
curves
similar
to
the
ones
reported
in
our
simulation
experiments
,
with
classifier
performance
levelling
off
at
around
75
%
-
85
%
f-score
(
depending
on
the
entity
type
)
.
Our
annotations
also
reveal
that
AL
is
especially
beneficial
when
entity
mentions
are
very
sparse
.
Figure
8
shows
the
cumulated
entity
density
on
AL
and
gold
standard
annotations
of
cytokine
receptors
(
specialized
proteins
for
which
we
annotated
six
different
entity
subtypes
)
-
very
sparse
entity
types
with
less
than
one
entity
mention
per
PUBMED
abstract
on
the
average
.
Figure 8: Cumulated Entity Density on AL and GS Annotations of Cytokine Receptors

As can be seen, after 2,000 sentences
the
entity
density
in
our
AL
corpus
is
almost
15
times
higher
than
in
our
GS
corpus
.
Such
a
dense
corpus
may
be
more
appropriate
for
classifier
training
than
a
sparse
one
yielded
by
random
or
sequential
annotations
,
which
may
just
contain
lots
of
negative
training
examples
.
We
have
observed
comparable
effects
with
other
entity
types
,
too
,
and
thus
conclude
that
the
sparser
entity
mentions
of
a
specific
type
are
in
texts
,
the
more
beneficial
AL-based
annotation
is
.
We
report
on
other
aspects
of
AL
for
real
annotation
projects
in
Tomanek
et
al.
(
2007
)
.
6
Discussion
and
Conclusions
We
have
shown
,
for
the
annotation
of
(
named
)
entities
,
that
AL
is
well-suited
to
speed
up
annotation
work
under
realistic
conditions
.
In
our
simulations
we
yielded
gains
(
in
the
number
of
tokens
)
up
to
72
%
.
We
collected
evidence
that
an
average
disagreement
approaching
zero
may
serve
as
an
adaptive
stopping
criterion
for
AL-driven
annotation
and
that
a
corpus
compiled
by
means
of
QBC-based
AL
is
to
a
large
extent
reusable
by
modified
classifiers
.
These
findings
stand
in
contrast
to
those
supplied
by
Baldridge
and
Osborne
(
2004
)
who
focused
on
parse
selection
.
Their
research
indicates
that
AL
on
selectors
with
different
learning
algorithms
and
feature
sets
than those
used
by
the
tester
can
easily
get
worse
than
random
selection
.
They
conclude
that
it
might
not
be
advisable
to
employ
AL
in
environments
where
the
final
classifier
is
not
very
stable
.
Our
evidence
leads
us
to
a
re-assessment
of
AL-based
annotations
.
First
,
we
employed
a
committee-based approach (QBC)
while
Baldridge
and
Osborne
performed
uncertainty
sampling
AL
.
Committee-based
approaches
calculate
the
uncertainty
on
an
example
in
a
more
implicit
way
,
i.e.
,
by
the
disagreement
among
the
committee
's
classifiers
.
With
uncertainty
sampling
,
however
,
the
labeling
uncertainty
of
one
classifier
is
considered
directly
.
In
future
work
we
will
directly
compare
QBC
and
uncertainty
sampling
with
respect
to
data
reusability
.
Second
,
whereas
Baldridge
and
Osborne
employed
AL
on
a
scoring
or
ranking
problem
we
focused
on
classification
problems
.
Further
research
is
needed
to
investigate
whether
the
problem
class
(
classification
with
a
fixed
and
moderate
number
of
classes
vs.
ranking
large
numbers
of
possible
candidates
)
is
responsible
for
limited
data
reusability
.
On
the
basis
of
our
experiments
we
conjecture
that
the
proposed
AL
approach
might
be
applicable
with
comparable
results
to
a
wider
range
of
corpus
annotation
tasks
,
which
otherwise
would
require
substantially
larger
amounts
of
annotation
efforts
.
Acknowledgements
This
research
was
funded
by
the
EC
within
the
BOOTStrep
project
(
FP6-028099
)
,
and
by
the
German
Ministry
of
Education
and
Research
within
the
StemNet
project
(
01DS001A
to
1C
)
.
