In this paper, we analyze the effect of resampling techniques, including under-sampling and over-sampling, used in active learning for word sense disambiguation (WSD). Experimental results show that under-sampling causes negative effects on active learning, but over-sampling is a relatively good choice. To alleviate the within-class imbalance problem of over-sampling, we propose a bootstrap-based over-sampling (BootOS) method that works better than ordinary over-sampling in active learning for WSD. Finally, we investigate when to stop active learning, and adopt two strategies, max-confidence and min-error, as stopping conditions for active learning. According to experimental results, we suggest a prediction solution by considering max-confidence as the upper bound and min-error as the lower bound for stopping conditions.
1 Introduction

Creating a large sense-tagged corpus is very expensive and time-consuming, because these data have to be annotated by human experts. Among the techniques to solve this knowledge bottleneck problem, active learning is a promising way (Lewis and Gale, 1994; McCallum and Nigam, 1998). The purpose of active learning is to minimize the amount of human labeling effort by having the system automatically select the most informative unannotated cases for human annotation.
In real-world data, the distribution of the senses of a word is often very skewed. Some studies have reported that simply selecting the predominant sense yields superior performance when the sense distribution is highly skewed and context is insufficient (Hoste et al., 2001; McCarthy et al., 2004). A data set is imbalanced when at least one of the senses is heavily underrepresented compared to the other senses.
In general, a WSD classifier is designed to optimize overall accuracy without taking into account the imbalanced class distribution of a real-world data set. The result is that a classifier induced from imbalanced data tends to over-fit the predominant class and to ignore small classes (Japkowicz and Stephen, 2002). Recently, much work has addressed the class imbalance problem, reporting that resampling methods such as over-sampling and under-sampling are useful in supervised learning with imbalanced data sets for inducing more effective classifiers (Estabrooks et al., 2004; Zhou and Liu, 2006).
In the general framework of active learning, the learner (i.e., a supervised classifier) is built using supervised learning algorithms. To date, however, no one has studied the effects of over-sampling and under-sampling on active learning methods.

(Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 783-790, Prague, June 2007. © 2007 Association for Computational Linguistics)
In this paper, we study active learning with resampling methods that address the class imbalance problem for WSD. It is noteworthy that neither of these techniques needs to modify the classifier architecture or learning algorithm, making them very easy to use and extend to other domains. Another problem in active learning is knowing when to stop the process. We address this problem in this paper, and discuss how to form the final classifier for use.
This is a problem of estimating classifier effectiveness (Lewis and Gale, 1994). Because it is difficult to know when the classifier reaches maximum effectiveness, previous work used a simple stopping condition: stop when the training set reaches a desired size. In fact, however, it is almost impossible to predefine an appropriate size of training data for inducing the most effective classifier. To solve this problem, we recast the estimation of classifier effectiveness as the task of estimating classifier confidence. This paper adopts two strategies, max-confidence and min-error, and suggests a prediction solution by considering max-confidence as the upper bound and min-error as the lower bound for the stopping conditions.
2 Related Work

The ability of the active learner can be referred to as selective sampling, of which two major schemes exist: uncertainty sampling and committee-based sampling. The former, as proposed for example by Lewis and Gale (1994), uses a single classifier to identify the unlabeled examples on which that classifier is least confident. The latter (McCallum and Nigam, 1998) generates a committee of classifiers (always more than two) and selects the next unlabeled example by the principle of maximal disagreement among these classifiers. With selective sampling, the size of the training data can be significantly reduced for text classification (Lewis and Gale, 1994; McCallum and Nigam, 1998) and word sense disambiguation (Chen et al., 2006). A method similar to committee-based sampling is co-testing, proposed by Muslea et al. (2000), which trains two learners individually on two compatible and uncorrelated views that should be able to reach the same classification accuracy. In practice, however, these conditions of view selection are difficult to meet in real-world word sense disambiguation tasks.
Recently, much work has been done on the class imbalance problem. The best-known approach is resampling, in which some training material is duplicated or removed. Two popular resampling methods exist for addressing the class imbalance problem: over-sampling and under-sampling. The basic idea of resampling methods is to change the training data distribution to make the data more balanced. Resampling works well in supervised learning, but it has not been tested in active learning. Previous work reports that cost-sensitive learning is a good solution to the class imbalance problem (Weiss, 2004). In practice, however, for WSD the costs of the various senses of a disambiguated word are unequal and unknown, and they are difficult to estimate during learning.
In recent years, there have been attempts to apply active learning to word sense disambiguation (Chen et al., 2006). However, to the best of our knowledge, no attempt has been made to consider the class imbalance problem in the process of active learning for WSD tasks.
3 Resampling Methods

3.1 Under-sampling

Under-sampling is a popular method for addressing the class imbalance problem; it changes the training data distribution by removing some examples of the majority class at random. Some previous work reported that under-sampling is effective for learning on large imbalanced data sets (Japkowicz and Stephen, 2002). However, because under-sampling removes potentially useful training samples, it can degrade classifier performance.
One-sided sampling is a method similar to under-sampling, in which redundant and borderline training examples are identified and removed from the training data (Kubat and Matwin, 1997). Kubat and Matwin reported that one-sided sampling is effective for learning with two-class, large, imbalanced data sets. However, the relative computational cost of one-sided sampling in active learning is very high, because the sampling computation must be performed at each learning iteration. Our preliminary experimental results show that, in the multi-class WSD setting, one-sided sampling degrades the performance of active learning. Because of this and its high computational complexity, we use random under-sampling in our comparison experiments instead.
To control the degree of change of the training data distribution, the ratio of examples from the majority class to those of the minority class after removal from the majority class is called the removal rate (Jo and Japkowicz, 2004). If the removal rate is 1.0, under-sampling builds data sets with complete class balance. However, it has been reported that perfect balance is not always optimal (Estabrooks et al., 2004). In our comparison experiments, we set the removal rate for under-sampling to 0.8, since Estabrooks et al. (2004) reported 0.8 as the optimal rate in some cases.
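As a concrete illustration, here is a minimal sketch of random under-sampling with a removal-rate parameter. Reading the rate as the target minority/majority ratio is our assumption (the paper follows Jo and Japkowicz, 2004), and the function name is hypothetical:

```python
import random

def random_undersample(examples, labels, removal_rate=0.8, seed=0):
    """Randomly drop majority-class examples until the minority/majority
    ratio reaches `removal_rate` (a sketch; the paper's exact rate
    definition may differ)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    minority_size = min(len(xs) for xs in by_class.values())
    # target majority size so that minority/majority == removal_rate
    target = max(minority_size, int(round(minority_size / removal_rate)))
    resampled = []
    for y, xs in by_class.items():
        if len(xs) > target:
            xs = rng.sample(xs, target)  # remove at random
        resampled.extend((x, y) for x in xs)
    rng.shuffle(resampled)
    return resampled
```

With a removal rate of 1.0 this yields complete class balance, matching the description above.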
3.2 Over-sampling

Over-sampling is another popular method for addressing the class imbalance problem; it resamples the small class until it contains as many examples as the large one. In contrast to under-sampling, over-sampling adds examples to the minority class, and is accomplished by random sampling with duplication. Because over-sampling makes exact copies of examples, it usually increases the training cost and may lead to overfitting. A recent variant of over-sampling is SMOTE (Chawla et al., 2002), a synthetic minority over-sampling technique. Its authors reported that a combination of SMOTE and under-sampling can achieve better classifier performance in ROC space than under-sampling the majority class alone. In our comparison experiments, we use over-sampling controlled by a resampling rate called the addition rate (Jo and Japkowicz, 2004), which indicates the number of examples that should be added to the minority class. The addition rate for over-sampling is also set to 0.8 in our experiments.
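A minimal sketch of random over-sampling by duplication follows. Reading the addition rate as the target minority size relative to the majority class is our assumption, and the function name is hypothetical:

```python
import random

def random_oversample(examples, labels, addition_rate=0.8, seed=0):
    """Duplicate randomly chosen minority-class examples until each
    minority class reaches `addition_rate` times the majority-class
    size (a sketch; the paper's exact rate definition may differ)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    majority_size = max(len(xs) for xs in by_class.values())
    target = int(round(addition_rate * majority_size))
    resampled = []
    for y, xs in by_class.items():
        # exact copies, drawn at random -- this is what BootOS avoids
        extra = [rng.choice(xs) for _ in range(max(0, target - len(xs)))]
        resampled.extend((x, y) for x in xs + extra)
    rng.shuffle(resampled)
    return resampled
```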
3.3 Bootstrap-based Over-sampling

While over-sampling decreases the between-class imbalance, it increases the within-class imbalance (Jo and Japkowicz, 2004), because random duplication increases the number of exact copies of examples. To alleviate this within-class imbalance problem, we propose a bootstrap-based over-sampling method (BootOS) that uses a bootstrap resampling technique in the process of over-sampling. Bootstrapping, explained below, is a resampling technique similar to jackknifing.
There are two reasons for choosing a bootstrap method as the resampling technique in over-sampling. First, using a bootstrap set avoids exact copies of samples in the minority class. Second, the bootstrap method can smooth the distribution of the training samples (Hamamoto et al., 1997), which alleviates the within-class imbalance problem caused by over-sampling.
To generate the bootstrap set, we use the well-known bootstrap technique proposed by Hamamoto et al. (1997), which does not select samples randomly, giving all samples in the minority class(es) an equal chance to be selected. For each sample xi in the minority class, find its k nearest neighbor samples xj,1, xj,2, ..., xj,k using similarity functions, and compute a bootstrap sample xBi from them.

Figure 1. The BootOS algorithm
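The BootOS idea can be sketched as follows: each synthetic example is the mean of a minority sample and its k nearest neighbors, so no exact duplicates are added. Euclidean distance and simple averaging are assumptions here; the paper uses similarity functions and the Hamamoto et al. (1997) scheme, whose details may differ:

```python
import random

def bootos(minority, k=3, n_new=None, seed=0):
    """Sketch of bootstrap-based over-sampling (BootOS): generate
    `n_new` smoothed samples from the minority class, each being the
    component-wise mean of a sample and its k nearest neighbors."""
    rng = random.Random(seed)
    n_new = n_new if n_new is not None else len(minority)

    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted((m for m in minority if m is not x),
                           key=lambda m: dist(x, m))[:k]
        group = [x] + neighbors
        # bootstrap sample = component-wise mean of the local group
        synthetic.append([sum(vals) / len(group) for vals in zip(*group)])
    return synthetic
```

Because every generated vector is a local average rather than a copy, the within-class distribution is smoothed instead of being concentrated on duplicated points.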
4 Active Learning with Resampling

In this work, we are interested in selective sampling for pool-based active learning, and focus on uncertainty sampling (Lewis and Gale, 1994). The key point is how to measure the uncertainty of an unlabeled example, and to select the example with maximum uncertainty to augment the training data. Maximum uncertainty implies that the current classifier has the least confidence in its classification of that example.
The well-known entropy is a good uncertainty measurement and is widely used:

U(x) = H(p) = -Σj p(sj|wi) log p(sj|wi)

where U is the uncertainty measurement function and H represents the entropy function. In the WSD task, p(sj|wi) is the predicted probability of sense sj output by the current classifier, given a sample i containing the disambiguated word wi.
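The entropy-based uncertainty measure can be computed directly from the classifier's predicted sense distribution; a minimal sketch:

```python
import math

def entropy_uncertainty(sense_probs):
    """U = H(p) = -sum_j p_j * log(p_j) over the predicted sense
    distribution; higher values mean the current classifier is less
    confident about this example."""
    return -sum(p * math.log(p) for p in sense_probs if p > 0)
```

A uniform distribution over the senses gives the maximum uncertainty, while a one-hot distribution gives zero.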
Algorithm Active-Learning-with-Resampling(L, U, m)
Input: L, an initial small training data set; U, the pool of unlabeled examples
Output: labeled training data set L

1. Resample L to generate a new training data set L* using a resampling technique such as under-sampling, over-sampling, or BootOS, and then use L* to train the initial classifier.
2. Loop while adding new instances into L:
   a. Use the current classifier to probabilistically label all unlabeled examples in U.
   b. Based on the active learning rules, present the m top-ranked examples to the oracle for labeling.
   c. Augment L with the m new examples, and remove them from U.
   d. Resample L to generate a new training data set L* using a resampling technique such as under-sampling, over-sampling, or BootOS, and use L* to retrain the current classifier.
Until the predefined stopping condition is met.

Figure 2. Active learning with resampling
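The loop in Figure 2 can be sketched as follows; all of the callables (train, uncertainty, oracle, resample, stop) are placeholders supplied by the caller, not APIs from the paper:

```python
def active_learning_with_resampling(L, U, m, train, uncertainty, oracle,
                                    resample, stop):
    """Skeleton of Figure 2: train on the resampled labeled set, query
    the m most uncertain pool examples, label them via the oracle, and
    retrain until the stopping condition is met."""
    clf = train(resample(L))
    while U and not stop(clf, L, U):
        # rank the pool by uncertainty and query the m most uncertain
        ranked = sorted(U, key=lambda x: uncertainty(clf, x), reverse=True)
        queried = ranked[:m]
        L = L + [(x, oracle(x)) for x in queried]
        U = [x for x in U if x not in queried]
        clf = train(resample(L))  # step 2(d): retrain on resampled L*
    return L, clf
```

Passing the identity function as `resample` recovers what the text calls ordinary active learning.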
In steps 1 and 2(d) in Fig. 2, if we do not generate L* and L is used directly to train the current classifier, we call it ordinary active learning. We used the entropy-based uncertainty measurement for all active learning frameworks in our comparison experiments.
Our active learning with resampling is actually a heterogeneous approach, in which the classifier used to select new instances differs from the resulting classifier (Lewis and Catlett, 1994).
We utilize a maximum entropy (ME) model (Berger et al., 1996) as the basic classifier used in active learning for WSD. The advantage of the ME model is its ability to freely incorporate features from diverse sources into a single, well-grounded statistical model. A publicly available ME toolkit (Zhang et al., 2004) was used in our experiments.
To extract the linguistic features needed by the ME model, all sentences containing the target word were automatically part-of-speech (POS) tagged using the Brill POS tagger (Brill, 1992). Three knowledge sources were used to capture contextual information: unordered single words in the topical context, the POS of neighboring words with position information, and local collocations. These are the same as three of the four knowledge sources used by Lee and Ng (2002); their fourth knowledge source (syntactic relations) was not used in our work.
5 Stopping Conditions

In an active learning algorithm, defining the stopping condition is a critical problem, because it is almost impossible for the human annotator to label all unlabeled samples. This is a problem of estimating classifier effectiveness (Lewis and Gale, 1994). In fact, it is difficult to know when the classifier reaches maximum effectiveness. In previous work, some researchers used a simple stopping condition: stop when the training set reaches a predefined desired size. However, it is almost impossible to predefine an appropriate size of training data for inducing the most effective classifier.
To solve this problem, we recast the estimation of classifier effectiveness as the estimation of the classifier's confidence on the remaining unlabeled samples. Concretely, if the current classifier already has acceptably strong confidence in its classification of all remaining unlabeled data, we assume that the current training data is sufficient to train the classifier to maximum effectiveness. In other words, if a classifier induced from the current training data has strong classification confidence on an unlabeled example, we can consider that example redundant.
Based on the above analysis, we adopt two stopping conditions for active learning:

• Max-confidence: This strategy is based on the uncertainty measurement; it checks whether the entropy of each selected unlabeled example is less than a very small predefined threshold close to zero, such as 0.001.

• Min-error: This strategy is based on feedback from the oracle when the active learner asks for the true labels of selected unlabeled examples; it checks whether the current trained classifier can correctly predict those labels, or whether the prediction accuracy on the selected unlabeled examples already exceeds a predefined accuracy threshold.
Once the max-confidence or min-error condition is met, the current classifier is assumed to have sufficiently strong confidence in its classification of all remaining unlabeled data.
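The two stopping conditions reduce to simple checks on the examples the learner just selected. The threshold values follow the examples in the text (0.001 and 0.9), and the function names are hypothetical:

```python
def max_confidence_met(entropies, threshold=0.001):
    """Max-confidence: every selected unlabeled example's entropy is
    below a near-zero threshold."""
    return all(h < threshold for h in entropies)

def min_error_met(predicted, true_labels, accuracy_threshold=0.9):
    """Min-error: the classifier's accuracy on the examples just sent
    to the oracle already exceeds the threshold (uses the oracle's
    labels, so it costs no extra computation)."""
    correct = sum(p == t for p, t in zip(predicted, true_labels))
    return correct / len(true_labels) >= accuracy_threshold
```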
6 Evaluation

The data used in our comparison experiments were developed as part of the OntoNotes project (Hovy et al., 2006), which uses the WSJ part of the Penn Treebank (Marcus et al., 1993).
The senses of nouns occurring in OntoNotes are linked to the Omega ontology. In OntoNotes, at least two humans manually annotate the coarse-grained senses of selected nouns and verbs in their natural sentence context. To date, OntoNotes has annotated several tens of thousands of examples, covering several hundred nouns and verbs, with an inter-annotator agreement rate of at least 90%.
The 38 randomly chosen ambiguous nouns used in all following experiments are shown in Table 1. It is apparent that the sense distributions of most nouns are very skewed (frequencies are shown in the sense distributions).

[Table 1 (fragment): president, director, management, activity, building, development, ...]
In the following active learning comparison experiments, we tested five methods: random sampling (Random), uncertainty sampling (Ordinary), under-sampling, over-sampling, and BootOS.
The 1-NN technique was used for the bootstrap-based resampling of BootOS in our experiments. A 5 by 5-fold cross-validation was performed on each noun's data: for each round of active learning, we used 20% randomly chosen data for held-out evaluation and the other 80% as the pool of unlabeled data.
For all words, we started with a randomly chosen initial training set of 10 examples, and we made 10 queries after each learning iteration.
In the evaluation, average accuracy and recall are used as the performance measures for each active learning method. Note that macro-averaging is adopted for recall evaluation in each noun's WSD task. The accuracy measure indicates the percentage of test instances correctly identified by the system, while the macro-average recall measure indicates how well the system performs on each sense.
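Macro-average recall can be sketched as follows; unlike accuracy, each sense contributes equally regardless of its frequency, so a classifier that ignores infrequent senses is penalized:

```python
from collections import defaultdict

def macro_average_recall(true_labels, predicted):
    """Compute recall per sense, then average the per-sense recalls
    with equal weight, so infrequent senses count as much as the
    predominant one."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(true_labels, predicted):
        total[t] += 1
        if p == t:
            correct[t] += 1
    return sum(correct[s] / total[s] for s in total) / len(total)
```

For example, always predicting the predominant sense on a 3:1 data set gives 75% accuracy but only 50% macro-average recall.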
Experiment 1: Performance comparison on active learning

Figure 3. Average accuracy performance comparison (active learning for WSD)

Figure 4. Average recall performance comparison
As shown in Fig. 3 and Fig. 4, when the number of learned samples per noun is smaller than 120, BootOS has the best performance, followed by over-sampling and the ordinary method. As the number of learned samples increases, the ordinary, over-sampling, and BootOS methods achieve similar accuracy and recall. Our experiments also show that random sampling is the worst on both accuracy and recall.
Previous work (Estabrooks et al., 2004) reported that under-sampling of the majority class (the predominant sense) is a good means of increasing the sensitivity of a classifier to the minority class (infrequent senses). In our active learning experiments, however, under-sampling is clearly worse than the ordinary method, over-sampling, and our BootOS. The reason is that in highly imbalanced data, under-sampling discards too many useful training samples of the majority class, degrading the performance of active learning.
Experiment 2: Effectiveness of learning instances for infrequent senses

It is important to enrich the corpora by learning more instances of infrequent senses through active learning with less human labeling. This procedure not only makes the corpora 'richer', but also alleviates the domain dependence problem faced by corpus-based supervised approaches to WSD.
The objective of this experiment is to evaluate the performance of active learning in learning samples of infrequent senses from an unlabeled corpus. Because of the highly skewed word sense distributions in our data set, we consider all senses other than the predominant sense as infrequent senses in this experiment.
[Figure: Active learning for WSD; x-axis: number of learned samples]

Figure 5. Comparison experiments on learning instances for infrequent senses
Fig. 5 shows that random sampling is the worst at learning instances of infrequent senses. The reason is obvious: the sense distribution of the sample set learned by random sampling is almost identical to that of the original data set. Under-sampling is clearly worse than ordinary active learning, over-sampling, and BootOS.
When the number of learned samples per noun is smaller than 80, BootOS achieves slightly better performance than ordinary active learning and over-sampling. When the number of learned samples is between 80 and 160, these three methods exhibit similar performance. As the number of iterations increases, ordinary active learning becomes slightly better than over-sampling and BootOS. In fact, after the 16th iteration (10 samples chosen in each iteration), the results indicate that most instances of infrequent senses have been learned.
Experiment 3: Effectiveness of stopping conditions for active learning

To evaluate the effectiveness of the two strategies, max-confidence and min-error, as stopping conditions for active learning, we first construct an ideal stopping condition: the point at which the classifier first reaches its highest accuracy during the active learning procedure.
When the ideal stopping condition is met, the current classifier has reached maximum effectiveness. In practice, it is impossible to know exactly when the ideal stopping condition is met before all unlabeled data are labeled by a human annotator.
We use this ideal method only in our comparison experiments, to analyze the effectiveness of our two proposed stopping conditions. For generality, we use ordinary active learning as the basic system and evaluate the effectiveness of the three stopping conditions.
In the following experiments, the entropy threshold used in the max-confidence strategy is set to 0.001, and the accuracy threshold used in the min-error strategy is set to 0.9.
In Table 2, the column "Size" shows the size of the unlabeled data set of the corresponding noun used in active learning. There are two columns for each stopping condition: the left column "num" gives the number of learned instances, and the right column "%" gives its percentage over all data when the corresponding stopping condition is met.
[Table 2 (fragment): max-confidence and min-error results for management, position, administration, development, strategy, president, director, activity, building]

Table 2. Effectiveness of three stopping conditions
As shown in Table 2, the min-error strategy, which is based on feedback from the human annotator, is very close to the ideal method. Compared to the ideal stopping condition, therefore, the min-error strategy is a good choice of stopping condition for active learning.
It is important to note that the min-error method incurs no additional computational cost; it depends only on the feedback of the human annotator when labeling the chosen unlabeled samples.
From the experimental results, we can see that the max-confidence strategy is worse than the min-error method. However, we believe that the entropy of each unlabeled sample is a good signal for stopping active learning. We therefore suggest that a good prediction solution may be to use the min-error strategy as the lower bound of the stopping condition and the max-confidence strategy as the upper bound for active learning.
7 Discussion

As discussed above, finding more instances of infrequent senses at the earlier stages of active learning is very significant for making the corpus richer, meaning less effort for human labeling.
In practice, another way to learn more instances of infrequent senses is to first build a training data set by active learning or by human effort, and then build a supervised classifier to find more instances of infrequent senses. However, it would be interesting to know how much initial training data is enough for this task, and how much human labeling effort could be saved.
From the experimental results, we found that among the unlabeled instances chosen by the active learner, some are informative samples helpful for improving classification performance, while others are borderline samples, which are unreliable because even a small amount of noise can push them to the wrong side of the decision boundary. Removing these borderline samples might improve the performance of active learning.
The proposed prediction solution based on the max-confidence and min-error strategies is a coarse framework. To predict when to stop the active learning procedure, it is logical to use changes in the classifier's accuracy as a signal to stop the learning iteration. In other words, within the range predicted by the proposed solution, if the change in the accuracy of the learner (classifier) is very small, we can assume that the current classifier has reached maximum effectiveness.
8 Conclusion and Future Work

In this paper, we consider the class imbalance problem in WSD tasks, and analyze the effect of resampling techniques, including over-sampling and under-sampling, in active learning.
Experimental results show that over-sampling is a relatively good choice in active learning for WSD with highly imbalanced data, while under-sampling has a negative effect on active learning. A new over-sampling method, BootOS, based on a bootstrap technique, is proposed to alleviate the within-class imbalance problem of over-sampling; it works better than ordinary over-sampling in active learning for WSD.
It is noteworthy that none of these techniques requires modifying the architecture or learning algorithm; they are therefore very easy to use and extend to other applications.
To predict when to stop active learning, we adopt two strategies, max-confidence and min-error, as stopping conditions. Based on our experimental results, we suggest a prediction solution that uses max-confidence as the upper bound and min-error as the lower bound of the stopping conditions for active learning.
In future work, we will study how to identify borderline samples exactly, so that they are not selected first in the active learning procedure. Borderline samples have higher entropy values, meaning the current classifier is least confident about them; they can be detected using the concept of Tomek links (Tomek, 1976).
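Under the usual definition, two examples of different classes form a Tomek link if each is the other's nearest neighbor; a minimal sketch, with Euclidean distance and list-of-floats features as assumptions:

```python
def tomek_links(points, labels):
    """Return index pairs (i, j) forming Tomek links: mutual nearest
    neighbors with different class labels; such pairs are either
    borderline or noisy examples."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))

    links = []
    for i in range(len(points)):
        j = nearest(i)
        if labels[i] != labels[j] and nearest(j) == i and i < j:
            links.append((i, j))
    return links
```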
It is also worth studying cost-sensitive learning for active learning with imbalanced data, and applying such techniques to WSD.
